designates my notes. / designates important.
Overall a great book. Would recommend to anyone interested in learning about machine learning.
It is basically 3 books in 1, separated, unsurprisingly, into beginner, intermediate, and advanced modules.
The detailed table of contents wins points with me every time.
The first book holds your hand and offers a very nice, slow place to start. Compared to the other pair of beginner books I’ve read on the subject, this one was far superior.
All of the code works, which I can’t say about the other books, and there is often a line-by-line explanation following each snippet.
While there isn’t much in the way of real mathematical proofs, there is still plenty of math showing what the underlying algorithms look like and what they do. Any understanding helps over simply using sklearn blindly. That said, once you have some intuitive sense of what to expect from the various algorithms and libraries, you will be better equipped to tackle a more detailed and abstract exploration of the mathematical underpinnings.
The second book goes back over the same ideas as the first, but in more detail and with added depth. Great reinforcement of what you have already learned. Practice, practice, practice!
Book three is considerably more advanced. It assumes you have the material from books one and two down pat. It uses the Theano and Keras libraries, which I didn’t have installed, so I mostly didn’t play with the code; the little bit I did experiment with had numerous errors. This module included more advanced topics than I was ready for, but it was interesting nonetheless to expand my vocabulary for now.
Finally, there are tons of references to follow up and expand your understanding of whatever topic may suit your fancy. Again, all in all a great place to start if combined with another more mathematically oriented book.
“I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail” (Abraham Maslow, 1966) really, Maslow said this? note, I can’t get away from these assholes!
I would like a more detailed explanation of what is going on here, but since it is matplotlib that I am not understanding, the lack of explanation can be forgiven. It plots the colored decision regions of the plot; I’m not sure what the meshgrid function is doing.
# plot the decision surface
# pad the plotting range by 1 unit on each side of the two features
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
# np.meshgrid turns the two 1-D ranges into two 2-D coordinate arrays that
# together enumerate every (x1, x2) grid point at the given resolution
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                       np.arange(x2_min, x2_max, resolution))
# classify every grid point, then reshape the flat predictions back onto the grid
Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
Z = Z.reshape(xx1.shape)
# draw filled contours: one colored region per predicted class
plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
We’ve added some additional notes to the code notebooks mentioning the offline datasets in case there are server errors. https://www.dropbox.com/sh/tq2qdh0oqfgsktq/AADIt7esnbiWLOQODn5q_7Dta?dl=0
The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Python-Deeper-Insights-into-Machine-Learning
GitHub repository: MasteringMLWithPython/Chapter3/SdA.py
Neural network theory can be quite complex, thus I want to recommend two additional resources that cover some of the concepts that we discuss in this chapter in more detail:
T. Hastie, J. Friedman, and R. Tibshirani. The Elements of Statistical Learning, Volume 2. Springer, 2009.
C. M. Bishop et al. Pattern Recognition and Machine Learning, Volume 1. Springer New York, 2006.
Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Yoshua Bengio’s book is currently freely available at http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf.
Victor Powell and Lewis Lehe provide a fantastic interactive, visual explanation of PCA at http://setosa.io/ev/principal-component-analysis/; it is ideal for readers who are new to the core concepts of PCA or who are not quite getting it.
For a lengthier and more mathematically-involved treatment of PCA, touching on underlying matrix transformations, Jonathon Shlens from Google research provides a clear and thorough explanation at http://arxiv.org/abs/1404.1100.
For a thorough worked example that translates Jonathon’s description into clear Python code, consider Sebastian Raschka’s demonstration using the Iris dataset at http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.
A solid introduction is provided by Kevin Gurney in An Introduction to Neural Networks.
One good option for an unfamiliar reader is the notes from Andrej Karpathy’s course: http://cs231n.github.io/convolutional-networks/.
A solid place to start understanding semi-supervised learning methods is Xiaojin Zhu’s very thorough literature survey, available at http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.
I also recommend a tutorial by the same author, available in the slide format at http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf.
For readers interested in Bayesian statistics, Allen Downey’s book, Think Bayes, is a marvelous introduction (and one of my all-time favorite statistics books): https://www.google.co.uk/#q=think+bayes.
There are many good resources for understanding NLP tasks. One fairly thorough, eight-part piece is available online at http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk.
If you’re keen to get started, one great option is to try Kaggle’s for Knowledge NLP task, which is perfectly suited as a testbed for the techniques described in this chapter: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words.
My suggested go-to introduction to feature selection is Ando Saabas’s four-part exploration of a broad range of feature selection techniques. It’s full of Python code snippets and informed commentary. Get started at http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/.
For readers with an interest in hyperparameter optimization, I recommend that you read Alice Zheng’s posts on Turi’s blog as a great place to start: http://blog.turi.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning.
I also find the scikit-learn documentation to be a useful reference for grid search specifically: http://scikit-learn.org/stable/modules/grid_search.html.
Perhaps the most wide-ranging and informative tour of ensembles and ensemble types is provided by the Kaggle competitor, Triskelion, at http://mlwave.com/kaggle-ensembling-guide/.
For a walkthrough on applying random forest ensembles to commercial contexts, with plenty of space given to all-important diagnostic charts and reasoning, consider Arshavir Blackwell’s blog at https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/.
The Lasagne User Guide is thorough and worth reading. Find it at http://lasagne.readthedocs.io/en/latest/index.html.
Similarly, find the TensorFlow tutorials at https://www.tensorflow.org/versions/r0.9/get_started/index.html.
“For data to become information, it requires some meaningful structure."
Later chapters cover hyperparameter optimization techniques that help us fine-tune the performance of our model. Intuitively, we can think of those hyperparameters as parameters that are not learned from the data but represent the knobs of a model that we can turn to improve its performance.
To obtain accurate results via stochastic gradient descent, it is important to present it with data in a random order, which is why we want to shuffle the training set for every epoch to prevent cycles.
The learning rate, eta, may be made to decrease over time in stochastic gradient descent.
Another advantage of stochastic gradient descent is that we can use it for online learning. In online learning, our model is trained on-the-fly as new training data arrives.
A compromise between batch gradient descent and stochastic gradient descent is the so-called mini-batch learning. Mini-batch learning can be understood as applying batch gradient descent to smaller subsets of the training data—for example, 50 samples at a time. The advantage over batch gradient descent is that convergence is reached faster via mini-batches because of the more frequent weight updates. Furthermore, mini-batch learning allows us to replace the for-loop over the training samples in Stochastic Gradient Descent (SGD) by vectorized operations, which can further improve the computational efficiency of our learning algorithm.
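To make the mini-batch idea concrete, here is a minimal sketch (my own toy example, not from the book) of vectorized mini-batch updates for a simple linear model with squared-error loss; the synthetic data, learning rate eta, and batch size of 50 are arbitrary placeholders:
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(200, 2)                      # toy feature matrix
y = X @ np.array([2.0, -1.0]) + 0.5        # toy linear target
w, b, eta, batch_size = np.zeros(2), 0.0, 0.01, 50

for epoch in range(20):
    idx = rng.permutation(len(y))          # shuffle every epoch to prevent cycles
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        errors = y[batch] - (X[batch] @ w + b)
        w += eta * X[batch].T @ errors     # one vectorized update per mini-batch
        b += eta * errors.sum()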
1. Selection of features.
2. Choosing a performance metric.
3. Choosing a classifier and optimization algorithm.
4. Evaluating the performance of the model.
5. Tuning the algorithm.
accuracy = 1 - misclassification error. For example, if the misclassification error = 0.089, then the accuracy = 1 - 0.089 = 0.911, or 91.1 percent.
The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights. The most common form of regularization is the so-called L2 regularization (sometimes also called L2 shrinkage or weight decay)
Regularization is another reason why feature scaling such as standardization is important. For regularization to work properly, we need to ensure that all our features are on comparable scales.
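A hedged illustration of both points using scikit-learn's LogisticRegression, whose C parameter is the inverse regularization strength (smaller C means a stronger L2 penalty); the Iris data and the C value are just placeholders:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)   # put all features on comparable scales
lr = LogisticRegression(penalty='l2', C=100.0)      # smaller C = stronger L2 penalty on the weights
lr.fit(X_std, iris.target)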
In practical classification tasks, linear logistic regression and linear SVMs often yield very similar results. Logistic regression tries to maximize the conditional likelihoods of the training data, which makes it more prone to outliers than SVMs. The SVMs mostly care about the points that are closest to the decision boundary (support vectors). On the other hand, logistic regression has the advantage that it is a simpler model that can be implemented more easily. Furthermore, logistic regression models can be easily updated, which is attractive when working with streaming data.
sometimes our datasets are too large to fit into computer memory. Thus, scikit-learn also offers alternative implementations via the SGDClassifier class, which also supports online learning via the partial_fit method. The concept behind the SGDClassifier class is similar to the stochastic gradient algorithm
>>> from sklearn.linear_model import SGDClassifier
>>> ppn = SGDClassifier(loss='perceptron')
>>> lr = SGDClassifier(loss='log')
>>> svm = SGDClassifier(loss='hinge')
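A sketch of how partial_fit might be used for online learning, assuming data arrives in chunks; the synthetic stream below is purely illustrative, and loss='log' matches the book-era code above (newer scikit-learn versions call it 'log_loss'):
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(loss='log')                  # logistic regression fitted by SGD
classes = np.array([0, 1])                       # all class labels must be declared on the first call
for _ in range(10):                              # pretend each iteration is a newly arrived chunk
    X_chunk = rng.randn(100, 4)
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)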
One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian kernel
information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities—the lower the impurity of the child nodes, the larger the information gain. However, for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees. This means that each parent node is split into two child nodes
the three impurity measures or splitting criteria that are commonly used in binary decision trees are Gini impurity (I_G), entropy (I_H), and the classification error (I_E).
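For concreteness, a small sketch (my own helper functions, not the book's code) of the three impurity measures for a binary node, written in terms of p, the proportion of samples that belong to class 1:
import numpy as np

def gini(p):
    # I_G = 1 - sum(p_i^2) over the two classes
    return 1.0 - (p**2 + (1 - p)**2)

def entropy(p):
    # I_H = -sum(p_i * log2(p_i)); defined as 0 when p is 0 or 1
    if p in (0, 1):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def classification_error(p):
    # I_E = 1 - max(p, 1 - p)
    return 1.0 - max(p, 1 - p)

print(gini(0.5), entropy(0.5), classification_error(0.5))   # 0.5 1.0 0.5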
Parametric versus nonparametric models:
Machine learning algorithms can be grouped into parametric and nonparametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, nonparametric models can’t be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM.
KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.
The main advantage of such a memory-based approach is that the classifier immediately adapts as we collect new training data. However, the downside is that the computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario
Furthermore, we can’t discard training samples since no training step is involved. Thus, storage space can become a challenge if we are working with large datasets.
The curse of dimensionality = It is important to mention that KNN is very susceptible to overfitting due to the curse of dimensionality. The curse of dimensionality describes the phenomenon where the feature space becomes increasingly sparse for an increasing number of dimensions of a fixed-size training dataset. Intuitively, we can think of even the closest neighbors being too far away in a high-dimensional space to give a good estimate.
in models where regularization is not applicable such as decision trees and KNN, we can use feature selection and dimensionality reduction techniques to help us avoid the curse of dimensionality.
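A minimal KNN sketch with scikit-learn, using standardized Iris features; n_neighbors=5 and the Euclidean metric (p=2, Minkowski) are placeholder choices:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)    # KNN is distance-based, so scaling matters
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_std, iris.target)                          # "training" is just memorizing the dataset
print(knn.score(X_std, iris.target))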
>>> df.values
array([[ 1., 2., 3., 4.],
[ 5., 6., nan, 8.],
[10., 11., 12., nan]])
>>> df.dropna()
A B C D
0 1 2 3 4
>>> df.dropna(axis=1)
A B
0 1 2
1 5 6
2 10 11
# only drop rows where all columns are NaN
>>> df.dropna(how='all')
# drop rows that have not at least 4 non-NaN values
>>> df.dropna(thresh=4)
# only drop rows where NaN appear in specific columns (here: 'C')
>>> df.dropna(subset=['C'])
One of the most common interpolation techniques is mean imputation, where we simply replace the missing value by the mean value of the entire feature column. A convenient way to achieve this is by using the Imputer class from scikit-learn,
The Imputer class belongs to the so-called transformer classes in scikit-learn that are used for data transformation. The two essential methods of those estimators are fit and transform. The fit method is used to learn the parameters from the training data, and the transform method uses those parameters to transform the data. Any data array that is to be transformed needs to have the same number of features as the data array that was used to fit the model.
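A hedged sketch of mean imputation with the book-era Imputer API (newer scikit-learn versions replace it with sklearn.impute.SimpleImputer); the array mirrors the NaN example above:
import numpy as np
from sklearn.preprocessing import Imputer    # newer scikit-learn: from sklearn.impute import SimpleImputer

X = np.array([[1., 2., 3., 4.],
              [5., 6., np.nan, 8.],
              [10., 11., 12., np.nan]])
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)  # column-wise (feature-wise) means
imr.fit(X)                       # learn the column means from the data
imputed_data = imr.transform(X)  # each NaN is replaced by its column's mean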
Now, there are two common approaches to bringing different features onto the same scale: normalization and standardization.
Normalization refers to rescaling the features to a range of [0, 1], which is a special case of min-max scaling: x_norm = (x - x_min) / (x_max - x_min).
Using standardization, we center the feature columns at mean 0 with standard deviation 1 so that the feature columns take the form of a normal distribution, which makes it easier to learn the weights: x_std = (x - μ_x) / σ_x. Here, μ_x is the sample mean of a particular feature column and σ_x is the corresponding standard deviation. Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them, in contrast to min-max scaling.
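As a quick illustration (my own snippet), scikit-learn's MinMaxScaler and StandardScaler implement exactly these two operations; the tiny array is a placeholder:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm = MinMaxScaler().fit_transform(X)   # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # each column centered at 0 with standard deviation 1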
However, as far as interpretability is concerned, the random forest technique comes with an important gotcha that is worth mentioning. For instance, if two or more features are highly correlated, one feature may be ranked very highly while the information of the other feature(s) may not be fully captured. On the other hand, we don’t need to be concerned about this problem if we are merely interested in the predictive performance of a model rather than the interpretation of feature importances.
scikit-learn also implements a transform method that selects features based on a user-specified threshold after model fitting, which is useful if we want to use the RandomForestClassifier as a feature selector and intermediate step in a scikit-learn pipeline
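The book-era call was a transform method on the fitted forest itself; in current scikit-learn the equivalent is SelectFromModel. A hedged sketch, with the Iris data and the 0.15 threshold chosen arbitrarily:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)
print(forest.feature_importances_)                     # one importance score per feature
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
X_selected = sfm.transform(iris.data)                  # keeps only features above the threshold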
Feature extraction is typically used to improve computational efficiency but can also help to reduce the curse of dimensionality—especially if we are working with nonregularized models.
Principal component analysis (PCA) is an unsupervised linear transformation technique that is widely used across different fields, most prominently for dimensionality reduction.
PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one.
Linear Discriminant Analysis (LDA) can be used as a technique for feature extraction to increase the computational efficiency and reduce the degree of over-fitting due to the curse of dimensionality in nonregularized models.
Both LDA and PCA are linear transformation techniques that can be used to reduce the number of dimensions in a dataset; PCA is an unsupervised algorithm, whereas LDA is supervised (it makes use of the class labels).
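A side-by-side sketch on the Iris data; note that LDA's fit_transform takes the class labels while PCA's does not. The LinearDiscriminantAnalysis import reflects current scikit-learn (the book-era module was sklearn.lda), and n_components=2 is a placeholder:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
X_pca = PCA(n_components=2).fit_transform(X_std)                # unsupervised: ignores the labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, iris.target)  # supervised: uses the labels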
In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k −1 folds are used for the model training and one fold is used for testing. This procedure is repeated k times so that we obtain k models and performance estimates.
In case you are not familiar with the terms sampling with and without replacement, let’s walk through a simple thought experiment. Let’s assume we are playing a lottery game where we randomly draw numbers from an urn. We start with an urn that holds five unique numbers 0, 1, 2, 3, and 4, and we draw exactly one number each turn. In the first round, the chance of drawing a particular number from the urn would be 1/5. Now, in sampling without replacement, we do not put the number back into the urn after each turn. Consequently, the probability of drawing a particular number from the set of remaining numbers in the next round depends on the previous round. For example, if we have a remaining set of numbers 0, 1, 2, and 4, the chance of drawing number 0 would become 1/4 in the next turn.
However, in random sampling with replacement, we always return the drawn number to the urn so that the probability of drawing a particular number at each turn does not change; we can draw the same number more than once. In other words, in sampling with replacement, the samples (numbers) are independent and have zero covariance.
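Returning to k-fold cross-validation, here is a minimal sketch with cross_val_score (imported from sklearn.model_selection in current scikit-learn; the book-era module was sklearn.cross_validation). The LogisticRegression estimator and cv=10 are placeholders:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
scores = cross_val_score(estimator=LogisticRegression(),
                         X=iris.data, y=iris.target, cv=10)
print(scores.mean(), scores.std())   # average accuracy across the 10 folds and its spread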
In machine learning, we have two types of parameters: those that are learned from the training data, for example, the weights in logistic regression, and the parameters of a learning algorithm that are optimized separately. The latter are the tuning parameters, also called hyperparameters, of a model, for example, the regularization parameter in logistic regression or the depth parameter of a decision tree.
Grid search is a hyperparameter optimization technique that can further help to improve the performance of a model by finding the optimal combination of hyperparameter values.
Precision (PRE) and recall (REC) are performance metrics that are related to the true positive and true negative rates, and in fact, recall is synonymous with the true positive rate.
In practice, often a combination of precision and recall is used, the so-called F1-score:
>>> from sklearn.metrics import make_scorer, f1_score
>>> scorer = make_scorer(f1_score, pos_label=0)
>>> gs = GridSearchCV(estimator=pipe_svc,
... param_grid=param_grid,
... scoring=scorer,
... cv=10)
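The snippet above assumes pipe_svc and param_grid were defined earlier in the book; a hedged sketch of what they might look like (the pipeline steps and parameter ranges here are made up for illustration):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.01, 0.1, 1.0, 10.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
scorer = make_scorer(f1_score, pos_label=0)
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring=scorer, cv=10)
# gs.fit(X_train, y_train); then inspect gs.best_params_ and gs.best_score_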
Receiver operator characteristic (ROC) graphs are useful tools for selecting models for classification based on their performance with respect to the false positive and true positive rates, which are computed by shifting the decision threshold of the classifier. The diagonal of an ROC graph can be interpreted as random guessing, and classification models that fall below the diagonal are considered as worse than random guessing. A perfect classifier would fall into the top-left corner of the graph with a true positive rate of 1 and a false positive rate of 0. Based on the ROC curve, we can then compute the so-called area under the curve (AUC) to characterize the performance of a classification model
ROC AUC and accuracy metrics mostly agree with each other
The scoring metrics that we discussed in this section are specific to binary classification systems. However, scikit-learn also implements macro and micro averaging methods to extend those scoring metrics to multiclass problems via One vs. All (OvA) classification. The micro-average is calculated from the individual true positives, true negatives, false positives, and false negatives of the system. For example, the micro-average of the precision score in a k-class system can be calculated as PRE_micro = (TP_1 + ... + TP_k) / (TP_1 + ... + TP_k + FP_1 + ... + FP_k).
Micro-averaging is useful if we want to weight each instance or prediction equally, whereas macro-averaging weights all classes equally to evaluate the overall performance of a classifier with regard to the most frequent class labels.
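A quick illustration (my own toy labels) using scikit-learn's precision_score with the average parameter:
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
print(precision_score(y_true, y_pred, average='micro'))  # pools all TP/FP counts across classes
print(precision_score(y_true, y_pred, average='macro'))  # unweighted mean of per-class precision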
instead of using the same training set to fit the individual classifiers in the ensemble, we draw bootstrap samples (random samples with replacement) from the initial training set, which is why bagging is also known as bootstrap aggregating.
random forests are a special case of bagging where we also use random feature subsets to fit the individual decision trees.
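A hedged bagging sketch with scikit-learn's BaggingClassifier wrapped around an unpruned decision tree; newer scikit-learn versions rename base_estimator to estimator, and the hyperparameter values are placeholders:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=None, random_state=1)   # unpruned, high-variance base learner
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=500,
                        max_samples=1.0,          # each bootstrap sample is as large as the training set
                        bootstrap=True,           # sample with replacement
                        random_state=1)
bag.fit(iris.data, iris.target)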
1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree to train a third weak learner C3.
4. Combine the weak learners C1, C2, and C3 via majority voting.
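The steps above describe the original boosting procedure; AdaBoost, the variant scikit-learn implements, reweights the full training set rather than resampling it, but the spirit is similar. A hedged sketch (the stump depth, n_estimators, and learning_rate are placeholders, and newer scikit-learn versions rename base_estimator to estimator):
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
stump = DecisionTreeClassifier(max_depth=1)      # a deliberately weak, high-bias base learner
ada = AdaBoostClassifier(base_estimator=stump,
                         n_estimators=500,
                         learning_rate=0.1,
                         random_state=1)
ada.fit(iris.data, iris.target)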
1. We create a vocabulary of unique tokens—for example, words—from the entire
set of documents.
2. We construct a feature vector from each document that contains the counts
of how often each word occurs in the particular document.
The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model—each item or token in the vocabulary represents a single word. More generally, contiguous sequences of items in NLP—words, letters, or symbols—are also called n-grams. The choice of the number n in the n-gram model depends on the particular application; for example, a study by Kanaris et al. revealed that n-grams of size 3 and 4 yield good performance in anti-spam filtering of e-mail messages.
raw term frequencies: tf(t,d) — the number of times a term t occurs in a document d.
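A minimal sketch tying these pieces together with scikit-learn's CountVectorizer; the three toy documents are my own, and setting ngram_range=(2, 2) would switch to a 2-gram model:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
count = CountVectorizer(ngram_range=(1, 1))   # unigrams; (2, 2) would build a 2-gram model
bag = count.fit_transform(docs)               # step 1: build vocabulary; step 2: count term frequencies
print(count.vocabulary_)                      # token -> column index
print(bag.toarray())                          # tf(t, d) for each document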
Linear regression models can be heavily impacted by the presence of outliers.
As an alternative to throwing out outliers, we will look at a robust method of regression using the RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of the data, the so-called inliers.
We can summarize the iterative RANSAC algorithm as follows:
1. Select a random number of samples to be inliers and fit the model.
2. Test all other data points against the fitted model and add those points
that fall within a user-given tolerance to the inliers.
3. Refit the model using all inliers.
4. Estimate the error of the fitted model versus the inliers.
5. Terminate the algorithm if the performance meets a certain user-defined
threshold or if a fixed number of iterations has been reached; go back to step
1 otherwise.
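A hedged sketch of this procedure with scikit-learn's RANSACRegressor on synthetic data with injected outliers; min_samples, residual_threshold, and max_trials map onto steps 1, 2, and 5 above, and their values here are arbitrary:
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=100)
y[::10] += 30                                         # inject some gross outliers
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,              # iteration budget (step 5)
                         min_samples=50,              # size of the random inlier subset (step 1)
                         residual_threshold=5.0,      # user-given tolerance (step 2)
                         random_state=0)
ransac.fit(X, y)
print(ransac.estimator_.coef_)                        # slope fitted on the inliers only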
>>> from sklearn.linear_model import Ridge
>>> ridge = Ridge(alpha=1.0)
>>> from sklearn.linear_model import Lasso
>>> lasso = Lasso(alpha=1.0)
>>> from sklearn.linear_model import ElasticNet
>>> elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)
A random forest, which is an ensemble of multiple decision trees, can be understood as the sum of piecewise linear functions
An advantage of the decision tree algorithm is that it does not require any transformation of the features if we are dealing with nonlinear data.
We should not pass the squareform distance matrix that we defined earlier to linkage, since it would yield different distance values from those expected. To sum it up, the three possible scenarios are listed here:
Incorrect approach: using the squareform distance matrix. The code is as follows:
>>> from scipy.cluster.hierarchy import linkage
>>> row_clusters = linkage(row_dist,
method='complete',
metric='euclidean')
Correct approach: using the condensed distance matrix returned by pdist:
>>> from scipy.spatial.distance import pdist
>>> row_clusters = linkage(pdist(df, metric='euclidean'),
method='complete')
Correct approach: using the original input sample matrix directly:
>>> row_clusters = linkage(df.values,
method='complete',
metric='euclidean')
A great collection of Theano tutorials can be found at http://deeplearning.net/software/theano/tutorial/index.html#tutorial.
There are also a number of interesting libraries that are being actively developed to train neural networks in Theano, which you should keep on your radar:
Pylearn2 (http://deeplearning.net/software/pylearn2/)
Lasagne (https://lasagne.readthedocs.org/en/latest/)
Keras (http://keras.io)
Cannot get Theano to install; SciPy was installed via apt, and uninstalling it would also uninstall a number of other programs. So I skimmed the rest of this chapter. It seems straightforward enough.
Ridge regression not only addresses the issue of multicollinearity, but also situations where the number of input variables greatly exceeds the number of samples. The linear_model.Ridge() object uses what is known as L2 regularization. Intuitively, we can understand this as adding a penalty on the extreme values of the weight vector. This is sometimes called shrinkage because it makes the average weights smaller. This tends to make the model more stable because it reduces its sensitivity to extreme values.
The Sklearn object, linear_model.Ridge, adds a regularization parameter, alpha. Generally, small positive values for alpha improve the model’s stability. It can be either a float or an array. If it is an array, it is assumed that the array corresponds to specific targets, and therefore, it must be the same size as the target.
Determining what is redundant or irrelevant is the major function of dimensionality reduction algorithms. There are basically two approaches: feature extraction and feature selection. Feature selection attempts to find a subset of the original feature variables. Feature extraction, on the other hand, creates new feature variables by combining correlated variables.
most common feature extraction algorithm, that is, Principal Component Analysis or PCA. This uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. The important information (the lengths of the vectors and the angles between them) does not change.
Probably the most versatile kernel function, and the one that gives good results in most situations, is the Radial Basis Function (RBF). The rbf kernel takes a parameter, gamma, which can be loosely interpreted as the inverse of the sphere of influence of each sample. A low value of gamma means that each sample has a large radius of influence on samples selected by the model. The KernelPCA fit_transform method takes the training vector, fits it to the model, and then transforms it into its principal components.
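A minimal KernelPCA sketch on the classic half-moons toy data; the gamma value is a placeholder that would normally be tuned:
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=100, random_state=123)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)   # gamma ~ inverse radius of influence
X_kpca = kpca.fit_transform(X)                             # fit, then project onto the principal components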
w = (X^T X)^-1 * X^T y
One of the advantages of using the normal equation is that you do not need to worry about feature scaling.
Another advantage of the normal equation is that you do not need to choose the learning rate.
The normal equation has its own particular disadvantages; foremost is that it does not scale well when we have data with a large number of features. We need to calculate the inverse of X^T X, that is, the transpose of our feature matrix multiplied by the feature matrix itself. This calculation results in an n by n matrix, where n is the number of features. On most platforms, the time it takes to invert a matrix grows approximately as the cube of n. So, for data with a large number of features, say greater than 10,000, you should probably consider using gradient descent rather than the normal equation. Another problem arises when we have more features than training samples, that is, when n is greater than m: the normal equation without regularization will not work, because the matrix X^T X is non-invertible (singular), and so there is no way to calculate our term, (X^T X)^-1.
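For illustration, a small NumPy sketch of the normal equation on synthetic data (my own example; it uses np.linalg.solve rather than explicitly forming the inverse, which is numerically preferable):
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

Xb = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend a column of ones for the intercept
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)           # solves (X^T X) w = X^T y, i.e. w = (X^T X)^-1 X^T y
print(w)                                           # approximately [4.0, 1.5, -2.0, 0.5]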
In the one versus all approach, [to multiclass classification] a single multiclass problem is transformed into a number of binary classification problems. This is called the one versus all technique because we take each class in turn and fit a hypothesis function for that particular class, assigning a negative class to the other classes. We end up with different classifiers, each of which is trained to recognize one of the classes.
With another approach called the one versus one method, a classifier is constructed for each pair of classes. When the model makes a prediction, the class that receives the most votes wins.
Sklearn implements the one versus all algorithm using the OneVsRestClassifier class and the one versus one algorithm with OneVsOneClassifier.
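A minimal sketch of both wrappers on the Iris data; the base LogisticRegression estimator is an arbitrary choice:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

iris = load_iris()
ovr = OneVsRestClassifier(LogisticRegression()).fit(iris.data, iris.target)  # one classifier per class
ovo = OneVsOneClassifier(LogisticRegression()).fit(iris.data, iris.target)   # one classifier per pair of classes
print(len(ovr.estimators_), len(ovo.estimators_))   # 3 per-class models and 3 pairwise models for Iris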
Tree-based models are particularly well suited to ensembles, primarily because they can be sensitive to changes in the training data.
The random forest technique builds each tree using a different random subset of the features, which is why it is called a random forest.
Generally, in classification tasks, there are three reasons why a model may misclassify a test instance. Firstly, it may simply be unavoidable if instances from different classes are described by the same feature vectors. In probabilistic models, this happens when the class distributions overlap so that an instance has non-zero likelihoods for several classes. Here we can only approximate a target hypothesis.
The second reason for classification errors is that the model does not have the expressive capabilities to fully represent the target hypothesis. For example, even the best linear classifier will misclassify instances if the data is not linearly separable. This is due to the bias of the classifier. Although there is no single agreed way to measure bias, we can see that a nonlinear decision boundary will have less bias than a linear one, or that more complex decision boundaries will have less bias than simpler ones. We can also see that tree models have the least bias because they can continue to branch until only a single instance is covered by each leaf.
Bagging is primarily a variance reduction technique and boosting is primarily a bias reduction technique.
Bagging ensembles work most effectively with high variance models, such as complex trees, whereas boosting is typically used with high bias models such as linear classifiers.
The importance of defining a scoring strategy should not be underestimated, and in Sklearn, there are basically three approaches:
Estimator score: This refers to using the estimator’s inbuilt score() method, specific to each estimator
Scoring parameters: This refers to cross-validation tools relying on an internal scoring strategy
Metric functions: These are implemented in the metrics module
Grid search is probably the most widely used method of optimizing hyperparameters; however, there are times when it may not be the best choice. The RandomizedSearchCV object implements a randomized search over possible parameters. It uses a dictionary similar to the GridSearchCV object; however, for each parameter, a distribution can be set, over which a random search of values will be made. If the dictionary contains a list of values, these will be sampled uniformly. Additionally, the RandomizedSearchCV object has an n_iter parameter that is effectively a computational budget for the number of parameter settings sampled. It defaults to 10; higher values will generally give better results, but at the expense of runtime.
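A hedged RandomizedSearchCV sketch; the SVC estimator, the exponential distributions, and n_iter=20 are placeholder choices:
from scipy.stats import expon
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_dist = {'C': expon(scale=100),           # continuous distribution: sampled directly
              'gamma': expon(scale=0.1),
              'kernel': ['rbf', 'linear']}      # a list is sampled uniformly
rs = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                        n_iter=20,              # the computational budget
                        cv=5, random_state=0)
rs.fit(iris.data, iris.target)
print(rs.best_params_)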
There are alternatives to the brute-force approach of the grid search, and these are provided in estimators such as LassoCV and ElasticNetCV. Here, the estimator itself optimizes its regularization parameter by fitting it along a regularization path. This is usually more efficient than using a grid search.
This topology makes Boltzmann machines stochastic—probabilistic rather than deterministic
[RBM] is an energy-based model, which means that it uses an energy function to associate an energy value with each configuration of the network.
The RBM is most commonly used as a pretraining mechanism for a highly effective deep network architecture called a DBN [deep belief networks]. DBNs are extremely powerful tools to learn and classify a range of image datasets. They possess a very good ability to generalize to unknown cases and are among the best image-learning tools available.
it has been noted that even if the layers don’t contain very many nodes, with enough layers, more or less any function can be modeled.
The design of convolutional neural networks takes inspiration from the visual cortex
Google’s DeepDream program, which became well-known for its overtrained, hallucinogenic imagery, also uses a convolutional neural network.
Facebook uses convolutional nets in face verification (DeepFace).