designates my notes. / designates important.
This book is by no means “the definitive guide” to machine learning. It is more of a starter kit, and even that might be pushing it. It assumes you already have some foundation in the mathematics, though it does offer some light insight.
The code examples started off great, but they get wonky fast. Sometimes variable names change midway through an example, where ‘target’ becomes ‘labels’ and such. Other parts of the code don’t work, period. It could be because the book (2013) is a little old in computer years.
The array comparison features[:,2] < 2 works, but not when used as:
if features[:,2] < 2:
    print("yes")
else:
    print("no")
The datasets are not included and I had some difficulty finding some of them. For example the 20newsgroups data is easy to find, but does not work with the code in the book. Actually, I think the book has a companion site, but you need to log in/verify purchase to access the data.
The further I get, the less complete the code becomes. Playing with the example code is how I get a better understanding of what is actually going on; without it I am at a disadvantage. The book also shows plot after plot with no code for how they were generated. It does give a description, but as a beginner I was often unable to understand exactly what they were doing.
I am disappointed in the book overall. I don’t mean to put it down; it simply didn’t fulfill what I wanted it to fulfill. I want a nice hand-holding experience where I can see exactly what the code is, along with line-by-line explanations of what is happening. This is not that book. I don’t think what I want will be found in a 200-300 page book; there is too much information.
On the bright side, it gave me more exposure to the concepts and reinforced my vocabulary on the subject, which will likely make further forays into machine learning more palatable.
The last chapter switches gears: the book discusses a Python library, Jug, that can be used for cluster computing. Amazon Web Services is also covered as a quick and easy way to get a computational cluster up and running.
It does have a detailed table of contents, something many other books lack these days.
This is probably not very worthwhile, but here are the passages I highlighted and the little bit of code I tinkered with in relation to this book:
In case you don’t have a conversion tool nearby, you might want to check out sox: http://sox.sourceforge.net. It claims to be the Swiss Army Knife of sound processing, and we agree with this bold claim.
http://metaoptimize.com/qa – This Q&A site is laser-focused on machine learning topics.
You will even find that a simple algorithm with refined data generally outperforms a very sophisticated algorithm with raw data. This part of the machine learning workflow is called feature engineering…
>>> import numpy as np
>>> a = np.array([0, 1, 77, 3, 4, 5])
>>> a[np.array([2, 3, 4])]     # index with an array of positions
array([77,  3,  4])
>>> a = np.array([0, 1, 77, 3, 4, 5])
>>> a[a > 4] = 4               # boolean indexing caps every value above 4
>>> a
array([0, 1, 4, 3, 4, 4])
>>> a.clip(0, 4)               # clip() gives the same result without mutating a
array([0, 1, 4, 3, 4, 4])
>>> # let's pretend we have read this from a text file
>>> c = np.array([1, 2, np.NAN, 3, 4])
>>> c
array([ 1., 2., nan, 3., 4.])
>>> np.isnan(c)
array([False, False, True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1., 2., 3., 4.])
>>> np.mean(c[~np.isnan(c)])
2.5
import scipy as sp
def error(f, x, y):
    # sum of squared differences between the model's predictions f(x) and the targets y
    return sp.sum((f(x) - y)**2)
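The book uses this to score polynomial fits. Here is roughly how I ended up calling it while tinkering, assuming the error() definition above is in scope (toy numbers of my own; the book writes sp.polyfit/sp.poly1d, but the NumPy equivalents behave the same and survive newer SciPy releases):
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # made-up, roughly linear data

f1 = np.poly1d(np.polyfit(x, y, 1))       # fit a straight line and wrap it as a callable
print(error(f1, x, y))                    # small sum of squared residuals -> decent fit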
Sepal length
Sepal width
Petal length
Petal width
In general, we will call any measurement from our data a feature.
This is the supervised learning or classification problem; given labeled examples, we can design a rule that will eventually be applied to other examples.
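To make the features/labels vocabulary concrete for myself, I tinkered with scikit-learn's bundled Iris data. This is my own sketch, not the book's code; a nearest-neighbour model just keeps it short:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data       # features: sepal length/width and petal length/width per flower
y = iris.target     # labels: which of the three species each flower belongs to

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)                  # learn a rule from the labeled examples
print(clf.predict(X[:3]))      # apply that rule to some examples (here, ones it has seen)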
we want a high value for a given term in a given document if that term occurs often in that particular post and very rarely anywhere else.
This is exactly what term frequency – inverse document frequency (TF-IDF) does; TF stands for the counting part, while IDF factors in the discounting. A naive implementation would look like the following:
import math

def tfidf(term, doc, docset):
    # term frequency: occurrences of the term in this doc, normalized by the doc's length
    tf = float(doc.count(term)) / sum(doc.count(w) for w in set(doc))
    # inverse document frequency: discount terms that show up in many documents
    idf = math.log(float(len(docset)) / len([d for d in docset if term in d]))
    return tf * idf
>>> a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
>>> D = [a, abb, abc]
>>> print(tfidf("a", a, D))
0.0
>>> print(tfidf("b", abb, D))
0.270310072072
>>> print(tfidf("a", abc, D))
0.0
>>> print(tfidf("b", abc, D))
0.135155036036
>>> print(tfidf("c", abc, D))
0.366204096223
We see that a carries no meaning for any document since it is contained everywhere. b is more important for the document abb than for abc as it occurs there twice.
In reality, there are more corner cases to handle than the above example does. Thanks to Scikit, we don’t have to think of them, as they are already nicely packaged in TfidfVectorizer, which inherits from CountVectorizer. Sure enough, we don’t want to miss our stemmer:
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')   # the stemmer the snippet relies on

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # reuse the standard analyzer, then stem every token it yields
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (
            english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1,
                                    stop_words='english',
                                    charset_error='ignore')   # renamed to decode_error in newer scikit-learn
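A quick check I ran to see the stemming kick in, with three throwaway sentences of my own (with a newer scikit-learn you would also swap charset_error for decode_error and get_feature_names for get_feature_names_out):
docs = ["imaging databases provide storage",
        "imaging databases store images",
        "most imaging databases save images permanently"]

X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())   # expect stemmed forms like 'imag' and 'databas'
print(X.shape)                          # (number of documents, number of stemmed terms)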
Sparsity means that while you may have large matrices and vectors, in principle, most of the values are zero (or so small that we can round them to zero as a good approximation). Therefore, only a few things are relevant at any given time.
Often problems that seem too big to solve are actually feasible because the data is sparse. For example, even though one webpage can link to any other webpage, the graph of links is actually very sparse as each webpage will link to a very tiny fraction of all other webpages.
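You can poke at that sparsity directly, since the vectorizers hand back a scipy.sparse matrix rather than a dense array. A small self-contained sketch with throwaway sentences (on a real corpus the density would be far smaller):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["imaging databases provide storage",
        "imaging databases store images",
        "most imaging databases save images permanently"]

X = TfidfVectorizer(min_df=1).fit_transform(docs)     # a scipy.sparse matrix
print(X.shape)                                        # (n_documents, n_terms)
print(X.nnz)                                          # non-zero entries actually stored
print(X.nnz / float(X.shape[0] * X.shape[1]))         # density: the fraction that is non-zero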
Precision = True Positive/(True Positive + False Positive)
Recall = True Positive/(True Positive + False Negative)
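A quick sanity check of those two formulas against scikit-learn, with labels I made up so the counts are easy to do in your head:
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # TP = 3, FP = 1, FN = 1 for the positive class

print(precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75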
…we used features that we understood only so much as to know how and where to put them into our classifier setup. The one failed, the other succeeded. The difference between them is that in the second case, we relied on features that were created by experts in the field.
And that is totally OK. If we are mainly interested in the result, we sometimes simply have to take shortcuts—we only have to make sure to take these shortcuts from experts in the specific domains.
Superfluous features can irritate or mislead the learner. This is not the case with all machine learning methods (for example, Support Vector Machines love high-dimensional spaces). But most models feel safer with fewer dimensions.
Another argument against high-dimensional feature spaces is that more features mean more parameters to tune and a higher risk of overfitting.
- The data we retrieved to solve our task might just have artificially high dimensions, whereas the real dimension might be small.
- Fewer dimensions mean faster training and more variations to try out, resulting in better end results.
from scipy.stats import pearsonr
pearsonr([1, 2, 3], [1, 2, 3.1])
# (0.99962228516121843, 0.017498096813278487)   # (correlation coefficient, p-value)
pearsonr([1, 2, 3], [1, 20, 6])
# (0.25383654128340477, 0.83661493668227405)
In the first case, we have a clear indication that both series are correlated. In the second one, we still have a clearly non-zero value, but the high p-value tells us not to read much into it.
This only works on linear relationships.
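That caveat is easy to demonstrate: make y depend perfectly but non-linearly on x and Pearson reports essentially zero correlation. My own toy example:
from scipy.stats import pearsonr

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]       # perfect, but purely non-linear, dependence

r, p = pearsonr(x, y)
print(r)                     # ~0.0 -- Pearson sees no linear relationship at all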
A real workhorse in this field is RFE, which stands for recursive feature elimination. It takes an estimator and the desired number of features to keep as parameters, then repeatedly trains the estimator on shrinking feature sets until it has found a subset of the features that is small enough. The RFE instance itself behaves like an estimator, thereby wrapping the provided estimator.
In the following example, we create an artificial classification problem of 100 samples using the convenient make_classification() function of datasets. It lets us specify the creation of 10 features, out of which only three are really valuable to solve the classification problem:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X,y = make_classification(n_samples=100, n_features=10, n_informative=3, random_state=0)
clf = LogisticRegression()
clf.fit(X, y)
selector = RFE(clf, n_features_to_select=3)
selector = selector.fit(X, y)
print(selector.support_)
# [False True False True False False False False True False]
print(selector.ranking_)
# [4 1 3 1 8 5 7 6 1 2]
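Continuing from that snippet, the fitted selector can also be used directly as a transformer to get the reduced feature matrix (my own addition, not shown in the book):
X_reduced = selector.transform(X)    # keeps only the three selected columns
print(X_reduced.shape)               # (100, 3)
# selector.predict(X) also works, delegating to the wrapped LogisticRegression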