Feineigle.com - Building Machine Learning Systems with Python


Published: March 23, 2018



The book in...
One sentence:
This book covers the basics of feature engineering and algorithms while giving minor insight into each, but leaves much to be desired when it comes to code examples.

Five sentences:
The first, and most important, point that the book tries to instill is that most of your time will be spent on feature engineering - preparing the right attributes to produce the best results - not on writing or even implementing the other, sexier, aspects of machine learning. As the old adage goes, garbage in, garbage out. Once your data is cleaned up, the book covers many different algorithms you can use to build a model: clustering, bag-of-words, K-Means, K-Nearest Neighbors, regression, naive Bayes, or even a combination of several methods, known as an ensemble. The examples are generally applicable to real-life problems one might face: text/sentiment analysis, recommendation systems, and audio/visual classification. It concludes with an introduction to cluster, or cloud, computing that lets you take advantage of more powerful processing through the Python library 'jug' and quickly set up your first cluster on Amazon Web Services (AWS).



Thoughts

This book is by no means “the definitive guide” to machine learning. It is more of a starter kit, and even that might be pushing it. It assumes you already have some foundational mathematical skills, though it does offer some light insight.

The code/examples started off great, but they get wonky fast. Sometimes the names of variables seem to change midway through an example, where ‘target’ becomes ‘labels’ and such. Other parts of the code don’t work, period. It could be because the book is a little old (2013) in computer years.

the array comparison features[:,2] < 2 works, but not when used as
if features[:,2] < 2:   # raises ValueError: truth value of an array is ambiguous
  print("yes")
else:
  print("no")
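The failure makes sense once you see that the comparison returns a boolean array rather than a single True/False, so an if statement has nothing to decide on. A minimal sketch of how one could branch on it (the sample rows are made up, not from the book):

import numpy as np

# two made-up Iris-style rows: sepal length, sepal width, petal length, petal width
features = np.array([[5.1, 3.5, 1.4, 0.2],
                     [4.9, 3.0, 1.3, 0.2]])

mask = features[:,2] < 2      # boolean array, one entry per row
if mask.all():                # reduce the array to a single truth value
  print("every petal length is below 2")
elif mask.any():
  print("some petal lengths are below 2")
else:
  print("no petal lengths are below 2")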

The datasets are not included and I had some difficulty finding some of them. For example, the 20newsgroups data is easy to find, but it does not work with the code in the book. Actually, I think the book has a companion site, but you need to log in/verify your purchase to access the data.

The further I got, the less complete the code became. Playing with the example code is how I get a better understanding of what is actually going on; without it I am at a disadvantage. The book also shows plot after plot without the code that generated them. It does give descriptions, but as a beginner I was often unable to understand exactly what they were showing.

I am disappointed in the book overall. I don’t mean to put it down; it simply didn’t fulfill what I wanted it to. I want a nice hand-holding experience where I can see exactly what the code is, along with line-by-line explanations of what is happening. This is not that book. I don’t think what I want will be found in a 200-300 page book; there is too much information.

On the bright side, it did give me more exposure to the concepts and helped reinforce my vocabulary on the subject, likely making further forays into machine learning more palatable.

The last chapter switches gears and discusses a Python library, jug, that can be used for cluster computing. Amazon Web Services (AWS) is also covered as a quick and easy way to get a computational cluster up and running.
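From what I gather, the core idea of jug is that you decorate functions as tasks and jug works out the dependency graph, so the work can be split across processes or machines. A minimal sketch (not the book's exact example; assumes jug is installed):

from jug import TaskGenerator

@TaskGenerator
def double(x):
  return 2 * x

@TaskGenerator
def add(a, b):
  return a + b

# calling decorated functions only builds tasks; running `jug execute thisfile.py`
# on one or more machines sharing the same working directory actually executes them
y = double(2)
z = double(y)        # depends on y, so jug schedules double(2) first
result = add(y, z)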

It does have a detailed table of contents, something many other books lack these days.

Code

This is probably not very worthwhile, but here is the little bit of code I tinkered with in relation to this book:

In case you don’t have a conversion tool nearby, you might want to check out sox: http://sox.sourceforge.net. It claims to be the Swiss Army Knife of sound processing, and we agree with this bold claim.

http://metaoptimize.com/qa – This Q&A site is laser-focused on machine learning topics.


Table of Contents


· 01: Getting Started with Python Machine Learning

page 9:
page 15:
>>> import numpy as np
>>> a = np.array([0,1,77,3,4,5])
>>> a[np.array([2,3,4])]
array([77, 3, 4])
>>> a = np.array([0,1,77,3,4,5])
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])
>>> # let's pretend we have read this from a text file
>>> c = np.array([1, 2, np.NAN, 3, 4])
>>> c
array([ 1., 2., nan, 3., 4.])
>>> np.isnan(c)
array([False, False, True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1., 2., 3., 4.])
>>> np.mean(c[~np.isnan(c)])
2.5
page 16:
page 17:
page 18:
page 22:
import scipy as sp

def error(f, x, y):
  # sum of squared differences between the model's predictions f(x) and the real values y
  return sp.sum((f(x)-y)**2)
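A hedged usage sketch (the data here is made up, not the book's web traffic dataset): fit a straight line with polyfit, wrap the coefficients in a callable model, and measure its squared error.

import numpy as np

x = np.arange(10)
y = 3 * x + 1 + np.random.randn(10)   # noisy made-up data

fp1 = np.polyfit(x, y, 1)   # coefficients of a degree-1 fit
f1 = np.poly1d(fp1)         # turn the coefficients into a callable model
print(error(f1, x, y))      # squared error of the fit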
page 24:

· 02: Learning How to Classify with Real-world Examples

page 34:
Sepal length
Sepal width
Petal length
Petal width
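These four measurements are the features of the Iris dataset used throughout chapter 2. A small sketch of peeking at them, assuming scikit-learn's bundled copy of the dataset:

from sklearn.datasets import load_iris

data = load_iris()
print(data.feature_names)   # sepal length/width and petal length/width, in cm
print(data.data.shape)      # (150, 4): 150 flowers, 4 features
print(data.target_names)    # setosa, versicolor, virginica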
page 42:
page 43:

· 03: Clustering – Finding Related Posts

page 59:
page 60:
import math

def tfidf(term, doc, docset):
  # term frequency: occurrences of the term, normalized by the document's total word count
  tf = float(doc.count(term))/sum(doc.count(w) for w in set(doc))
  # inverse document frequency: down-weight terms that appear in many documents
  idf = math.log(float(len(docset))/len([d for d in docset if term in d]))
  return tf * idf
page 61:
>>> a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
>>> D = [a, abb, abc]
>>> print(tfidf("a", a, D))
0.0
>>> print(tfidf("b", abb, D))
0.270310072072
>>> print(tfidf("a", abc, D))
0.0
>>> print(tfidf("b", abc, D))
0.135155036036
>>> print(tfidf("c", abc, D))
0.366204096223
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
  def build_analyzer(self):
    analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
    return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1,
                                    stop_words='english',
                                    charset_error='ignore')  # renamed to decode_error in newer scikit-learn
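A hedged usage sketch (the two posts are made up; assumes NLTK is installed and the book-era scikit-learn API that still accepts charset_error):

posts = ["Imaging databases provide storage.",
         "Imaging databases store images."]
X = vectorizer.fit_transform(posts)
print(sorted(vectorizer.vocabulary_.keys()))   # stemmed terms such as 'imag' and 'databas'
print(X.toarray())                             # tf-idf weight of each term in each post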

· 04: Topic Modeling

page 75:
page 78:

· 05: Classification – Detecting Poor Answers

page 110:
Precision = True Positive/(True Positive + False Positive)
Recall = True Positive/(True Positive + False Negative)
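A quick numeric check of the two formulas (the counts are made up):

# suppose a classifier produced 8 true positives, 2 false positives, and 4 false negatives
tp, fp, fn = 8, 2, 4
precision = tp / float(tp + fp)   # 8/10 = 0.8
recall = tp / float(tp + fn)      # 8/12 = 0.67 (rounded)
print(precision, recall)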
page 112:

· 06: Classification II – Sentiment Analysis

· 07: Regression – Recommendations

page 154:

· 08: Regression – Recommendations Improved

· 09: Classification III – Music Genre Classification

page 182:
page 190:
page 191:
page 197:

· 10: Computer Vision – Pattern Recognition

page 220:

· 11: Dimensionality Reduction

page 221:

resulting in better end results.

page 222:
page 223:
from scipy.stats import pearsonr

pearsonr([1,2,3], [1,2,3.1])
#(0.99962228516121843, 0.017498096813278487)  # (correlation r, p-value)
pearsonr([1,2,3], [1,20,6])
#(0.25383654128340477, 0.83661493668227405)
page 225:
page 226:
page 227:
page 231:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X,y = make_classification(n_samples=100, n_features=10, n_informative=3, random_state=0)
clf = LogisticRegression()
clf.fit(X, y)
selector = RFE(clf, n_features_to_select=3)
selector = selector.fit(X, y)
print(selector.support_)
# [False True False True False False False False True False]
print(selector.ranking_)
# [4 1 3 1 8 5 7 6 1 2]
page 233:
page 236:
page 237:

· 12: Big(ger) Data

page 243:
