Building Machine Learning Systems with Python
Author: Willi Richert, Luis Pedro Coelho
Pub Year: 2013
Source:
Read: 2018-03-23
Last Update: 2018-03-23

Five Sentence Abstract:

The first, and most important, point that the book tries to instill is that most of your time will be spent on feature engineering - preparing the right attributes to produce the best results - not on writing or implementing the other, sexier, aspects of machine learning. As the old adage goes: garbage in, garbage out. Once your data is cleaned up, the book covers many different algorithms one can use to build a model: clustering, bag-of-words, K-Means, K-Nearest Neighbors, regression, Naive Bayes, or even a combination of several methods, known as an ensemble. The examples are generally applicable to real-life problems one might face: text/sentiment analysis, recommendation systems, and audio/visual classification. It concludes with an introduction to cluster, or cloud, computing: the Python library 'jug' lets you take advantage of more powerful processing, and Amazon Web Services (AWS) lets you quickly set up your first cluster.

Thoughts:

This book is by no means "the definitive guide" to machine learning. It is more of a starter kit, and even that might be pushing it. It assumes you already have some foundation in the mathematics, though it does offer some light insight.

The code/examples start off great, but they get wonky fast. Sometimes variable names seem to change midway through an example, where 'target' becomes 'labels' and such. Other parts of the code don't work, period. That could be because the book (2013) is a little old in computer years. One example:

The array comparison features[:,2] < 2 works on its own, but not when used as:

if features[:,2] < 2:
    print("yes")
else:
    print("no")

The datasets are not included and I had some difficulty finding a few of them. For example, the 20newsgroups data is easy to find, but it does not work with the code in the book. The book does have a companion site, but you need to log in and verify your purchase to access the data.

The further I get, the less complete the code is. Playing with the example code is how I come to a better understanding of what is actually going on; without it I am at a disadvantage. The book also shows plot after plot with no code for how they were generated. It does give descriptions, but as a beginner I was often unable to understand exactly what was being done.

I am disappointed in the book overall. I don't want to put it down; it simply didn't fulfill what I wanted it to. I want a nice hand-holding experience where I can see exactly what the code is, along with line-by-line explanations of what is happening. This is not that book. I don't think what I want will be found in a 200-300 page book; there is too much information to cover.

On the bright side, it did give me more exposure to the concepts and helped reinforce my vocabulary on the subject, which will likely make further forays into machine learning more palatable.

The last chapter switches gears: the book discusses a Python library, jug, that can be used for cluster computing. Amazon Web Services (AWS) is also covered as a quick and easy way to get a computational cluster up and running.
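
I didn't get far enough to use jug myself, but the gist is that you mark functions as tasks and then run the script with `jug execute`, possibly from several machines sharing a filesystem. A rough sketch from my understanding (the toy functions are my own, not the book's):

# primes.py - toy jug script; run one or more `jug execute primes.py` workers
from jug import TaskGenerator

@TaskGenerator
def is_prime(n):
    # naive primality check, just to give the workers something to chew on
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

@TaskGenerator
def count_primes(flags):
    return sum(flags)

# each call creates a task; jug stores results on disk so workers share them
results = [is_prime(n) for n in range(2, 1000)]
total = count_primes(results)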

It does have a detailed table of contents, something many other books lack these days.

Code

This is probably not very worthwhile, but here is the little bit of code I tinkered with in relation to this book:

boston_houses.py
clustering.py
iris_classification.py
knn.py
recursive_feature_elimination.py
web_traffic.py

Links

In case you don't have a conversion tool nearby, you might want to check out sox: http://sox.sourceforge.net. It claims to be the Swiss Army Knife of sound processing, and we agree with this bold claim.

http://metaoptimize.com/qa – This Q&A site is laser-focused on machine learning topics.

Exceptional Excerpts:

Notes:

Table of Contents

01: Getting Started with Python Machine Learning
02: Learning How to Classify with Real-world Examples
03: Clustering – Finding Related Posts
04: Topic Modeling
05: Classification – Detecting Poor Answers
06: Classification II – Sentiment Analysis
07: Regression – Recommendations
08: Regression – Recommendations Improved
09: Classification III – Music Genre Classification
10: Computer Vision – Pattern Recognition
11: Dimensionality Reduction
12: Big(ger) Data

01: Getting Started with Python Machine Learning

page 9:
page 15:
>>> a = np.array([0,1,77,3,4,5])
>>> a[np.array([2,3,4])]
array([77, 3, 4])

>>> a = np.array([0,1,77,3,4,5])
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])

>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])
# let's pretend we have read this from a text file
>>> c = np.array([1, 2, np.NAN, 3, 4])
>>> c
array([ 1., 2., nan, 3., 4.])
>>> np.isnan(c)
array([False, False, True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1., 2., 3., 4.])
>>> np.mean(c[~np.isnan(c)])
2.5
page 16:
page 17:
page 18:

page 22:
import scipy as sp

def error(f, x, y):
    # sum of squared differences between the model's predictions and the data
    return sp.sum((f(x) - y) ** 2)
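
The f here is a fitted model - in the chapter it comes from polyfit on the web traffic data. A rough sketch of how it gets used, with made-up data and NumPy standing in for the book's scipy alias:

import numpy as np

def error(f, x, y):
    # sum of squared residuals of model f on the points (x, y)
    return np.sum((f(x) - y) ** 2)

# toy data standing in for the book's web traffic set
x = np.arange(1, 11, dtype=float)
y = 3 * x + 5 + np.random.randn(10)

f1 = np.poly1d(np.polyfit(x, y, 1))   # straight-line fit
f2 = np.poly1d(np.polyfit(x, y, 2))   # quadratic fit
print(error(f1, x, y), error(f2, x, y))
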
page 24:

02: Learning How to Classify with Real-world Examples

page 34:
Sepal length
Sepal width
Petal length
Petal width
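
These are the four measurements per flower in the Iris dataset; scikit-learn ships it, so pulling it up for the chapter's classification examples looks roughly like this:

from sklearn.datasets import load_iris

data = load_iris()
print(data.feature_names)   # sepal/petal length and width, in cm
print(data.data.shape)      # (150, 4): 150 flowers, 4 measurements each
print(data.target_names)    # ['setosa' 'versicolor' 'virginica']
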
page 42:
page 43:

03: Clustering – Finding Related Posts

page 59:
page 60:
import math

def tfidf(term, doc, docset):
  # term frequency: occurrences of the term relative to all words in this doc
  tf = float(doc.count(term)) / sum(doc.count(w) for w in set(doc))
  # inverse document frequency: penalize terms that appear in many documents
  idf = math.log(float(len(docset)) / len([d for d in docset if term in d]))
  return tf * idf
page 61:
>>> a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
>>> D = [a, abb, abc]
>>> print(tfidf("a", a, D))
0.0
>>> print(tfidf("b", abb, D))
0.270310072072
>>> print(tfidf("a", abc, D))
0.0
>>> print(tfidf("b", abc, D))
0.135155036036
>>> print(tfidf("c", abc, D))
0.366204096223
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

# english_stemmer: the chapter builds this from NLTK's Snowball stemmer
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
  def build_analyzer(self):
    analyzer = super(TfidfVectorizer, self).build_analyzer()
    return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1,
                                    stop_words='english',
                                    charset_error='ignore')
# note: charset_error was later renamed to decode_error in scikit-learn
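
A quick usage sketch of my own, continuing with the vectorizer defined above on a few throwaway posts (not the book's data):

# continues from the block above: vectorizer is the StemmedTfidfVectorizer instance
posts = ["imaging databases provide storage",
         "imaging databases store images",
         "most imaging databases save images permanently"]
X = vectorizer.fit_transform(posts)
print(vectorizer.get_feature_names())   # the stemmed vocabulary
print(X.shape)                          # (3 posts, number of stemmed terms)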

04: Topic Modeling

page 75:
page 78:

05: Classification – Detecting Poor Answers

page 110:
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
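
Both fall straight out of a confusion matrix; a small sketch with sklearn.metrics on made-up labels (my own, not from the book):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
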
page 112:

06: Classification II – Sentiment Analysis

07: Regression – Recommendations

page 154:

08: Regression – Recommendations Improved

09: Classification III – Music Genre Classification

page 182:
page 190:
page 191:
page 197:

10: Computer Vision – Pattern Recognition

page 220:

11: Dimensionality Reduction

page 221:

resulting in better end results.

page 222:
page 223:
from scipy.stats import pearsonr

pearsonr([1,2,3], [1,2,3.1])
# (0.99962228516121843, 0.017498096813278487)  (correlation coefficient, p-value)
pearsonr([1,2,3], [1,20,6])
# (0.25383654128340477, 0.83661493668227405)
page 225:
page 226:
page 227:

page 231:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# toy dataset: 100 samples, 10 features, only 3 of which carry real signal
X, y = make_classification(n_samples=100, n_features=10, n_informative=3, random_state=0)
clf = LogisticRegression()
clf.fit(X, y)
# recursively eliminate the weakest features until 3 remain
selector = RFE(clf, n_features_to_select=3)
selector = selector.fit(X, y)
print(selector.support_)   # True marks the features RFE kept
# [False True False True False False False False True False]
print(selector.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier
# [4 1 3 1 8 5 7 6 1 2]
page 233:
page 236:
page 237:

12: Big(ger) Data

page 243:









