Five Sentence Abstract:
This book starts off extolling the virtues of test driven design and the SOLID principles, which, while respectable, seem somewhat forced and out of place in a machine learning book. Next, much of the code is written from scratch, without relying on libraries, giving a look under the hood. The code itself, while complete, lacks the thorough explanation one would hope for in a book for beginners. The equations underlying the algorithms are presented but, like the code, lack explanation or proof. All in all, the book tries to cover too much in too few pages.
Thoughts:
While there is a lot of code provided, the accompanying descriptions are
lacking. The test driven aspect, in my opinion, gets in the way if you are
trying to learn machine learning.
The code is built, in most cases, without the help of sklearn, except in a few
instances. While I assumed it would be informative to 'look under the hood' at
how some of the functionality normally handled by libraries works, I was wrong
(at least in regard to this book). With more detailed explanations and
line-by-line walkthroughs, I think this would be a much better book.
As it stands now, it tries to cover too much in too few pages.
Additionally, the mathematics covered is sparse and without proofs or much in
the way of explanations.
Notes:
Table of Contents
01: Probably Approximately Correct Software
02: A Quick Introduction to Machine Learning
03: K-Nearest Neighbors
04: Naive Bayesian Classification
05: Decision Trees and Random Forests
06: Hidden Markov Models
07: Support Vector Machines
08: Neural Networks
09: Clustering
10: Improving Models and Data Extraction
11: Putting It Together: Conclusion
 // Actual page numbers from the book.
page 018:
page 026:
page 031:
page 47:
 P(A_1, A_2, ..., A_n) = P(A_1) * P(A_2 ∣ A_1) * P(A_3 ∣ A_1, A_2) * ... * P(A_n ∣ A_1, A_2, ..., A_{n−1})
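
Since the book gives the chain rule without explanation, here is a small numeric check of my own (not from the book). The joint distribution over three binary events is made up, and the helper names `marginal` and `chain_rule` are hypothetical. The idea is just that each conditional is a ratio of marginals, so the product telescopes back to the joint probability.

```python
from itertools import product

# Hypothetical joint distribution over three binary events (A_1, A_2, A_3).
# The weights are invented for illustration and are normalized to sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}


def marginal(prefix):
    """P(A_1..A_k = prefix), summing the joint over the remaining events."""
    k = len(prefix)
    return sum(p for o, p in joint.items() if o[:k] == prefix)


def chain_rule(outcome):
    """Multiply P(A_1) * P(A_2 | A_1) * ... * P(A_n | A_1, ..., A_{n-1})."""
    result = 1.0
    for k in range(1, len(outcome) + 1):
        # P(A_k | A_1..A_{k-1}) = P(A_1..A_k) / P(A_1..A_{k-1})
        result *= marginal(outcome[:k]) / marginal(outcome[:k - 1])
    return result


# The product of conditionals should match the joint for every outcome.
for o in outcomes:
    assert abs(chain_rule(o) - joint[o]) < 1e-12
```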

page 74:

Precision = True Positives / (True Positives + False Positives)

Precision is a measure of how on point the classification is. For instance,
out of all the positive matches the model finds, how many of them were correct?

Recall = True Positives / (True Positives + False Negatives)

Recall can be thought of as the sensitivity of the model. It is a measure of
whether all the relevant instances were actually looked at.

Accuracy = (True Positives + True Negatives) / (Number of all responses)

Accuracy as we know it is simply an error rate of the model. How well does it
do in aggregate?
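
To make the three definitions concrete, here is a minimal sketch of my own (not the book's code) that computes them from confusion-matrix counts. The counts in the example are invented. Note that accuracy computed this way is the fraction of correct responses, i.e. one minus the error rate.

```python
def precision(tp, fp):
    """Of everything the model flagged as positive, how much was correct?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that was actually positive, how much did the model find?"""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all responses that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for illustration only.
tp, fp, fn, tn = 8, 2, 4, 6

p = precision(tp, fp)         # 8 / 10
r = recall(tp, fn)            # 8 / 12
a = accuracy(tp, tn, fp, fn)  # 14 / 20
error_rate = 1 - a            # 6 / 20
```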
page 165:
Jon Kleinberg, who touts it as the impossibility theorem, which states that
you can never have more than two of the following when clustering:
1. Richness
2. Scale invariance
3. Consistency


Richness is the notion that there exists a distance function that will yield
all different types of partitions. What this means intuitively is that a
clustering algorithm has the ability to create all types of mappings from data
points to cluster assignments.

Scale invariance is simple to understand. Imagine that you were building a
rocket ship and started calculating everything in kilometers until your boss
said that you need to use miles instead. There wouldn’t be a problem switching;
you just need to divide by a constant on all your measurements. It is scale
invariant. Basically if the numbers are all multiplied by 20, then the same
cluster should happen.
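
A quick way to see scale invariance in action is the sketch below, which is mine and not from the book. It uses a toy single-linkage clustering on 1-D points that stops at k clusters; the function name `single_linkage` and the data are assumptions for illustration. Multiplying every point by 20 multiplies every distance by 20, so the merge order and the final partition should not change.

```python
def single_linkage(points, k):
    """Merge the two closest clusters until only k remain (1-D points).

    Clusters are lists of point indices; the distance between two clusters
    is the smallest distance between any pair of their members.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical data: the same points in two different units.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 30.0]
scaled = [20 * p for p in points]

original_partition = single_linkage(points, 3)
scaled_partition = single_linkage(scaled, 3)
assert original_partition == scaled_partition
```

Stopping at a fixed k is what costs this approach richness: it can never return a partition with a different number of clusters, no matter what the distances are.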

Consistency is more subtle. Similar to scale invariance, if we shrank the
distance between points inside of a cluster and then expanded them, the cluster
should yield the same result. At this point you probably understand that
clustering isn’t as good as many originally think. It has a lot of issues and
consistency is definitely one of those that should be called out.
page 183: