Why did I want to read it?

I wanted to study it for two reasons. First, I was joining Seedtag, where I expected a heavy NLP component for websites (and thus wanted to understand its implications for the Data Engineering side). Second, I wanted to see whether there is something interesting to do (again focusing on NLP) with my Obsidian vault (Second brain stats (deep focus)).

What did I get out of it?

Preface

the democratization of AI wasn’t what I had in mind. I had been doing research in machine learning for several years and had built Keras to help me with my own experiments (p. xviii)

…which is what Andy Matuschak says in How Can We Develop Transformative Tools for Thought:

There’s a general principle here: good tools for thought arise mostly as a byproduct of doing original work on serious problems. They tend either to be created by the people doing that work, or by people working very closely to them, people who are genuinely bought in. A related argument has been made in Eric von Hippel’s book “Democratizing Innovation” (2005), which identifies many instances where what appears to be commercial product development is based in large or considerable part on innovations from users.


AI is overhyped, according to Chollet, and it has happened before:

We may be currently witnessing the third cycle of AI hype and disappointment, and we’re still in the phase of intense optimism. (p. 12)

but it’s a good moment for “integrators”…

Most of the research findings of deep learning aren’t yet applied, or at least are not applied to the full range of problems they could solve across all industries. (p. 12)

1. What is deep learning?

This is a prominent idea in the book: deep learning is about engineering, with little theory, something that has been criticized (see Gradient Dissent):

Unlike statistics, machine learning tends to deal with large, complex datasets (such as a dataset of millions of images, each consisting of tens of thousands of pixels) for which classical statistical analysis such as Bayesian analysis would be impractical. As a result, machine learning, and especially deep learning, exhibits comparatively little mathematical theory—maybe too little—and is fundamentally an engineering discipline. (p. 4)

If machine learning is about learning useful representations of (and rules over) the data, for example for a classification task, deep learning is:

a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. (p. 7)

But if you are working with nonperceptual data, gradient boosted trees are usually the best algorithm (p. 16):

From 2016 to 2020, the entire machine learning and data science industry has been dominated by these two approaches: deep learning and gradient boosted trees (…) These are the two techniques you should be the most familiar with (p. 19)

Advantage of deep learning

The crux of the matter for a Data Engineer.

The main advantage of deep learning, apart from performance? No need for feature engineering! Feature engineering can be seen as a way to help shallow learning methods achieve refined representations; deep learning can do it on its own:

Deep learning removes the need for feature engineering, replacing complex, brittle, engineering-heavy pipelines with simple, end-to-end trainable models that are typically built using only five or six different tensor operations (p. 24)

Also, deep learning models lend themselves to continuous learning and can be repurposed for different tasks.
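
As a rough sketch of what such an end-to-end trainable model looks like (Keras; the layer sizes and task are made-up assumptions, not an example from the book):

```python
# A minimal end-to-end trainable model: raw inputs go straight into a stack of
# Dense layers that learn their own representations, with no hand-crafted features.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu"),    # affine transform + non-linearity
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # e.g. a binary classification head
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(raw_inputs, labels, epochs=10)  # raw_inputs/labels are placeholders
```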

Why now

Apart from the obvious (hardware improvements), the key was better gradient propagation, which allowed arbitrarily deep models. Before:

The feedback signal used to train neural networks would fade away as the number of layers increased (p. 22)

2. The mathematical building blocks of neural networks

Geometric interpretation of tensor operations

The key math, the most generic transformation:

Affine transform: An affine transform (see figure 2.12) is the combination of a linear transform (achieved via a dot product with some matrix) and a translation (achieved via a vector addition) (…) that’s exactly the y = W • x + b computation implemented by the Dense layer! (p. 46)

But you need activations for non-linearity:

An important observation about affine transforms is that if you apply many of them repeatedly, you still end up with an affine transform (…) As a consequence, a multilayer neural network made entirely of Dense layers without activations would be equivalent to a single Dense layer.
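
A quick check of that observation, as a minimal NumPy sketch (shapes and values are arbitrary):

```python
# Composing two affine transforms without an activation collapses into a single
# affine transform, so the stacked "layers" add no expressive power.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                                # input vector
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2,))

# Two stacked "Dense layers" without activations...
y = W2 @ (W1 @ x + b1) + b2

# ...equal one affine transform with W = W2 · W1 and b = W2 · b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(y, W @ x + b)

# Inserting a non-linearity (e.g. ReLU) between the layers breaks this equivalence,
# which is what makes deep stacks genuinely more expressive.
```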

11. Deep learning for text

Treating words as sets

Bag of words with n-grams

“the cat sat on the mat”
⬇️ tokenization (2-grams)
{“the”, “the cat”, “cat”, “cat sat”, “sat”, “sat on”, “on”, “on the”, “the mat”, “mat”}
⬇️ indexing of the X most frequent tokens (e.g., 20,000)
{3, 26, 65, 9, …}
⬇️ multi-hot encoding into a rank-1 vector of 20,000 dimensions

| “cat” | “the mat” | “car” | …19,996 other tokens… | “other” |
| ----- | --------- | ----- | --------------------- | ------- |
| 1     | 1         | 0     | 0s/1s                 | 0       |

⬇️ batching (e.g., inputs.shape: (32, 20000))

N-grams, as opposed to single-word tokenization, allow us to capture a bit of information about word order.
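
This whole pipeline can be sketched with Keras’s TextVectorization layer (the hyperparameters mirror the flow above and are assumptions on my part):

```python
# A sketch of bigram tokenization + multi-hot encoding with Keras.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(
    max_tokens=20000,         # index the 20,000 most frequent tokens
    ngrams=2,                 # unigrams and bigrams: "the", "the cat", ...
    output_mode="multi_hot",  # rank-1 vector of 0s/1s per document
    pad_to_max_tokens=True,   # always output 20,000 columns
)
texts = tf.constant(["the cat sat on the mat", "the dog sat on the mat"])
vectorizer.adapt(texts)        # build the vocabulary (the indexing step)
multi_hot = vectorizer(texts)  # shape: (2, 20000); batching gives e.g. (32, 20000)
```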

Counting with TF-IDF normalization

Instead of just indicating the presence of a token, we can count it and normalize with TF-IDF: “term frequency, inverse document frequency”.

TF-IDF = term frequency in the current document / log(term frequency over the whole dataset)
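
A minimal sketch of that formula in plain Python (the function and variable names are mine, not the book’s):

```python
import math

def tf_idf(term, document, dataset):
    # Term frequency in the current document (document = list of tokens).
    term_freq = document.count(term)
    # Log of the term's frequency over the whole dataset; +1 avoids log(0).
    dataset_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / dataset_freq

dataset = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(tf_idf("the", dataset[0], dataset))  # 1 / log(3) ≈ 0.91
```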

We normalize to counteract the uninformative fact that some words will simply appear more often than others, no matter what the text is about.

We don’t normalize by subtracting the mean and dividing by the standard deviation because that would break the sparsity of the vectors (sparsity reduces compute load and the risk of overfitting compared to a dense vector with the same number of dimensions; the first effect is clear, but I’m not entirely sure about the reason for the second: is it because sparsity acts as a sort of regularization, virtually reducing the number of features to learn?).

Treating words as sequences

Embeddings
Word embeddings

Compared to one-hot encodings, these (see the sketch after the list):

  • Are dense and lower-dimensional.
  • Have geometric relationships that reflect semantic relationships.
  • Need to be learned from data.
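
A minimal sketch of a Keras Embedding layer (vocabulary size and output dimension are assumed values):

```python
# Token indices in, dense trainable vectors out.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None,), dtype="int64")  # a sequence of token indices
embedded = layers.Embedding(input_dim=20000, output_dim=256)(inputs)
# embedded shape: (batch, sequence_length, 256), dense and far smaller than a
# 20,000-dimensional one-hot encoding; the embedding weights are learned from data.
```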

Interestingly:

some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector allows us to go from cat to tiger and from dog to wolf: this vector could be interpreted as the “from pet to wild animal” vector. Similarly, another vector lets us go from dog to cat and from wolf to tiger, which could be interpreted as a “from canine to feline” vector. (p. 330)
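
A toy illustration of that idea with made-up 2-D vectors (real embeddings are much higher-dimensional and only approximately satisfy such relations):

```python
# Made-up 2-D "embeddings" where the same vector encodes the same semantic shift.
import numpy as np

emb = {
    "cat":   np.array([1.0, 1.0]),
    "tiger": np.array([1.0, 3.0]),
    "dog":   np.array([3.0, 1.0]),
    "wolf":  np.array([3.0, 3.0]),
}

pet_to_wild = emb["tiger"] - emb["cat"]     # the "from pet to wild animal" vector
assert np.allclose(emb["dog"] + pet_to_wild, emb["wolf"])

canine_to_feline = emb["cat"] - emb["dog"]  # the "from canine to feline" vector
assert np.allclose(emb["wolf"] + canine_to_feline, emb["tiger"])
```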

Next: how does the embedding layer work? See RNN W2L04 : Embedding matrix - YouTube

Context-aware word embeddings
Sentence embeddings