Word embeddings: From the ground up
Intended Audience: Anyone who’s interested in machine learning for text processing and doesn’t mind a little math.
TL;DR: A from-scratch introduction to embeddings, the meat and potatoes of modern NLP.
Notebook version available here.
The last 5 or so years have seen huge leaps in machine learning for language processing. We can feel it in the quality of services like Google Translate or text prediction engines like autocomplete, but the big improvements have been in methods further down the technological stack.
Word Embeddings
Word embeddings form the basis of many of the advances in Natural Language Processing (NLP). Embeddings map text data to numerical vectors, which opens up a huge mathematical toolbox for us to play with. The classic magic trick to impress this on people is to do algebra on words, like this:
(Paris − France) + Russia = X

And X should equal Moscow.[footnote]Look at the accompanying notebook to see this in practice[/footnote] But how do we get from text data to vectors of numbers representing the “meaning” of the words?
Concept 1: Bag of words
To follow along, let’s start with a dataset. This CSV file contains user reviews for a few apps on the Apple App Store. [footnote]In the project folder of the notebook, you can find a tool to generate similar datasets using the Apple App Store’s JSON API[/footnote]
The simplest way to transform the reviews into numerical data is to use the bag-of-words trick. To do this, we make a table where each review is a row, and each column represents one of the individual words that appeared across all the reviews:
| Review | Word 1 | Word 2 | … | Word 8734763 |
|---|---|---|---|---|
| Review 1 | 1 | 0 | … | 0 |
| Review 2 | 0 | 1 | … | 0 |
| … | … | … | … | … |
| Review 198273 | 0 | 1 | … | 0 |
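The table above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up three-review dataset, not the app store CSV:

```python
# A minimal bag-of-words sketch on a toy dataset (not the app store reviews).
reviews = [
    "i love this app",
    "this app is terrible",
    "love love love it",
]

# The vocabulary: one column per distinct word seen in any review.
vocab = sorted({word for review in reviews for word in review.split()})

# One row per review; each cell counts how often that word appears.
bag_of_words = [
    [review.split().count(word) for word in vocab]
    for review in reviews
]

print(vocab)         # ['app', 'i', 'is', 'it', 'love', 'terrible', 'this']
print(bag_of_words)  # e.g. first row: [1, 1, 0, 0, 1, 0, 1]
```

In practice you would use a sparse matrix (as in the notebook) rather than nested lists, since most cells are zero.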
Obviously, this has a few drawbacks.
First, if you want to evaluate a new review that contains a word you haven’t encountered before, the model fails.[footnote]This matters given that app store users find creative new spellings for all common words[/footnote] Second, this tends to produce absolutely huge matrices which become a pain to work with.[footnote]Notice I use sparse matrices and raw linear algebra to model with bag of words in the notebook; this is a common workaround[/footnote] Third, there’s no concept of word importance – the word “the” is as important as the word “love” or “terrible”.
Concept 2: Word representations
The big problem with bag of words is that the representation is tied to our data. We want to represent the words themselves. The solution is to assume that words appearing together in a sentence are somehow related. This is called the distributional hypothesis, and it is the workhorse of word embeddings.
There are many ways to exploit the distributional hypothesis to generate embeddings (popular ones are word2vec and GloVe), but today we want to build them ourselves with the absolute simplest method. This can be done in two steps:

Build a symmetric matrix whose rows and columns represent individual words, and whose entries count how often each pair of words appeared together in a sentence in our data. This is called the co-occurrence matrix.

Compress the columns of the matrix down to a much lower number of dimensions.
Each row of the resulting compressed matrix is a word’s numerical vector embedding. The compressed columns don’t really “mean” anything anymore; they just specify the point where the word lives in the abstract high-dimensional space we created.
Building the co-occurrence matrix is simple if we already have the bag-of-words matrix. If you remember how matrix multiplication works, you can observe that X'X (the bag-of-words matrix multiplied by its own transpose) creates the co-occurrence matrix from the bag-of-words matrix.
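To see why X'X works, here is a small numpy sketch using a made-up binary bag-of-words matrix (1 if the word appears in the sentence, 0 otherwise):

```python
import numpy as np

# Toy binary bag-of-words matrix: rows are sentences, columns are words.
# Assumed vocabulary order: "app", "love", "terrible".
X = np.array([
    [1, 1, 0],   # "love app"
    [1, 0, 1],   # "terrible app"
    [1, 1, 0],   # "love app" again
])

# X'X: entry (i, j) counts how many sentences contain both word i and word j.
cooccurrence = X.T @ X
print(cooccurrence)
```

Each off-diagonal entry counts sentences where the two words co-occur, and the result is symmetric by construction.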
Using this, one simple way to compress is Principal Component Analysis (PCA), though really, any dimensionality reduction method works for us.
Attentive readers might notice that the diagonal of the co-occurrence matrix tells us how common each word is in our text corpus. We can exploit this to increase the quality of the embeddings in a few ways before compression, notably by using it for tf-idf weighting.
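As a hedged sketch of that idea: with a binary bag-of-words matrix, the diagonal holds each word’s sentence frequency, which we can turn into an idf-style down-weighting of common words. The matrix and sentence count below are made up for illustration:

```python
import numpy as np

# Toy symmetric co-occurrence matrix; with a binary bag-of-words matrix,
# the diagonal holds each word's sentence frequency in the corpus.
cooccurrence = np.array([
    [50, 10,  2],   # a very common word (think "the")
    [10, 12,  4],
    [ 2,  4,  6],
], dtype=float)

n_sentences = 60  # assumed total number of sentences in the corpus

# Down-weight common words with an idf-style factor built from the diagonal.
document_frequency = np.diag(cooccurrence)
idf = np.log(n_sentences / document_frequency)
weighted = cooccurrence * idf  # scales column j by word j's idf

print(idf.round(2))
```

The common word’s idf comes out near zero, so its columns contribute little to the compressed embeddings.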
… And we’re done! We’ve produced (admittedly bad) vector embeddings for words!
What’s next?
First, we can explain the magic trick at the start of the post. We actually perform the algebra on the words (e.g. take the embedding of “Paris”, subtract the vector representing “France”, then add “Russia”). Then we look for the closest vector in our database of word embeddings and call that the result. NLP practitioners tend to use cosine distance – basically the angle between the two vectors – to measure how far apart two embeddings are. Efficiently searching the huge list of word vectors is a well-known problem, with the simple solution being a k-d tree[footnote]Protip: if you normalize the embedding vectors, ranking by Euclidean distance is equivalent to ranking by cosine distance, which avoids using the more complex Ball Tree method[/footnote], but there are fancier methods.
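The whole trick fits in a few lines. The embedding values below are hypothetical 3-dimensional toy vectors, not real pretrained embeddings, and a brute-force search stands in for the k-d tree:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a tiny vocabulary.
embeddings = {
    "paris":  np.array([0.9, 0.1, 0.0]),
    "france": np.array([0.8, 0.0, 0.1]),
    "russia": np.array([0.1, 0.9, 0.1]),
    "moscow": np.array([0.2, 1.0, 0.0]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# (Paris - France) + Russia, then find the closest word by cosine similarity.
query = embeddings["paris"] - embeddings["france"] + embeddings["russia"]
best = max(embeddings, key=lambda w: cosine_similarity(embeddings[w], query))
print(best)  # moscow
```

In practice you would also exclude the three query words from the candidate set, since the query vector often lands closest to one of its own inputs.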
The most common use of embeddings is to download a set pretrained on a huge text corpus and plug them into a model – you can now add text data to any normal statistical model with ease!
More advanced embeddings use full-on models to create embeddings for words in their context, but for most tasks, the increase in precision isn’t worth the huge added complexity.
To embed sentences instead of words, the simple method is to sum, or take the average of, the vectors of the words composing the sentence. You can do the same with sentence embeddings to create paragraph embeddings, and so on. I find it generally unnecessary to use models to generate sentence embeddings – this simple method is extremely difficult to beat, and it creates sentence embeddings from nothing more than linear combinations of the word embeddings.
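The averaging method is a one-liner. The word vectors here are made-up 2-dimensional toys so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical 2-dimensional word embeddings for readability.
embeddings = {
    "great": np.array([1.0, 0.0]),
    "app":   np.array([0.0, 1.0]),
    "love":  np.array([1.0, 1.0]),
}

def embed_sentence(sentence):
    """Average the word vectors; unknown words are simply skipped."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vectors, axis=0)

print(embed_sentence("love great app"))  # the mean of the three vectors
```

Skipping unknown words is one of the design choices you get to make here; other options include mapping them to a shared “unknown” vector.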
If you’re interested in using more NLP and know some Python, I highly recommend this free course.