Term relations with Word2Vec

In this example, we will analyze movie discussions and create a trivial synonym engine. Our engine is based on Word2Vec, a family of shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. In essence, Word2Vec learns a vector for each word such that words used in similar contexts end up close together, which lets it capture semantic relationships among words.

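We will be using the Spark machine learning package to implement our synonyms service. Spark Machine Learning comes in two flavours: MLlib, the original RDD-based API (package org.apache.spark.mllib), and SparkML, the newer Dataframe-based API (package org.apache.spark.ml).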

Since Spark 2.0, MLlib is in maintenance mode, meaning that no new features are implemented for it, so it should be avoided for new projects. Be aware, however, that some features of MLlib are yet to be ported to SparkML, and that the documentation is better for MLlib.

For the remainder of the tutorial, we will be using the SparkML variant.

The dataset we will be using comes from Kaggle; the full dataset is available at this location.

Loading the data

We load the data as an RDD. As the data contains HTML code, we need to clear it out. We also need to remove punctuation marks and lowercase all our words. This makes our input vocabulary much smaller, and therefore Word2Vec will not need overly large vectors.
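The cleaning steps described above can be sketched as a small, Spark-independent function. The name clean and the regular expressions are illustrative assumptions rather than the original notebook code; note that this simple version also splits contractions such as "don't" into "don" and "t".

```scala
// Hypothetical helper: strips HTML tags and punctuation, lowercases, and tokenises.
def clean(review: String): Array[String] =
  review
    .replaceAll("<[^>]*>", " ")       // drop HTML tags such as <br />
    .toLowerCase
    .replaceAll("[^a-z0-9\\s]", " ")  // replace punctuation and other symbols with a space
    .trim
    .split("\\s+")                    // split on runs of whitespace
    .filter(_.nonEmpty)               // an empty review yields an empty array

// With a SparkContext `sc` in scope, it could be applied to every line, e.g.:
//   val reviews = sc.textFile("path/to/reviews").map(clean)   // the path is a placeholder
```

Doing the cleaning in a plain function keeps it easy to test before it is mapped over the RDD.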

Let's check what our raw data looks like

Converting the data to a Dataframe

Since SparkML is based on Dataframes, we need to convert our source RDD to a suitable Dataframe. To do so, we first create a schema, consisting of a Sequence of fields that contain Arrays of Strings :-)

Remember that Word2Vec consumes each document as a list of words; in Spark, that list is represented as an Array of Strings.
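As a rough sketch of the conversion, assuming a local SparkSession, a toy RDD standing in for the cleaned reviews, and a column name ("text") chosen only for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Local session for experimentation; in a notebook, `spark` usually exists already.
val spark = SparkSession.builder().master("local[*]").appName("word2vec-sketch").getOrCreate()

// Toy stand-in for the cleaned RDD of tokenised reviews.
val reviews = spark.sparkContext.parallelize(Seq(
  Array("great", "movie", "with", "a", "great", "cast"),
  Array("the", "plot", "was", "thin")
))

// A single field, "text" (an assumed name), holding an array of strings.
val schema = StructType(Seq(StructField("text", ArrayType(StringType), nullable = false)))

// Each review becomes a one-column Row; toSeq gives Spark a Seq for the array column.
val df = spark.createDataFrame(reviews.map(tokens => Row(tokens.toSeq)), schema)
df.printSchema()
```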

Removing stopwords

In the dataframe above, many words repeat constantly: think, for example, of articles ('a', 'the') and prepositions ('at', 'on', 'in'). Those words do not add much information to our dataset. You can get an intuitive understanding of this by removing those words from everyday sentences: for example, "a cat is under the table" can be shortened to "cat is under table" or even to "cat is table", and the reader still gets the idea.

To increase the information density of our vectors, we can remove stopwords with the StopWordsRemover transformer. We do so in a non-destructive manner: we add a new column to our Dataframe in which the contents of our input text have been processed to remove stopwords.
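A minimal sketch of this step, using the cat sentence from above as toy input. The column names "text" and "filtered" are assumptions, and the transformer is left on its default English stopword list.

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("stopwords-sketch").getOrCreate()
import spark.implicits._

// One tokenised sentence in an array column named "text" (assumed name).
val df = Seq(
  Tuple1(Seq("a", "cat", "is", "under", "the", "table"))
).toDF("text")

// Uses Spark's default English stopword list.
val remover = new StopWordsRemover()
  .setInputCol("text")
  .setOutputCol("filtered")

// Non-destructive: "text" is kept and a new "filtered" column is added next to it.
val filtered = remover.transform(df)
filtered.show(false)
```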

Training the model

We are now ready to train our model!

To exclude the long tail of words that do not appear frequently, we remove words with fewer than 10 appearances in our dataset.
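The training step might look roughly like the sketch below. The toy corpus, the column names, the vector size and the seed are all assumptions made for illustration; only the minimum count of 10 comes from the text above.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("training-sketch").getOrCreate()
import spark.implicits._

// Toy corpus: three sentences repeated 20 times so that their words clear the threshold,
// plus one sentence containing a rare word ("obscure") that appears only once.
val sentences = Seq(
  "great movie great cast",
  "boring plot weak cast",
  "great plot great movie"
).map(_.split(" ").toSeq)
val corpus = (Seq.fill(20)(sentences).flatten :+ Seq("obscure", "movie"))
  .map(Tuple1.apply)
  .toDF("filtered")

val word2Vec = new Word2Vec()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVectorSize(50)   // assumed size; tune for the real dataset
  .setMinCount(10)     // drop words with fewer than 10 appearances
  .setSeed(42L)        // fixed seed, only to make reruns comparable

val model = word2Vec.fit(corpus)

// The learned vocabulary and vectors; "obscure" should be absent.
model.getVectors.show(5, false)
```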

Out of the box, the Word2Vec API lets us look up related terms for a single word. Let's give it a try:
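A small sketch of the single-word lookup, assuming a fitted Word2VecModel is passed in. The helper name relatedTo is hypothetical. As far as we know, Spark raises an exception for a word that is not in the vocabulary, so the call is wrapped in Try and returns None in that case.

```scala
import scala.util.Try
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.DataFrame

// Hypothetical helper: the n terms closest to `word`, or None if the lookup fails
// (for example because the word was filtered out of the vocabulary).
def relatedTo(model: Word2VecModel, word: String, n: Int = 10): Option[DataFrame] =
  Try(model.findSynonyms(word.toLowerCase, n)).toOption  // columns: "word", "similarity"
```

With a trained model in scope, relatedTo(model, "someword").foreach(_.show(false)) prints the related terms, where "someword" stands for any word of interest.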

What we see is that Word2Vec managed to uncover some related terms given a popular name in the dataset. What is more interesting, however, is to see whether we can extract meaningful terms with respect to a provided phrase. For this, we need to use Word2Vec's findSynonyms(s: Vector) function.

To do so, we first define a function toDF that converts an input string to a vector representation suitable for searching; this basically just tokenizes an input string and converts it to a Spark Dataframe (hence the name).

We then call the transform method on the created Dataframe; this converts our Dataframe to a vector representation using the same vocabulary as our corpus.

To automate the steps above, we create a method that takes a query (as String) and prints the 10 most relevant terms in our model, excluding terms that are included in the query itself.
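The steps above could be sketched as follows. The function names tokenize and relatedTerms are hypothetical, toDF takes the session and column name as explicit parameters for self-containment, and the method returns a Dataframe (to be printed with show) rather than printing directly. Because a vector-based lookup can return the query's own words, we ask for a few extra candidates and filter them out.

```scala
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Lowercase and split on whitespace, mirroring the corpus preprocessing.
def tokenize(query: String): Seq[String] =
  query.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

// Wrap the tokenised query in a one-row Dataframe with the model's input column.
def toDF(spark: SparkSession, query: String, column: String): DataFrame = {
  import spark.implicits._
  Seq(Tuple1(tokenize(query))).toDF(column)
}

// The n terms closest to the query phrase, excluding the query's own words.
def relatedTerms(spark: SparkSession, model: Word2VecModel, query: String, n: Int = 10): DataFrame = {
  val tokens = tokenize(query).distinct
  // transform() averages the vectors of the query words found in the vocabulary.
  val queryVector = model
    .transform(toDF(spark, query, model.getInputCol))
    .select(model.getOutputCol)
    .head()
    .getAs[Vector](0)
  model
    .findSynonyms(queryVector, n + tokens.size)  // extra candidates to survive the filter
    .filter(!col("word").isin(tokens: _*))
    .limit(n)
}
```

With a session and a trained model in scope, relatedTerms(spark, model, "some phrase").show(false) prints the result.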

Checking analogies

One of the nice side effects of being able to uncover latent meanings with tools like Word2Vec is being able to solve analogy problems. In the original Word2Vec paper, the authors show that, when trained on a sufficiently large corpus (billions of items), Word2Vec models can uncover relationships such as:

v(king) - v(man) + v(woman) ≈ v(queen)

or, put otherwise: man is to king what woman is to queen (the two pairs differ only by gender). This works simply by performing algebraic vector operations on the vector representations of words.

To check whether our model can uncover such relationships as well, we first implement a few simple vector operations.
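A minimal sketch of such operations on plain arrays of doubles. The function names are assumptions; Spark's own Vector type can be converted with toArray before these helpers are used.

```scala
// Element-wise sum of two vectors of equal length.
def add(a: Array[Double], b: Array[Double]): Array[Double] = {
  require(a.length == b.length, "vectors must have the same length")
  a.zip(b).map { case (x, y) => x + y }
}

// Element-wise difference of two vectors of equal length.
def subtract(a: Array[Double], b: Array[Double]): Array[Double] = {
  require(a.length == b.length, "vectors must have the same length")
  a.zip(b).map { case (x, y) => x - y }
}

// Euclidean (L2) norm.
def norm(a: Array[Double]): Double =
  math.sqrt(a.map(x => x * x).sum)

// Euclidean distance between two vectors.
def distance(a: Array[Double], b: Array[Double]): Double =
  norm(subtract(a, b))
```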

Then, we implement our analogy function; it takes two pairs of terms and returns the Euclidean distance between the difference vectors of the two pairs (the smaller the distance, the better the analogy holds):
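One possible shape for this function is sketched below. It is an assumption, not the original notebook code: word vectors are looked up in a plain Map (which could be built from the model's getVectors output), and the hand-made two-dimensional vectors exist only to make the example checkable.

```scala
// Element-wise difference of two vectors of equal length.
def subtract(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x - y }

// Euclidean distance between (v(a) - v(b)) and (v(c) - v(d)).
// Returns None if any of the four words is missing from the vocabulary.
def analogy(vectors: Map[String, Array[Double]])(a: String, b: String, c: String, d: String): Option[Double] =
  for {
    va <- vectors.get(a)
    vb <- vectors.get(b)
    vc <- vectors.get(c)
    vd <- vectors.get(d)
  } yield {
    val diff = subtract(subtract(va, vb), subtract(vc, vd))
    math.sqrt(diff.map(x => x * x).sum)
  }

// Hand-made toy vectors: the first axis encodes "male", the second "royal".
val toy = Map(
  "king"  -> Array(1.0, 1.0),
  "man"   -> Array(1.0, 0.0),
  "queen" -> Array(0.0, 1.0),
  "woman" -> Array(0.0, 0.0),
  "table" -> Array(5.0, 3.0)
)

val good = analogy(toy)("king", "man", "queen", "woman")  // both differences are (0, 1)
val bad  = analogy(toy)("king", "man", "table", "woman")  // differences diverge
```

In the toy data the first analogy has distance zero because both pairs differ by exactly the same vector; with a real model, we would only expect the distance to be small relative to unrelated pairs.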