NLP with gensim (word2vec)

NLP (Natural Language Processing) has been a fast-developing research field in recent years, pushed forward in particular by Google, which depends on NLP technologies for managing its vast repositories of text content.

In this study unit we give a simple introduction to this field using Radim Rehurek's excellent gensim Python package and his word2vec tutorial:
https://rare-technologies.com/word2vec-tutorial.

We have followed Radim's code with some supplements and additional examples, and adapted it to an IMDB movie review dataset from Cornell University:
https://www.cs.cornell.edu/people/pabo/movie-review-data

You may download this dataset more conveniently from here:
http://www.samyzaf.com/ML/nlp/aclImdb.zip
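
If you prefer to fetch it programmatically, here is a minimal sketch (assuming the URL above is reachable and that the archive unpacks into an aclImdb directory next to the notebook):

import os
import urllib.request
import zipfile

url = "http://www.samyzaf.com/ML/nlp/aclImdb.zip"
if not os.path.isdir("aclImdb"):
    urllib.request.urlretrieve(url, "aclImdb.zip")   # download the archive
    with zipfile.ZipFile("aclImdb.zip") as zf:
        zf.extractall(".")                           # unpack into ./aclImdb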

Load packages

In [2]:
# CSS/HTML styles for nicer-looking IPython notebooks
from IPython.core.display import HTML
css = open('c:/ml/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
Out[2]:
In [1]:
# -*- coding: utf-8 -*-
import gensim
import logging
import os
import string
import nltk.data
import numpy as np   # used later for vector algebra on the embeddings
%matplotlib inline

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

print ("PACKAGES LOADED")
C:\Anaconda2\envs\tensorflow-gpu\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
PACKAGES LOADED

The following class defines a Python generator which walks all files (recursively) in a given directory and yields their sentences one at a time (thus saving loads of memory).

In [2]:
class SentGen(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # walk the directory tree and yield one tokenized sentence at a time
        for path, dirs, files in os.walk(self.dirname):
            for fname in files:
                for line in get_sentences(os.path.join(path, fname)):
                    yield line.split()

# sentence tokenizer and punctuation-stripping table, loaded once
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
trans_table = dict((ord(char), None) for char in string.punctuation)

def get_sentences(fname):
    with open(fname, 'r', encoding="utf-8") as fp:
        data = fp.read()
    for sent in tokenizer.tokenize(data):
        yield sent.translate(trans_table)    # drop punctuation characters
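
Before handing the generator to gensim, it can be useful to peek at a few of the tokenized sentences it yields. A quick sanity-check sketch (assuming the dataset sits in the aclImdb directory):

from itertools import islice
for sent in islice(SentGen('aclImdb'), 3):
    print(sent[:12])   # first 12 tokens of each of the first 3 sentences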

Create an empty gensim model, no training yet

In [3]:
model = gensim.models.Word2Vec(iter=1, min_count=5)

Build a vocabulary

In [4]:
model.build_vocab(SentGen('aclImdb'), progress_per=200000)
2017-01-28 21:05:25,995 : INFO : collecting all words and their counts
2017-01-28 21:05:26,011 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-28 21:05:40,775 : INFO : PROGRESS: at sentence #200000, processed 4235964 words, keeping 117989 word types
2017-01-28 21:05:56,507 : INFO : PROGRESS: at sentence #400000, processed 8470628 words, keeping 177074 word types
2017-01-28 21:06:11,745 : INFO : PROGRESS: at sentence #600000, processed 12910549 words, keeping 229396 word types
2017-01-28 21:06:27,683 : INFO : PROGRESS: at sentence #800000, processed 17357138 words, keeping 275456 word types
2017-01-28 21:06:42,434 : INFO : PROGRESS: at sentence #1000000, processed 21567523 words, keeping 315310 word types
2017-01-28 21:06:48,070 : INFO : collected 329953 word types from a corpus of 23197079 raw words and 1074524 sentences
2017-01-28 21:06:48,070 : INFO : Loading a fresh vocabulary
2017-01-28 21:06:48,355 : INFO : min_count=5 retains 75839 unique words (22% of original 329953, drops 254114)
2017-01-28 21:06:48,356 : INFO : min_count=5 leaves 22838567 word corpus (98% of original 23197079, drops 358512)
2017-01-28 21:06:48,531 : INFO : deleting the raw counts dictionary of 329953 items
2017-01-28 21:06:48,547 : INFO : sample=0.001 downsamples 45 most-common words
2017-01-28 21:06:48,548 : INFO : downsampling leaves estimated 17632093 word corpus (77.2% of prior 22838567)
2017-01-28 21:06:48,549 : INFO : estimated required memory for 75839 words and 100 dimensions: 98590700 bytes
2017-01-28 21:06:48,787 : INFO : resetting layer weights

Training the model

In [5]:
model.train(SentGen('aclImdb'), report_delay=8.0)
2017-01-28 21:08:30,135 : INFO : training model with 3 workers on 75839 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-01-28 21:08:30,136 : INFO : expecting 1074524 sentences, matching count from corpus used for vocabulary survey
2017-01-28 21:08:31,200 : INFO : PROGRESS: at 1.13% examples, 187779 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:08:39,211 : INFO : PROGRESS: at 10.64% examples, 199903 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:08:47,224 : INFO : PROGRESS: at 19.88% examples, 202659 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:08:55,228 : INFO : PROGRESS: at 29.10% examples, 202096 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:03,237 : INFO : PROGRESS: at 38.55% examples, 202158 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:11,254 : INFO : PROGRESS: at 47.66% examples, 203058 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:19,273 : INFO : PROGRESS: at 56.76% examples, 203384 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:27,291 : INFO : PROGRESS: at 65.92% examples, 203928 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:35,299 : INFO : PROGRESS: at 75.03% examples, 204065 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:43,313 : INFO : PROGRESS: at 84.45% examples, 203859 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:51,339 : INFO : PROGRESS: at 93.83% examples, 203649 words/s, in_qsize 0, out_qsize 0
2017-01-28 21:09:56,800 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-01-28 21:09:56,801 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-01-28 21:09:56,813 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-01-28 21:09:56,815 : INFO : training on 23197079 raw words (17631244 effective words) took 86.6s, 203531 effective words/s
Out[5]:
17631244

Model Saving

In [6]:
model.save('aclImdb.model')
2017-01-28 21:10:02,591 : INFO : saving Word2Vec object under aclImdb.model, separately None
2017-01-28 21:10:02,592 : INFO : not storing attribute syn0norm
2017-01-28 21:10:02,593 : INFO : not storing attribute cum_table
2017-01-28 21:10:03,403 : INFO : saved aclImdb.model

Model Loading

Once a model has been saved, it can later be loaded back into memory and trained on more sentences:

In [7]:
# model = gensim.models.Word2Vec.load('aclImdb.model')
# model.train(more_sentences)  # more_sentences: any iterable of tokenized sentences

Using the model

How many words does our model have? We simply need to check the size of our vocabulary with the Python len function:

In [8]:
len(model.wv.vocab)
Out[8]:
75839

Is the word 'women' in our vocabulary?

In [9]:
'women' in model.wv.vocab
Out[9]:
True

Getting the list of all words in our vocabulary is easy. We can even sort them with the Python sorted function. We can also print and save them to a file.

In [10]:
words = sorted(model.wv.vocab.keys())
In [11]:
print("Number of words:", len(words))
Number of words: 75839
In [15]:
# Save words to file: words.txt
fp = open("words.txt", "w", encoding="utf-8")
for word in words:
    fp.write(word + '\n')
fp.close()
In [16]:
print(words[1500:1550])  # print 50 words, from index 1500 to 1549
['Afterschool', 'Aftershocks', 'Afterward', 'Afterwards', 'Afterwords', 'Aga', 'Agador', 'Again', 'Against', 'Agamemnon', 'Agar', 'Agars', 'Agashe', 'Agatha', 'Age', 'Aged', 'Agency', 'Agenda', 'Agent', 'Agents', 'Ages', 'Agey', 'Aggie', 'Agi', 'Aging', 'Agnes', 'Agnew', 'Agnieszka', 'Agnihotri', 'Agnus', 'Agnès', 'Ago', 'Agostino', 'Agrabah', 'Agrade', 'Agree', 'Agreed', 'Agreement', 'Agren', 'Agrippina', 'Agro', 'Aguilar', 'Aguirre', 'Agustin', 'Agutter', 'Agutters', 'Ah', 'Aha', 'Ahab', 'Ahead']

Word similarity

One way to check model quality is to verify that it reports a high level of similarity between two semantically (or syntactically) related words. Semantically similar words are expected to lie near each other in our vector space. The gensim model.similarity method measures this proximity: it returns the cosine similarity between the two word vectors, a number between -1 and 1, where higher values mean the words are closer.

However, keep in mind that our text corpus is relatively small (about 340MB of text, with a vocabulary of only about 75K words), so our vector space is not expected to be fully adequate.
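
Under the hood, model.similarity is just the cosine similarity between the two word vectors. As a sanity check, here is a small sketch that recomputes it directly with numpy (it should closely match the values below, up to floating-point rounding):

def cosine(u, v):
    # cosine similarity: dot product of the two vectors divided by their norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(model['woman'], model['man']))   # compare with model.similarity('woman', 'man')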

In [17]:
model.similarity('woman', 'man')
Out[17]:
0.87441004776040909
In [18]:
model.similarity('cat', 'dog')
Out[18]:
0.8850999746465289
In [19]:
model.similarity('paris', 'train')  # low similarity
Out[19]:
0.25187103395420385
In [20]:
model.similarity('king', 'prince')
Out[20]:
0.82048178119674831
In [21]:
model.similarity('king', 'queen')
Out[21]:
0.8018559473097695

Unmatching word game

Another way to test whether our word2vec model faithfully reflects the structure of our text corpus is to check if it can separate a group of words into subgroups of related words. The gensim doesnt_match method accepts a group of words and reports which word in the group does not match the others. A few examples will make this clearer:
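
Conceptually, one simple way to spot the odd word out is to average the normalized word vectors of the group and return the word least similar to that mean. The sketch below illustrates the idea (it approximates what doesnt_match computes, but it is not gensim's reference implementation):

def odd_one_out(words):
    # normalize each word vector, average them, and return the word
    # whose vector has the lowest cosine similarity to the group mean
    vecs = np.array([model[w] / np.linalg.norm(model[w]) for w in words])
    mean = vecs.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = vecs.dot(mean)
    return words[int(np.argmin(sims))]

print(odd_one_out(['breakfast', 'cereal', 'dinner', 'lunch']))   # expect 'cereal'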

In [22]:
def get_unmatching_word(words):
    for word in words:
        if not word in model.wv.vocab:
            print("Word is not in vocabulary:", word)
            return None
    return model.wv.doesnt_match(words)
In [23]:
get_unmatching_word(['breakfast', 'cereal', 'dinner', 'lunch'])
2017-01-28 21:15:37,728 : INFO : precomputing L2-norms of word weight vectors
Out[23]:
'cereal'
In [24]:
get_unmatching_word(['saturday', 'sunday', 'friday', 'spoon', 'weekday'])
Out[24]:
'spoon'
In [25]:
get_unmatching_word(['king', 'queen', 'prince', 'fork', 'castle'])
Out[25]:
'fork'

Semantic Relations in Vector Space
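
The most_similar method adds the vectors of the words in its positive list, subtracts the vectors of the words in its negative list, and returns the vocabulary words whose vectors are closest (by cosine similarity) to the result, excluding the query words themselves. A rough, slow sketch of that computation (just to illustrate the idea, not gensim's actual implementation):

def most_similar_sketch(positive, negative, topn=5):
    # build the target vector: +1 for each positive word, -1 for each negative word
    target = np.sum([model[w] / np.linalg.norm(model[w]) for w in positive], axis=0)
    target -= np.sum([model[w] / np.linalg.norm(model[w]) for w in negative], axis=0)
    target /= np.linalg.norm(target)
    scores = []
    for word in model.wv.vocab:
        if word in positive or word in negative:
            continue                                   # skip the query words
        v = model[word]
        scores.append((word, float(np.dot(v, target) / np.linalg.norm(v))))
    return sorted(scores, key=lambda s: -s[1])[:topn]

most_similar_sketch(['king', 'man'], ['queen'], topn=6)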

In [27]:
# The word 'woman' comes out 6th as the most similar
model.most_similar(positive=['king', 'man'], negative=['queen'], topn=6)
Out[27]:
[('person', 0.7300724387168884),
 ('guy', 0.6826455593109131),
 ('soldier', 0.6546259522438049),
 ('killer', 0.6529239416122437),
 ('boy', 0.6404286623001099),
 ('woman', 0.6257007122039795)]
In [32]:
# The word 'higher' comes out 1st as the most similar, but also notice the other words ...
model.most_similar(positive=['low', 'lower'], negative=['high'], topn=10)
Out[32]:
[('higher', 0.7889082431793213),
 ('funnier', 0.6868950128555298),
 ('greater', 0.6742820739746094),
 ('More', 0.6719484329223633),
 ('bigger', 0.665863573551178),
 ('cheaper', 0.6584792137145996),
 ('dumber', 0.6579981446266174),
 ('quicker', 0.6447793245315552),
 ('harder', 0.6422368288040161),
 ('scarier', 0.6412309408187866)]
In [33]:
# The word 'England' comes out 6th as the most similar
model.most_similar(positive=['Paris', 'France'], negative=['London'], topn=10)
Out[33]:
[('Italy', 0.8492923974990845),
 ('Germany', 0.8089585304260254),
 ('Spain', 0.7912108302116394),
 ('Australia', 0.7890660762786865),
 ('India', 0.7824097871780396),
 ('England', 0.7786306142807007),
 ('Japan', 0.7783803939819336),
 ('Canada', 0.7750258445739746),
 ('Africa', 0.7709455490112305),
 ('Mexico', 0.7674878239631653)]
In [34]:
# The word 'Italy' comes out 1st as the most similar
model.most_similar(positive=['Paris', 'France'], negative=['Rome'], topn=10)
Out[34]:
[('Italy', 0.7843865752220154),
 ('Germany', 0.7800211906433105),
 ('England', 0.7736520171165466),
 ('Japan', 0.7658747434616089),
 ('America', 0.7626687288284302),
 ('Africa', 0.7617301940917969),
 ('Europe', 0.7588669061660767),
 ('London', 0.7534858584403992),
 ('India', 0.7437407970428467),
 ('Mexico', 0.7425971031188965)]
In [35]:
# The word 'daughter' comes out 3rd as the most similar
model.most_similar(positive=['father', 'son'], negative=['mother'], topn=10)
Out[35]:
[('brother', 0.9043227434158325),
 ('wife', 0.8852354288101196),
 ('daughter', 0.8783684968948364),
 ('sister', 0.8609746694564819),
 ('girlfriend', 0.8486406803131104),
 ('dad', 0.8320426940917969),
 ('uncle', 0.8199210166931152),
 ('grandfather', 0.8196334838867188),
 ('husband', 0.8182598352432251),
 ('partner', 0.8066145181655884)]
In [36]:
# The word 'boy' comes out 1st as the most similar
model.most_similar(positive=['father', 'girl'], negative=['mother'], topn=10)
Out[36]:
[('boy', 0.9240204095840454),
 ('woman', 0.8523120880126953),
 ('dog', 0.843636155128479),
 ('lady', 0.8371353149414062),
 ('man', 0.8356690406799316),
 ('doctor', 0.8223854899406433),
 ('soldier', 0.7872583270072937),
 ('kid', 0.7855633497238159),
 ('guy', 0.7785974740982056),
 ('priest', 0.7694289088249207)]
In [47]:
# The word 'cats' comes out 9th as the most similar (still, out of 75000 words ...)
model.most_similar(positive=['dog', 'dogs'], negative=['cat'], topn=10)
Out[47]:
[('parents', 0.8059244751930237),
 ('neighbors', 0.773928165435791),
 ('bodies', 0.7607775330543518),
 ('men', 0.7583969831466675),
 ('cops', 0.7558149695396423),
 ('aliens', 0.7505661249160767),
 ('criminals', 0.7494717836380005),
 ('boys', 0.7471839189529419),
 ('cats', 0.7450538277626038),
 ('sons', 0.7388671636581421)]

The significance of these similarities is that the word2vec embedding reflects, to some extent, the semantic and syntactic structure of the text corpus. In this case we have fewer than 76K words, our text corpus is not large enough, and our vectors are too short: only 100 components (according to the research literature, you need vectors of roughly 300 to 500 dimensions to start getting accurate results). So this kind of similarity is a bit loose in our IMDB test case (many analogous words are missing or occur only a small number of times in the texts). On the web you can find much larger text corpora, with millions of words, in which this sort of vector algebra matches the semantic and syntactic structure much more closely.
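
If you do have a larger corpus, you can ask gensim for longer vectors by passing the size parameter when the model is created. A sketch with illustrative settings (the names and the saved file name are ours, and training will of course take considerably longer):

# same pipeline as above, but with 300-dimensional vectors and more training passes
big_model = gensim.models.Word2Vec(size=300, window=5, min_count=5, iter=5)
big_model.build_vocab(SentGen('aclImdb'))
big_model.train(SentGen('aclImdb'))
big_model.save('aclImdb-300.model')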

Accessing the vectors

If you need to access the vector representation of a word like 'king', this is very simple in gensim: the model object can simply be indexed with any word in the vocabulary:

In [50]:
print(model['king'])
[-0.30519533 -0.22001271 -0.19165316  0.54016495  0.14842695  0.3341189
 -0.27410319 -0.14967868  0.30878687  0.07981905  0.02793549 -0.35460964
 -0.28293291 -0.2175625   0.12128358  0.02938185 -0.33233708  0.59154058
  0.10021219 -0.43497828  0.15408696  0.02108467  0.16169284  0.10522096
  0.26249099 -0.24814126  0.62219959 -0.59225398  0.31051928  0.1080844
 -0.42813206  0.06505097  0.05849557 -0.11907697  0.00762035 -0.17970887
  0.37005383 -0.51711005  0.26633534  0.1862317  -0.23608164 -0.20678528
 -0.03484735 -0.0837178   0.35746735 -0.21243036  0.25288612 -0.15052553
 -0.32116109 -0.24989396 -0.20516069 -0.17611451  0.44492254  0.06060296
  0.38339406  0.327519    0.30487871 -0.57657099  0.63259524 -0.48328778
 -0.13733746 -0.24184161  0.51379865 -0.07858841 -0.53337961 -0.09756016
  0.16807282  0.14098307  0.5857296   0.56171465 -0.08030809  0.20924905
 -1.13021338 -0.10474633  0.34811604  0.0907301  -0.01749143  0.20910436
  0.0564938  -0.49025175 -0.01215225 -0.46975541 -0.24771205  0.51404476
 -0.40933108  0.12289272 -0.01927208  0.17445983  0.07979318  0.05067573
  0.0618048  -0.06054457 -0.56369454  0.00837194  0.33415878  0.54157025
  0.60349584  0.24886353 -0.22393683  0.19365866]
In [51]:
print(model['king'].size)  # vector size
100

You can also perform some vector algebra to check various relations between words. The cells below compute the root-mean-square (RMS) of residual vectors such as (king - man) - (queen - woman); the smaller the residual (compared to an unrelated combination of words), the better the analogy holds.

In [52]:
d = (model['king'] - model['man']) - (model['queen'] - model['woman'])
np.sqrt(np.mean(d**2))
Out[52]:
0.59759963
In [53]:
d = (model['king'] - model['man']) - (model['cat'] - model['desk'])
np.sqrt(np.mean(d**2))
Out[53]:
1.4012552
In [54]:
d = (model['he'] - model['his']) - (model['she'] - model['her'])
np.sqrt(np.mean(d**2))
Out[54]:
0.55723441
In [55]:
d = (model['he'] - model['his']) - (model['dog'] - model['cat'])
np.sqrt(np.mean(d**2))
Out[55]:
1.5773257
In [56]:
d = (model['good'] - model['bad']) - (model['strong'] - model['weak'])
np.sqrt(np.mean(d**2))
Out[56]:
0.90884268
In [57]:
d = (model['good'] - model['bad']) - (model['strong'] - model['small'])
np.sqrt(np.mean(d**2))
Out[57]:
1.3109543
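
These comparisons can be wrapped in a small helper; analogy_residual below is just our own convenience name for this sketch, not a gensim API. Lower values mean the analogy a : b :: c : d holds more tightly:

def analogy_residual(a, b, c, d):
    # RMS of the residual (vec(a) - vec(b)) - (vec(c) - vec(d));
    # small values mean the two difference vectors point in nearly the same direction
    diff = (model[a] - model[b]) - (model[c] - model[d])
    return float(np.sqrt(np.mean(diff ** 2)))

print(analogy_residual('king', 'man', 'queen', 'woman'))   # related analogy: small residual
print(analogy_residual('king', 'man', 'cat', 'desk'))      # unrelated words: larger residual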