Neural language model. Word vectors.

In [5]:
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')
In [6]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import gensim.parsing.preprocessing as gp
import nltk
from sklearn import feature_extraction, metrics
from sklearn import naive_bayes, linear_model, svm
from sklearn.preprocessing import Binarizer
from keras import models, layers, utils, callbacks, optimizers
from itertools import chain
import json

A language model defines a probability distribution over the strings of a language. If the strings consist of words (character-level models also exist), then the model essentially defines a probability distribution of the form $P(w_1,w_2,...,w_n)$, where $n$ is the length of the string. A typical model describes the conditional distribution $P(w_i~|~w_1,...,w_{i-1})$; then, by the chain rule, $P(w_1,w_2,...,w_n) = P(w_n~|~w_1,w_2,...,w_{n-1})P(w_1,w_2,...,w_{n-1})$.
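For example, for a toy three-word string this expands recursively into $P(w_1,w_2,w_3) = P(w_3~|~w_1,w_2)\,P(w_2~|~w_1)\,P(w_1)$.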

Markov models of order $k$ simplify the distribution $P(w_i~|~w_1,...,w_{i-1})$ to $P(w_i~|~w_{i-k},...,w_{i-1})$, i.e. the probability of the next word depends only on the $k$ previous words. They are also called n-gram models (with $n = k+1$). A model of order zero is called a unigram model, order one a bigram model, orders two, three and four a trigram, 4-gram and 5-gram model respectively, and so on.
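Under a second-order (trigram) assumption, for example, the full factorization is approximated as $P(w_1,w_2,...,w_n) \approx \prod_{i=1}^{n} P(w_i~|~w_{i-2},w_{i-1})$, where the missing context at the beginning is padded with a start symbol, exactly as we do below with <S>.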

Language models are widely used in interpretation tasks, for example in speech recognition. In the same way one can look for the most likely correction of a text or the most likely translation of a phrase. We also encounter language models whenever we type text on a phone. Naturally, sampling from a language model allows us to generate text. Depending on the type of model and the data it was trained on, this text will look more or less like "real" text. In this notebook we will build a model trained on the forum posts of the 20 newsgroups dataset.

In [7]:
train_data = fetch_20newsgroups(subset='train',remove=['headers', 'footers', 'quotes'])
test_data = fetch_20newsgroups(subset='test',remove=['headers', 'footers', 'quotes'])
In [8]:
text = train_data.data[0]
print(text)
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Since our goal is to generate more or less natural-looking text, we will use the much more careful tokenization from the NLTK library instead of the crude processing of the previous tutorial. The one exception is that the text is still lowercased (and will therefore also be generated in lowercase); this is done to slightly reduce the amount of training data and the size of the vocabulary.

In [9]:
def tokenized(documents):
    def process_document(doc: str):
        words = nltk.tokenize.word_tokenize(doc)
        return [w.lower() for w in words]
    return [process_document(doc) for doc in documents]
In [10]:
tokens_train = tokenized(train_data.data)

We use both the training and the test data from the classification task to train the language model.

In [11]:
tokens_train.extend(tokenized(test_data.data))
In [12]:
print(tokens_train[0])
['i', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'i', 'saw', 'the', 'other', 'day', '.', 'it', 'was', 'a', '2-door', 'sports', 'car', ',', 'looked', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70s', '.', 'it', 'was', 'called', 'a', 'bricklin', '.', 'the', 'doors', 'were', 'really', 'small', '.', 'in', 'addition', ',', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'this', 'is', 'all', 'i', 'know', '.', 'if', 'anyone', 'can', 'tellme', 'a', 'model', 'name', ',', 'engine', 'specs', ',', 'years', 'of', 'production', ',', 'where', 'this', 'car', 'is', 'made', ',', 'history', ',', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', ',', 'please', 'e-mail', '.']

Let us assign each word an index using CountVectorizer. Since this notebook was run several times to continue training the model, the resulting vocabulary was saved to a file and is loaded from there, because I am not sure CountVectorizer is deterministic. Since we train a 4-gram model $P(w_4|w_1,w_2,w_3)$, special symbols are introduced for the beginning of a text, <S>, and for its end, </S>. Because the model vocabulary is limited, every word outside the 40000 most frequent ones is replaced with the special word <UNK>. Generally speaking, this special word could be made more fine-grained, e.g. by introducing <UNK_NOUN>, <UNK_VERB> and so on for the different parts of speech, but we do not do that here.

In [9]:
# count_vectorizer = feature_extraction.text.CountVectorizer(preprocessor=lambda x:x,
#                                                            tokenizer=lambda x:x, max_features=40000)
# count_vectorizer.fit(tokens_train)
# feature_names = count_vectorizer.get_feature_names()
# feature_names.append('<UNK>')
# feature_names.append('<S>')
# feature_names.append('</S>')

# with open('nltk_feature_names4_wm.json','w+') as of:
#     json.dump(feature_names,of)

Let us open the file with the vocabulary (the list of words). Using this list, we assign each word an index (its position in the list) and convert all texts (lists of words) into lists of these indices.

In [13]:
with open('nltk_feature_names4_wm.json') as f:
    feature_names = json.load(f)
vocab = {v:i for i,v in enumerate(feature_names)}
unk_index = vocab['<UNK>']
n_words = len(feature_names)
end_index = vocab['</S>']
start_index = vocab['<S>']

ids_train = []
ids_test = []
for row in tokens_train:
    ids_train.append([vocab.get(word, unk_index) for word in row])

We split the resulting texts (now lists of indices) into quadruples (4-grams) and put them into a single array; below the first 10 elements are printed. In total the array contains 4760679 n-grams.

In [14]:
ngrams = []
for row in ids_train:
    for ngram in nltk.ngrams(row,4,pad_left=True, pad_right=True,
                               left_pad_symbol=start_index, right_pad_symbol=end_index):
        ngrams.append(ngram)
        
print(ngrams[:10])
print(len(ngrams))
    
[(40001, 40001, 40001, 20442), (40001, 40001, 20442, 38521), (40001, 20442, 38521, 38996), (20442, 38521, 38996, 20582), (38521, 38996, 20582, 7435), (38996, 20582, 7435, 26981), (20582, 7435, 26981, 36183), (7435, 26981, 36183, 13070), (26981, 36183, 13070, 16741), (36183, 13070, 16741, 24293)]
4760679

Convert the list of n-grams to a numpy array.

In [15]:
ngrams = np.array(ngrams)
print(ngrams.shape)
(4760679, 4)

As a model of the distribution $P(w_4|w_1,w_2,w_3)$ we will use a neural network. There are simpler, purely count-based n-gram models that estimate these probabilities from relative frequencies with smoothing (similar to the Naive Bayes classifier described in an earlier notebook). They train very quickly but consume a lot of memory. The choice of a neural network is motivated by some interesting properties of the resulting models, as well as by their generally higher quality. Model quality is usually measured with perplexity and cross-entropy, i.e. the average uncertainty about the next word given the previous ones (lower is better); a rough estimate of these quantities for our trained network is sketched further below, right after the model is loaded from disk.
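
For comparison, here is a minimal sketch of such a count-based estimate (not the model we train below): relative frequencies over the `ngrams` array built above with add-one (Laplace) smoothing. The two Counter objects over ~4.8 million 4-grams also illustrate why these models are memory-hungry.

In [ ]:
from collections import Counter

# Count-based estimate of P(w4 | w1, w2, w3) with add-one smoothing -- a sketch
# for comparison only; the large Counters show the memory cost of such models.
context_counts = Counter(map(tuple, ngrams[:, :3]))   # counts of (w1, w2, w3)
full_counts = Counter(map(tuple, ngrams))             # counts of (w1, w2, w3, w4)

def count_ngram_prob(w1, w2, w3, w4, alpha=1.0):
    return (full_counts[(w1, w2, w3, w4)] + alpha) / \
           (context_counts[(w1, w2, w3)] + alpha * n_words)

print(count_ngram_prob(vocab['to'], vocab['travel'], vocab['to'], vocab['the']))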

The model receives as input 3 words in the form of one-hot vectors, i.e. 40003-dimensional vectors (40000 words plus the special tokens) in which every element is zero except for a single one equal to 1. The same weight matrix is used for each of the three words: it is multiplied by this vector, and the result equals one of its columns. This column is called the word vector, and the transformation of a categorical variable into a vector is called an embedding; the weight matrix is accordingly called the embedding matrix. The resulting three vectors are concatenated and fed into the hidden layers of the network, which apply various transformations to them. The last layer uses the softmax function to produce a discrete probability distribution over the 40003 words.
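
A tiny self-contained numpy illustration of this equivalence (toy sizes, not the real embedding matrix): multiplying a one-hot vector by the embedding matrix simply selects the corresponding word vector.

In [ ]:
# Toy demonstration that an embedding lookup equals a one-hot matrix product.
# Keras stores word vectors as rows of the matrix (the description above uses
# the transposed, column-wise picture).
V_toy, d_toy = 5, 3
W_toy = np.random.randn(V_toy, d_toy)   # toy embedding matrix: one row per word
word_id = 2
one_hot = np.zeros(V_toy)
one_hot[word_id] = 1.0
print(np.allclose(one_hot @ W_toy, W_toy[word_id]))   # True: the product is row word_id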

The network needs one-hot vectors as target outputs, and converting all the targets at once would be prohibitively expensive in memory, so training examples are generated on the fly. The network is trained on batches of a fixed size, and before each iteration a batch is produced by the function below. The first three indices of every example go into the matrix $X$, while the last index is converted into a one-hot vector and goes into the matrix $y$.

In [16]:
def make_batch(batch_size=128):
    while True:
        indices = np.random.randint(0, len(ngrams),size = batch_size)
        rows = ngrams[indices]
        X = rows[:,:-1]
        labels = rows[:,-1]
        y = utils.to_categorical(labels, num_classes=n_words)
        yield X,y
In [17]:
XX,yy = next(make_batch())
print(XX.shape, XX.dtype)
print(yy.shape)
(128, 3) int64
(128, 40003)

Let us build the network. Its first layer assigns each word a trainable vector of size 130. These vectors are then concatenated and fed into the next layer of 400 units with a piecewise-linear (ReLU) activation. Then comes a special normalization layer that applies a simple transformation so that the data have a mean close to 0 and a standard deviation close to 1. Normalizing the data often speeds up training, although I no longer remember whether it helped here; in any case this layer did no harm. The last layer has dimension 40003 and its output is a probability distribution (i.e. its outputs sum to 1). The softmax activation gives the model the flexibility to push the output close in shape to a one-hot vector (see the sklearn introduction about softmax).

In [15]:
model = models.Sequential()
model.add(layers.Embedding(input_dim=n_words,output_dim=130, input_length=3))
model.add(layers.Flatten())
model.add(layers.Dense(400))
model.add(layers.Activation('relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dense(units=n_words, activation='softmax'))

optimizer = optimizers.Adagrad()
model.compile(optimizer,loss='categorical_crossentropy')

# model = models.load_model('language_model_nltk42best_wm.h5')

We train the model long and painfully, although in this case the Adagrad optimizer noticeably improved the training speed.

In [16]:
# model.fit_generator(make_batch(128),steps_per_epoch=2000,epochs=50, validation_data=make_batch(128),
#                     validation_steps=40,
#                     callbacks=[ callbacks.ModelCheckpoint('language_model_nltk42best_wm.h5',save_best_only=True),
#                                 callbacks.ModelCheckpoint('language_model_nltk42latest_wm.h5')])
Epoch 1/50
2000/2000 [==============================] - 239s 119ms/step - loss: 5.8739 - val_loss: 5.4960
Epoch 2/50
2000/2000 [==============================] - 222s 111ms/step - loss: 5.4761 - val_loss: 5.3675
Epoch 3/50
2000/2000 [==============================] - 222s 111ms/step - loss: 5.3233 - val_loss: 5.3186
Epoch 4/50
2000/2000 [==============================] - 222s 111ms/step - loss: 5.2353 - val_loss: 5.1612
Epoch 5/50
2000/2000 [==============================] - 223s 111ms/step - loss: 5.1647 - val_loss: 5.1096
Epoch 6/50
2000/2000 [==============================] - 223s 112ms/step - loss: 5.1029 - val_loss: 5.0213
Epoch 7/50
2000/2000 [==============================] - 224s 112ms/step - loss: 5.0634 - val_loss: 5.0833
Epoch 8/50
2000/2000 [==============================] - 224s 112ms/step - loss: 5.0199 - val_loss: 5.0254
Epoch 9/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.9786 - val_loss: 4.9600
Epoch 10/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.9551 - val_loss: 4.9790
Epoch 11/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.9090 - val_loss: 4.8264
Epoch 12/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.9122 - val_loss: 4.8699
Epoch 13/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.8560 - val_loss: 4.6862
Epoch 14/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.8329 - val_loss: 4.8340
Epoch 15/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.8084 - val_loss: 4.7514
Epoch 16/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.7890 - val_loss: 4.7303
Epoch 17/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.7662 - val_loss: 4.7798
Epoch 18/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.7607 - val_loss: 4.6492
Epoch 19/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.7393 - val_loss: 4.7335
Epoch 20/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.7209 - val_loss: 4.6822
Epoch 21/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.6981 - val_loss: 4.6139
Epoch 22/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.6816 - val_loss: 4.7003
Epoch 23/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.6666 - val_loss: 4.6773
Epoch 24/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.6517 - val_loss: 4.6494
Epoch 25/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.6395 - val_loss: 4.6543
Epoch 26/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.6253 - val_loss: 4.5916
Epoch 27/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.6153 - val_loss: 4.5605
Epoch 28/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.5867 - val_loss: 4.5463
Epoch 29/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.5656 - val_loss: 4.5660
Epoch 30/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.5732 - val_loss: 4.5540
Epoch 31/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.5564 - val_loss: 4.5457
Epoch 32/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.5416 - val_loss: 4.4679
Epoch 33/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.5356 - val_loss: 4.5567
Epoch 34/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.5188 - val_loss: 4.5714
Epoch 35/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.5177 - val_loss: 4.5455
Epoch 36/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.5085 - val_loss: 4.4175
Epoch 37/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.5038 - val_loss: 4.4201
Epoch 38/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4900 - val_loss: 4.4974
Epoch 39/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4769 - val_loss: 4.4232
Epoch 40/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4453 - val_loss: 4.3921
Epoch 41/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4637 - val_loss: 4.4366
Epoch 42/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4402 - val_loss: 4.4731
Epoch 43/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4391 - val_loss: 4.4755
Epoch 44/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4346 - val_loss: 4.4336
Epoch 45/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4083 - val_loss: 4.3839
Epoch 46/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.4065 - val_loss: 4.4527
Epoch 47/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.3924 - val_loss: 4.4027
Epoch 48/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.3887 - val_loss: 4.2692
Epoch 49/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.3753 - val_loss: 4.3876
Epoch 50/50
2000/2000 [==============================] - 224s 112ms/step - loss: 4.3814 - val_loss: 4.4006
Out[16]:
<keras.callbacks.History at 0x7f4c0145cc88>
In [148]:
# model.fit_generator(make_batch(128),steps_per_epoch=2000,epochs=50, validation_data=make_batch(128),
#                     validation_steps=100,
#                     callbacks=[ callbacks.ModelCheckpoint('language_model_nltk42best_wm_cont2.h5',save_best_only=True),
#                                 callbacks.ModelCheckpoint('language_model_nltk42latest_wm_cont2.h5')])
Epoch 1/50
2000/2000 [==============================] - 225s 112ms/step - loss: 4.1061 - val_loss: 4.0952
Epoch 2/50
2000/2000 [==============================] - 225s 113ms/step - loss: 4.1083 - val_loss: 4.0874
Epoch 3/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.1060 - val_loss: 4.0673
Epoch 4/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0953 - val_loss: 4.0958
Epoch 5/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0810 - val_loss: 4.0461
Epoch 6/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0930 - val_loss: 4.0500
Epoch 7/50
2000/2000 [==============================] - 229s 114ms/step - loss: 4.0822 - val_loss: 4.0516
Epoch 8/50
2000/2000 [==============================] - 228s 114ms/step - loss: 4.0854 - val_loss: 4.0773
Epoch 9/50
2000/2000 [==============================] - 227s 114ms/step - loss: 4.0788 - val_loss: 4.0575
Epoch 10/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0709 - val_loss: 4.0657
Epoch 11/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0803 - val_loss: 4.0769
Epoch 12/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0787 - val_loss: 4.0174
Epoch 13/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0587 - val_loss: 4.0720
Epoch 14/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0578 - val_loss: 4.0533
Epoch 15/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0455 - val_loss: 4.0114
Epoch 16/50
2000/2000 [==============================] - 227s 113ms/step - loss: 4.0457 - val_loss: 4.0705
Epoch 17/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0519 - val_loss: 4.0280
Epoch 18/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0485 - val_loss: 4.0194
Epoch 19/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0411 - val_loss: 4.0257
Epoch 20/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0438 - val_loss: 4.0436
Epoch 21/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0393 - val_loss: 3.9799
Epoch 22/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0274 - val_loss: 4.0101
Epoch 23/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0159 - val_loss: 3.9981
Epoch 24/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0338 - val_loss: 4.0164
Epoch 25/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0287 - val_loss: 3.9516
Epoch 26/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0268 - val_loss: 4.0079
Epoch 27/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0138 - val_loss: 3.9941
Epoch 28/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0113 - val_loss: 4.0059
Epoch 29/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0046 - val_loss: 3.9783
Epoch 30/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0022 - val_loss: 4.0050
Epoch 31/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0089 - val_loss: 3.9646
Epoch 32/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0044 - val_loss: 3.9887
Epoch 33/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9982 - val_loss: 3.9883
Epoch 34/50
2000/2000 [==============================] - 226s 113ms/step - loss: 4.0012 - val_loss: 3.9500
Epoch 35/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9909 - val_loss: 3.9339
Epoch 36/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9868 - val_loss: 3.9960
Epoch 37/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9827 - val_loss: 3.9917
Epoch 38/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9853 - val_loss: 3.9777
Epoch 39/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9779 - val_loss: 3.9893
Epoch 40/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9789 - val_loss: 3.9803
Epoch 41/50
2000/2000 [==============================] - 226s 113ms/step - loss: 3.9825 - val_loss: 3.9642
Epoch 42/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9764 - val_loss: 3.9475
Epoch 43/50
2000/2000 [==============================] - 227s 114ms/step - loss: 3.9814 - val_loss: 3.9580
Epoch 44/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9615 - val_loss: 3.9222
Epoch 45/50
2000/2000 [==============================] - 227s 114ms/step - loss: 3.9654 - val_loss: 3.9844
Epoch 46/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9622 - val_loss: 3.9091
Epoch 47/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9709 - val_loss: 3.9703
Epoch 48/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9456 - val_loss: 3.9548
Epoch 49/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9592 - val_loss: 3.9600
Epoch 50/50
2000/2000 [==============================] - 227s 113ms/step - loss: 3.9459 - val_loss: 3.9415
Out[148]:
<keras.callbacks.History at 0x7f4c009e3080>

The model was saved after every major training run, and since it is already trained we simply load it from the file.

In [18]:
model = models.load_model('language_model_nltk42latest_wm_cont2.h5')
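
As mentioned earlier, quality is usually summarized with cross-entropy (in nats) and perplexity. Here is a rough sketch of estimating both on a random sample of the n-grams built above; since these n-grams were also used for training, the estimate is optimistic, and a proper evaluation would use held-out text.

In [ ]:
# Rough cross-entropy / perplexity estimate on 10000 randomly sampled n-grams.
sample_idx = np.random.choice(len(ngrams), size=10000, replace=False)
sample = ngrams[sample_idx]
X_eval, labels = sample[:, :-1], sample[:, -1]
probs = model.predict(X_eval, batch_size=512)       # shape (10000, n_words)
p_true = probs[np.arange(len(labels)), labels]      # probability of the actual next word
cross_entropy = -np.mean(np.log(p_true + 1e-12))    # average negative log-likelihood, nats
print('cross-entropy:', cross_entropy, 'perplexity:', np.exp(cross_entropy))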

This function returns, in descending order of probability, the N most probable next words and their probabilities given the 3 previous words.

In [19]:
def distr(nn, start_vector, N):
    start_vector = np.asarray(start_vector)
    pred = nn.predict(start_vector.reshape(1,-1)).ravel()
    ml = np.argsort(pred)[::-1]
    return [ (index, pred[index]) for i, index in zip(range(N), ml)]
In [20]:
for index, prob in distr(model, [vocab['to'], vocab['travel'], vocab['to']], 10):
    print(feature_names[index], prob)
the 0.21830438
<UNK> 0.0982188
a 0.023411596
washington 0.010317443
another 0.009617392
pluto 0.009405057
this 0.007506593
jerusalem 0.0071299984
nasa 0.0069796834
school 0.006576244

Let us write a function that samples from the model. It is given the three previous words as a seed and generates k texts, each continuing until the requested length is reached or the end-of-text symbol is generated. A weighted random choice over the 40003 words is used for this. By default, generating the <UNK> symbol is also disabled.

In [30]:
def sample_from_model(nn, k, length, seed_vector, generate_unk=False):
    results = []
    indices = np.arange(len(feature_names))
    for i in range(k):
        start_vector = np.array(seed_vector)
        res = [feature_names[ind] for ind in start_vector]
        for _ in range(length):
            weights = nn.predict(start_vector.reshape(1,-1)).ravel()
            if not generate_unk:
                weights[unk_index] = 0
                weights /= weights.sum()
            next_ind = np.random.choice(indices,p=weights)
            start_vector[0], start_vector[1], start_vector[2] = start_vector[1], start_vector[2], next_ind
            if next_ind == end_index:
                break
            res.append(feature_names[next_ind])
        results.append(res)
    return results
        
In [31]:
samples = sample_from_model(model, 20, 10, [vocab['to'], vocab['travel'], vocab['to']])
for s in samples:
    print(' '.join(s))
to travel to this arena every rumor ! . i think the attitude
to travel to orbit between the network size price for use windows .
to travel to pluto a widget creation will one only person sometime off
to travel to light with heaps of stone and dust and when i
to travel to his return . this is a good choice . i
to travel to any match law , is termed nearly optimum on secrecy
to travel to the kinsey 's gift and public domain without peer to
to travel to 6-0 a legal place to call for the next day
to travel to the combined with each displays of multiple parents . has
to travel to armenia , the 22.9 ( 1-1 ) 18 tor mark
to travel to lawful lynn mitre corporation . sci.electronics '' _* # 3
to travel to this morning , the vast majority are there any summaries
to travel to developing iraq ( specifically , running yet before what happens
to travel to blue may cost ) $ 3.00 wolverine 1 ( 1982
to travel to heaven ? and building 6 applications . these are you
to travel to the account of the bible hype about this `` post
to travel to telephone and security . i 'm going to do business
to travel to 4gb of xt specification is typically defined a j14 rsh
to travel to path formats of any kind of opinion on particular group
to travel to biological ... he is not hoped that was another woman
In [33]:
samples = sample_from_model(model, 10, 100, [vocab['<S>'], vocab['<S>'], vocab['<S>']])
for s in samples:
    print(' '.join(s[3:]))
    print('---------')
my suggestion i have some kind ( environmental disaster ? senior , government engineering research research secretary of the caucasus . so , in your generation may play less than the best at one point it is actually grown up . -- have you already asked . -- -- -- -- -- -- file : 1 ) what about it ? as a result of n't you had a lot of `` sucking '' the literature of research demand the town of khojaly and a stark of secrecy and probably x $ 25 pad ( increased eisa contract without higher
---------
i have number were hard to give up to range , ... if you were not . such democracies are so such traditions can be complicated . the church came to god 's statement is whether that a bit of paper . but there is at least . but the press is anticipated of my couples who crucified khomeini , communion and said global war called out side effects . we must know why scripture secretly resurrected , and usefulness only human beings have some slack ! !
---------
... the hell actually going to go to nc . postal address -t to all card . i do n't know much , it *was* dull . genesis 0:0
---------
it would be it near the xt editor as well as toronto was the 3 count */ x char **argv ; /* those small display nerve control - contact : david @ stat.com francis schaeffer costs at that table . since we did n't have to pay the creative way , and they 'll know more of any point . when i try to explain the usenet posting the motif so i can vpl research inc. 950 transport board ( and is the relevant function information in which data longer based on the scientific end of the law for a
---------
organization : reform story , the sky is going to fall significantly in sliced of ... and i have n't heard of the small nhl ( or trying to get into the usenet j20 ( and use ) . if that 's going to do this at the us for $ 169 . i was told by the first person if trying that true that people do n't have a poor lan hog , etc . meanwhile , pope lee lady was n't lost the ghost models using educational camping member of his > back . ... ... ... ...
---------
unlikely with someone which is negligent and which there is no evidence which allows us use in a sensible language decoder refers to only tracing . i 'm trying to listen to the men . but that allows the discovery of immigration and grovel at this point in the season , but are only interested in itself the president yourself have said i had just already done from config.sys or autoexec.bat . other graphics items , carderock division hq ? > sl $ . @ & ] m : , ok bell , and i would appreciate to ask a
---------
[ i am name to cross-post it only a person ( in memorial time myths is getting our cops play good people i do seem that you must decide how science compares ... nasa 10.00 in rebellion after god 's existence from a definite effort . to go high . i would n't figure it away . it takes a lot of chances , base copying things from iomega on the market . see how many christians realize which _i_ do care . the government is there this year , behaviour without any advantage ( ver wide ) while we
---------
'' yes , i suppose model 1 is now 1 toshiba , r , c`8ws % % 14 % a86 % a86 % a86 % a86 % % c2 > p & 1eq < = ( < l ; o % 15o ) ; x check_io ( infile , mime , 'ishtar ' were involved with the hard drive and wait , the figure were out of being a big term which their heads are really using their morality in palestine . : if you get the hobby , that is it hard to find out . am i saying
---------
i 'd assume that that were n't stupid . they lose a wiretap chip is a delight in this man . `` typical methods '' , please cock , cramer starts , i 'm willing to buy a straight dollar ? it is the michael | in the way of nations in christian doctrine_ . `` the us that , huh ? the validity of the installation as follows : something a say they are simply straight control up about how the main purpose is to scramble its characteristics : know that he had ruled out . later , we
---------
if they had fired down a mexican relationship , then when it falls down . it refers to the fruit , just for the net '' these figures without aura , but they wanted to attempt left for the problem if you reach that yesterday ? he said , the lord [ jehovah ] and meaning but the situation .
---------

Let us pull the embedding matrix out of the network (in Keras the word vectors are in fact the rows of this matrix, not its columns).

In [26]:
embedding_layer = model.get_layer(index=1)
In [27]:
weight_matrix = embedding_layer.get_weights()[0]
print(weight_matrix.shape)
(40003, 130)

The most interesting property of neural language models is the behaviour of the word vectors themselves: the vectors of words that are similar in meaning are themselves similar (especially when compared with the cosine of the angle between them). The reason is that words with similar meanings tend to occur in similar contexts. The probability distribution over the next word depends on the regions of the continuous space in which the vectors of the current words lie, so if some word is likely in a given context, then words with similar vectors are likely as well, even if that similar word never appeared in this context in the training data. This is what accounts for the generally higher quality of neural models; one could say they are more "creative". Vector semantics is an actively developing field, and more efficient methods have been devised for obtaining word vectors (for example, training on the task of predicting the words in a window around a single input word); a minimal sketch of such an approach is given at the very end of this section. Still, let us look at the word vectors obtained here on a task that was not specialized for them.

In [28]:
word_row = weight_matrix[vocab['science']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:30]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
science 0.9999999
sociology 0.35702637
engineering 0.35591787
city 0.3472645
framework 0.34674296
religion 0.34157482
bachelor 0.33963746
discusion 0.33832413
raytracing 0.32665583
castleman 0.324735
hansch 0.3223599
iwii 0.31893703
society 0.31802285
*use 0.3145684
diversity 0.30877846
palace 0.30867457
xhibition 0.30741447
ets 0.30490905
company 0.3046248
graph 0.29802108
concise 0.29765123
christianity 0.29756892
farside 0.29726046
subcommittee 0.29398862
phd 0.29330707
afterlife 0.29314205
morality 0.2916542
cnn 0.2913236
microcomputer 0.29107454
lombardi 0.28918776
In [34]:
word_row = weight_matrix[vocab['russia']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
russia 0.99999994
france 0.43210626
finland 0.43084326
homeland 0.42050752
armenia 0.4170525
yugoslavia 0.3844141
austria 0.37654656
italy 0.37373185
africa 0.37339517
germany 0.35798073
canada 0.35491946
syria 0.35394484
britain 0.3460256
mason 0.34450972
auschwitz 0.34376627
sinners 0.3399324
qo 0.33357996
gnostics 0.32889277
1988 0.3282535
+61d9 0.3273915
republic 0.3256469
fatwa 0.3253315
sanctions 0.32187286
bi-weekly 0.3212945
project 0.32074258
In [35]:
word_row = weight_matrix[vocab['woman']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
woman 1.0000001
person 0.45076734
father 0.4482684
guy 0.3858136
mcnutt 0.37751788
sider 0.37388417
acquaintances 0.370646
somebody 0.36993632
officer 0.3674647
weaver 0.3605611
man 0.35939386
koresh 0.3570217
smoker 0.350826
oncologist 0.35020146
sasha 0.34962022
apostate 0.3492546
marina 0.3465214
ozal 0.34310022
physician 0.34087032
baranelli 0.3402592
korpisalo 0.3340821
baz 0.33404708
somone 0.33280975
muslimzade 0.33183035
madman 0.32958093
In [36]:
word_row = weight_matrix[vocab['sister']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
sister 1.0
apostle 0.38274595
brother 0.36352372
catholic 0.34326655
saviour 0.342304
palmach 0.33750206
cousin 0.33382183
vulcan 0.332994
nakhchivanik 0.3311457
coach 0.3227108
joseph 0.32107168
describing 0.31752113
vest 0.3126697
wycliffe 0.31256586
zx-7 0.31069034
`` 0.3101129
bosnian 0.3082302
instructor 0.30689746
tyrannical 0.30233005
st. 0.30029863
muslimzade 0.29969147
father 0.2985149
airbag 0.29736277
babes 0.29461166
maury 0.2932468
In [37]:
word_row = weight_matrix[vocab['god']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
god 0.9999999
christ 0.5597667
jesus 0.48245263
spirit 0.4150712
satan 0.4146759
father 0.4110653
godhead 0.41047603
sin 0.4053636
scripture 0.40060523
lord 0.38706818
he 0.38132656
salvation 0.38121256
muslimzade 0.38049176
sentence 0.38014272
diety 0.38012272
islam 0.3791488
christianity 0.3777405
ours 0.37446618
quran 0.3741362
savior 0.3688683
witt 0.36399773
allah 0.36377764
idolatry 0.36311734
person 0.36046562
nature 0.3577244
In [38]:
word_row = weight_matrix[vocab['keyboard']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
keyboard 0.9999999
client 0.37691864
everex 0.34524086
pc-xview 0.3301989
menus 0.32833576
simm 0.32818374
concidered 0.32794315
server 0.3267703
backwards 0.32626152
mouse 0.32257193
switch 0.3221672
flourish 0.32066652
tray 0.31837994
macs 0.31599477
taped 0.31577414
designer 0.30880216
lightning.mcrcim.mcgill.edu 0.30489397
sympathy 0.30434805
heavy-duty 0.30265313
sprite 0.30071053
textedit 0.296982
triple 0.29667437
386-33 0.29255432
modem 0.29196495
1728 0.2907111
In [41]:
word_row = weight_matrix[vocab['pretty']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
pretty 1.0000001
very 0.5840134
extremely 0.5335987
fairly 0.52427745
*very* 0.5189932
darn 0.47574264
terribly 0.47343314
quite 0.47282675
_very_ 0.44984683
too 0.43508384
elegantly 0.42395595
w/o 0.41980195
rather 0.41940147
comparatively 0.4187031
_too_ 0.4112808
awfully 0.40951702
amazingly 0.4039837
*too* 0.3977423
truly 0.39557523
equally 0.38534164
potentially 0.38429707
doubly 0.38412514
overly 0.3826621
atheistic 0.37632972
incredibly 0.37611324
In [43]:
word_row = weight_matrix[vocab['good']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
good 0.99999994
bad 0.48991653
excellent 0.455421
great 0.41904134
poor 0.41555342
decent 0.41510612
tough 0.40872747
reasonable 0.39244565
better 0.39235488
important 0.39190838
awful 0.37686244
_real_ 0.37561595
cron 0.37273782
competitive 0.3636005
terrific 0.3619932
valid 0.36160132
wonderful 0.36102256
stupid 0.3558522
philosophical 0.35271648
proper 0.3522777
fine 0.35132825
rj-11 0.34075317
profitable 0.33463636
honest 0.32941625
critical 0.32540017
In [46]:
word_row = weight_matrix[vocab['jpg']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
jpg 1.0000001
_5 0.4019403
cross-linked 0.38630542
self-documenting 0.37898585
mips 0.36325777
gif 0.36050418
dxf 0.3604413
system.ini 0.35447952
kh9_ 0.34809864
eps 0.3407935
.drv 0.33873805
kleck 0.33688164
04 0.33484617
bdf 0.33420017
sparcclassic 0.32825452
spd 0.32774496
.xauthority 0.32764104
bmp 0.32657436
image 0.32344937
.pov 0.32001948
*.ini 0.3160319
targa 0.31554887
postcript 0.31534654
.ico 0.31444645
versatile 0.313566
In [47]:
word_row = weight_matrix[vocab['strcmp']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
strcmp 1.0
strncmp 0.6511047
fgets 0.6329852
strlen 0.4983735
get_line 0.48827767
xtaddcallback 0.4830331
/5\c 0.47907937
5d 0.4765886
tanh 0.46220443
/sizeof 0.45834073
xtresizewidget 0.45284238
bla 0.44699386
xsetfunction 0.4457585
fscanf 0.44501188
scrolls_ 0.4422206
fflush 0.42884892
xinstallcolormap 0.41147816
qb*xb 0.4077193
xtnew 0.40750882
xtoffsetof 0.40735152
xstorecolor 0.40715867
xtappnextevent 0.4038584
prototyping 0.40302736
fprintf 0.39409748
distortedreference 0.39259928
In [49]:
word_row = weight_matrix[vocab['jewish']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
jewish 0.99999994
muslim 0.47244522
armenian 0.44387215
hispanic 0.42710745
communal 0.42306012
communist 0.39317256
secular 0.38953194
honest 0.38680086
fascist 0.38412574
christian 0.3840099
arab 0.38256004
turkish 0.38227054
straightforward 0.37759462
arabian 0.37707043
non-christian 0.3751354
inconsistent 0.373676
religious 0.37204915
party 0.37196428
palestinian 0.37128827
unidentified 0.36882576
zoroastrian 0.36729512
younger 0.36258242
fundamentalist 0.3618556
antagonistic 0.35941088
outstanding 0.35790473
In [50]:
word_row = weight_matrix[vocab['gun']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
gun 1.0
handgun 0.5105647
pwm 0.3798025
no-knock 0.37716037
anti-trust 0.36657378
lesbian 0.36430416
cannons 0.35787398
civilian 0.34338418
nhl 0.3243743
vehicular 0.31966922
tae 0.3140675
batf 0.313496
scsi2 0.31275213
0.295 0.31253517
motorcycle 0.3061506
cato 0.3041167
-3- 0.3038187
marina 0.3025621
infamous 0.30008215
strangers 0.29687893
rightful 0.29598337
anti-discrimination 0.29520035
750ss 0.29487464
well-defined 0.29434463
lightweight 0.29383415
In [52]:
word_row = weight_matrix[vocab['satan']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
satan 1.0
khomeini 0.48109674
koresh 0.46945247
christ 0.44991353
god 0.4146759
yeltsin 0.40743545
jesus 0.3992924
coward 0.39742625
themselves 0.39501083
mcconkie 0.39021286
derounian 0.37938684
himself 0.37836868
srebrenica 0.36733407
rhetoric 0.36305803
prophecy 0.35685566
muslimzade 0.35539153
jehovah 0.35307527
children 0.3512423
godhead 0.3472354
scripture 0.3441132
jews 0.3384788
deportation 0.3373965
lucifer 0.33659402
roehm 0.33187848
1304s 0.3293076
In [53]:
word_row = weight_matrix[vocab['friend']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
friend 0.9999999
colleague 0.3796779
girlfriend 0.37700662
schlafly 0.368368
occurrences 0.36512467
doctor 0.3605139
denizen 0.35812283
tale 0.3580027
acupuncturist 0.35441995
norm 0.35169408
co-worker 0.35089755
wife 0.3432543
friends 0.33613276
ayshe 0.3339044
souvenirs 0.33319747
bartel 0.3324245
rosenthall 0.328469
vein 0.3264603
leery 0.32515305
morning 0.32457516
department 0.32029474
ruuttu..16 0.31762543
matthew 0.31504762
berman 0.31254023
matty 0.3121451
In [54]:
word_row = weight_matrix[vocab['clinton']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
clinton 1.0000001
bush 0.5194382
sahl 0.43596432
bede 0.38176543
reagan 0.37529543
applicant 0.37272134
koresh 0.36381096
typesetting 0.35825038
elchibey 0.35786462
roehm 0.3453334
friedman 0.34372526
citizen 0.34005213
weeping 0.3355927
kinsey 0.32454547
atonement 0.31607723
burba 0.31443128
scofield 0.3141542
cbc 0.31246915
fatima 0.3107081
whosoever 0.3091321
cyprus 0.30845815
win3.1 0.30712864
*work* 0.30709323
personalities 0.30408508
melrose 0.30399743
In [57]:
word_row = weight_matrix[vocab['car']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
car 1.0
bike 0.43159813
modem 0.3597206
hard-disk 0.3501223
card 0.3478018
motorcycle 0.3444622
watching 0.3424116
auto 0.33424452
clone 0.32419506
slot 0.3163374
minor 0.31586024
naprosyn 0.31518224
monitor 0.313687
media 0.30613878
argumentation 0.30557615
administration 0.3054061
printer 0.30440098
'poly 0.30321246
powder 0.30304083
machine 0.30254218
game 0.29920396
it 0.2949192
healer 0.2932816
gfa-555 0.2905322
centris610 0.28936297
In [64]:
word_row = weight_matrix[vocab['astronomy']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
astronomy 1.0
5500e 0.38272813
philosphy 0.34679085
aerospace 0.34158596
mcluhan 0.33953476
research 0.33560842
miya 0.33385086
nasa/jsc/gm2 0.33238336
rutgers 0.32581306
aeronautics 0.31762797
ips 0.31700563
kim 0.3144237
unprotected 0.31416076
epsf 0.3099253
shape 0.3094435
uzis 0.30806577
iridium 0.30735624
morphing 0.30486336
386bsd 0.30348885
cramped 0.3020208
combed 0.29691833
damsus 0.2958063
chew 0.2953169
corrective 0.29364485
hollow 0.29291862
In [67]:
word_row = weight_matrix[vocab['mars']]
sims = metrics.pairwise.cosine_similarity(word_row.reshape(1,-1), weight_matrix)

most_similar = np.argsort(sims.ravel())[::-1]
print(sims.shape)
for ms in most_similar[:25]:
    print(feature_names[ms], sims[0,ms])
(1, 40003)
mars 1.0000002
magellan 0.3409891
lunar 0.33808994
space 0.3351627
titan 0.33168378
son-in-law 0.33166203
charles 0.33071432
venus 0.33064112
jupiter 0.33026072
moon 0.33010706
podein 0.32788286
bedouin 0.32135245
bugunlerde 0.32073864
parish 0.31032467
desqview 0.30843624
planetary 0.3081848
cassini 0.30510363
fabrication 0.30183706
television 0.30164337
t45s/ 0.30157107
bonehead 0.30135283
atom 0.30123326
90-91 0.29659033
eschatology 0.29462087
galileo 0.2943912
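
Finally, a minimal sketch of the specialized approach mentioned above: training skip-gram word vectors with gensim's Word2Vec on the same tokenized texts. The hyperparameters here are illustrative only; in gensim 3.x the dimension argument is `size` (in gensim 4+ it is `vector_size`).

In [ ]:
# A sketch of training word vectors directly on a word2vec-style objective
# (predicting the words in a window around the input word), for comparison with
# the vectors learned as a by-product of the language model above.
# Hyperparameters are illustrative; in gensim 4+ use vector_size= instead of size=.
from gensim.models import Word2Vec

w2v = Word2Vec(tokens_train, size=130, window=5, min_count=5, sg=1, workers=4)
print(w2v.wv.most_similar('science', topn=10))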