4.2 NLP: Word Embeddings
Introduction to Word Embeddings
The tokenization and sequencing from the previous section already gave us numbers that represent words and sentences. So how do we use them for sentiment analysis? Much as a CNN extracts features from images, we need to extract features from these sentences that capture the relevant information.
The core idea of an embedding is to map words to vectors in a multi-dimensional space, in which related words cluster together.
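As a minimal illustration of that idea (the layer size and the toy indices below are arbitrary, not part of the IMDB example), an Embedding layer is essentially a trainable lookup table from integer word indices to dense vectors:
import numpy as np
import tensorflow as tf

# Toy embedding: 100 possible word indices, each mapped to a 4-dimensional vector
toy_embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=4)

# A "sentence" of 3 word indices becomes a (1, 3, 4) tensor of vectors;
# training nudges these vectors so that related words end up close together
toy_sentence = np.array([[5, 27, 3]])
print(toy_embedding(toy_sentence).shape)  # (1, 3, 4)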
Below we learn this by building an embedding for movie-review classification, where every review is labeled as either positive or negative. The imdb_reviews dataset used here is also provided officially, via TensorFlow Datasets.
Using the existing review labels, TensorFlow can cluster the words that appear in the different types of reviews.
Embeddings in Practice
The hands-on material is the movie-review dataset mentioned above.
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the IMDB reviews dataset together with its metadata, as (text, label) pairs
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
train_data, test_data = imdb['train'], imdb['test']
training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []
# str(s.numpy()) is needed in Python 3 to turn the byte-string tensor into a regular string
for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())
for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
vocab_size = 10000   # keep only the 10,000 most frequent words
embedding_dim = 16   # each word is represented by a 16-dimensional vector
max_length = 120     # pad / truncate every review to 120 tokens
trunc_type = 'post'  # cut off the end of reviews longer than max_length
oov_tok = "<OOV>"    # token that stands in for out-of-vocabulary words
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type, padding='post')
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
# Use the same padding and truncation for the test set as for the training set
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating=trunc_type, padding='post')
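Before defining the model, a quick sanity check (not part of the original code) is to turn a padded sequence back into text with the tokenizer; padding zeros and words outside the 10,000-word vocabulary cannot be recovered exactly:
# Decode the first padded training sequence back into words;
# out-of-vocabulary words come back as "<OOV>" and padding zeros are skipped
print(tokenizer.sequences_to_texts(padded[:1].tolist())[0])
print(training_sentences[0])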
Compared with the earlier models, the only real change in the model definition is the extra Embedding layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 120, 16)           160000
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0
_________________________________________________________________
dense (Dense)                (None, 6)                 11526
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
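It is worth checking where these parameter counts come from: the Embedding layer is a 10,000 × 16 lookup table, i.e. 160,000 weights; Flatten just reshapes the 120 × 16 output into a 1,920-dimensional vector and has no parameters of its own; the first Dense layer has 1,920 × 6 weights plus 6 biases = 11,526; and the output Dense layer has 6 weights plus 1 bias = 7, giving 171,533 parameters in total.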
num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Epoch 1/10
782/782 [==============================] - 3s 4ms/step - loss: 0.5031 - accuracy: 0.7325 - val_loss: 0.3580 - val_accuracy: 0.8434
Epoch 2/10
782/782 [==============================] - 3s 4ms/step - loss: 0.2423 - accuracy: 0.9035 - val_loss: 0.3768 - val_accuracy: 0.8336
Epoch 3/10
782/782 [==============================] - 3s 4ms/step - loss: 0.0905 - accuracy: 0.9759 - val_loss: 0.4720 - val_accuracy: 0.8172
Epoch 4/10
782/782 [==============================] - 3s 4ms/step - loss: 0.0220 - accuracy: 0.9971 - val_loss: 0.5586 - val_accuracy: 0.8145
Epoch 5/10
782/782 [==============================] - 3s 4ms/step - loss: 0.0059 - accuracy: 0.9996 - val_loss: 0.6386 - val_accuracy: 0.8089
Epoch 6/10
782/782 [==============================] - 3s 4ms/step - loss: 0.0018 - accuracy: 1.0000 - val_loss: 0.6842 - val_accuracy: 0.8148
Epoch 7/10
782/782 [==============================] - 3s 4ms/step - loss: 8.6061e-04 - accuracy: 1.0000 - val_loss: 0.7334 - val_accuracy: 0.8134
Epoch 8/10
782/782 [==============================] - 3s 4ms/step - loss: 4.8114e-04 - accuracy: 1.0000 - val_loss: 0.7770 - val_accuracy: 0.8129
Epoch 9/10
782/782 [==============================] - 3s 4ms/step - loss: 2.7152e-04 - accuracy: 1.0000 - val_loss: 0.8162 - val_accuracy: 0.8138
Epoch 10/10
782/782 [==============================] - 3s 4ms/step - loss: 1.6482e-04 - accuracy: 1.0000 - val_loss: 0.8588 - val_accuracy: 0.8134
After training, we can pull the learned embedding matrix out of the Embedding layer, which is the first layer of the model.
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape)  # shape: (vocab_size, embedding_dim)
## (10000, 16)
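For example, to inspect the learned vector for a single word we can index this weight matrix with the word's tokenizer index (the word 'movie' here is just an illustrative choice; any word whose index is below vocab_size works):
# Row 0 of the matrix is the padding index; real words start at index 1
movie_idx = word_index['movie']
print(movie_idx, weights[movie_idx])  # the 16-dimensional vector learned for "movie"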
Next we save this data to two files, 'vecs.tsv' and 'meta.tsv'.
Load these two files at https://projector.tensorflow.org/ to visualize the embedding in detail.
import io
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
# Invert word_index so that integer indices map back to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
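Finally, to use the trained network for the sentiment analysis we set out to do, new sentences just need to go through the same tokenizer and padding before calling model.predict; the two reviews below are made up purely for illustration:
new_reviews = ["The movie was absolutely wonderful, I loved every minute of it",
               "What a boring waste of time, the plot made no sense at all"]

new_sequences = tokenizer.texts_to_sequences(new_reviews)
new_padded = pad_sequences(new_sequences, maxlen=max_length,
                           truncating=trunc_type, padding='post')

# Outputs close to 1 mean positive sentiment, close to 0 mean negative
print(model.predict(new_padded))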