Learning About Distributed Word Representations, Part 2 (Hands-On)

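This post is the hands-on follow-up to the previous one on distributed word representations. We train on a corpus built from the text of Japanese Wikipedia. The aim is a model that, given an input word, returns words with similar meanings.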

Preparation
Training the model
Thoughts on learning distributed word representations

Preparation

First, download the dataset and create the Python files other than the main training script.

The folder layout for this post is as follows.

utils.py

This file contains the function that loads the dataset.

#utils.py

def load_data(filepath, encoding='utf-8'):
  with open(filepath, encoding=encoding) as f:
    return f.read()

preprocessing.py

build_vocablary: prepares the tokenizer.
create_dataset: generates the word pairs and labels used to train the word2vec model.
This is where the positive and negative samples are created.

#preprocessing.py

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams, make_sampling_table

def build_vocablary(text, num_words=None):
    tokenizer = Tokenizer(num_words=num_words, oov_token='<UNK>')
    tokenizer.fit_on_texts([text])
    return tokenizer

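def create_dataset(text, vocab, num_words, window_size, negative_samples):
  # Convert the whole corpus into a single sequence of word IDs.
  data = vocab.texts_to_sequences([text]).pop()
  # Sampling probabilities that downsample very frequent words.
  sampling_table = make_sampling_table(num_words)
  # Generate (target, context) pairs: label 1 for real neighbors, 0 for random words.
  couples, labels = skipgrams(data, num_words,
                              window_size=window_size,
                              negative_samples=negative_samples,
                              sampling_table=sampling_table)
  # Split the pairs into two column vectors, one per model input.
  word_target, word_context = zip(*couples)
  word_target = np.reshape(word_target, (-1, 1))
  word_context = np.reshape(word_context, (-1, 1))
  labels = np.asarray(labels)
  return [word_target, word_context], labels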
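
To make the positive and negative samples more concrete, here is a minimal pure-Python sketch of the pair generation. It is not the Keras skipgrams implementation: the helper name make_pairs, the toy ID sequence, and the fixed seed are assumptions for illustration, and the frequency-based subsampling that sampling_table provides is left out.

```python
import random


def make_pairs(sequence, vocab_size, window_size=1, negative_samples=1, seed=0):
    """Return (couples, labels) for a list of word IDs (hypothetical helper)."""
    rng = random.Random(seed)
    couples, labels = [], []
    for i, target in enumerate(sequence):
        # Positive samples: words that really occur inside the window.
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append((target, sequence[j]))
                labels.append(1)
    # Negative samples: pair existing targets with randomly drawn word IDs.
    num_negative = int(len(labels) * negative_samples)
    targets = [couple[0] for couple in couples]
    for k in range(num_negative):
        random_word = rng.randint(1, vocab_size - 1)  # ID 0 is reserved
        couples.append((targets[k % len(targets)], random_word))
        labels.append(0)
    return couples, labels


couples, labels = make_pairs([2, 3, 4, 5], vocab_size=10)
for couple, label in zip(couples, labels):
    print(couple, label)
```

With window_size=1 each word is paired only with its immediate neighbors, and negative_samples=1 adds as many random pairs as there are real ones. A random pair can occasionally coincide with a real neighbor pair; on a large vocabulary this is rare enough to ignore in a sketch like this.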

model.py

Here the model is defined using the Functional API.

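The __init__ method defines each layer of the model.
- word_input: input layer for the word.
- word_embed: embedding layer for the word.
- context_input: input layer for the context.
- context_embed: embedding layer for the context.
- dot: computes the dot product of the word and context embedding vectors.
- flatten: flattens the result of the dot product.
- dense: output layer with a sigmoid activation.
The build method assembles the model.
- word_embed: the word's embedding vector.
- context_embed: the context's embedding vector.
- dot: the dot product of the word and context embedding vectors.
- flatten: the flattened dot-product result.
- output: the output layer.
- model: defines the model by specifying its inputs and outputs.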

Later we extract the Embedding layer named word_embedding from this model. The layer starts out randomly initialized, but training adjusts its weights until they represent the words well.

# model.py

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dot, Dense, Embedding, Flatten

class EmbeddingModel:
  def __init__(self, vocab_size, emb_dim=100):
    self.word_input = Input(shape=(1,), name='word_input')
    self.word_embed = Embedding(input_dim=vocab_size,
                                output_dim=emb_dim,
                                input_length=1,
                                name='word_embedding')

    self.context_input = Input(shape=(1,), name='context_input')
    self.context_embed = Embedding(input_dim=vocab_size,
                                   output_dim=emb_dim,
                                   input_length=1,
                                   name='context_embedding')

    self.dot = Dot(axes=2)
    self.flatten = Flatten()
    self.dense = Dense(1, activation='sigmoid')

  def build(self):
    word_embed = self.word_embed(self.word_input)
    context_embed = self.context_embed(self.context_input)
    dot = self.dot([word_embed, context_embed])
    flatten = self.flatten(dot)
    output = self.dense(flatten)

    model = Model(inputs=[self.word_input, self.context_input], outputs=output)
    return model
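
The forward pass of this model is small enough to trace by hand. The NumPy sketch below mirrors the layers above on random weights. The variable names, the toy sizes, and the Dense weight w and bias b are assumptions chosen for illustration, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10, 4

# Stand-ins for the weight matrices of the two Embedding layers.
word_table = rng.normal(size=(vocab_size, emb_dim))
context_table = rng.normal(size=(vocab_size, emb_dim))
# Stand-ins for the single weight and bias of the Dense(1) layer.
w, b = 1.0, 0.0


def predict(word_ids, context_ids):
    """Probability that each (word, context) pair is a real neighbor pair."""
    word_vec = word_table[word_ids]            # shape: (batch, emb_dim)
    context_vec = context_table[context_ids]   # shape: (batch, emb_dim)
    # Dot followed by Flatten leaves one score per pair: shape (batch,)
    dot = np.sum(word_vec * context_vec, axis=1)
    # Dense(1, activation='sigmoid') scales, shifts, and squashes the score.
    return 1.0 / (1.0 + np.exp(-(w * dot + b)))


probs = predict(np.array([1, 2, 3]), np.array([4, 5, 6]))
print(probs.shape)
```

Training pushes the dot product up for real neighbor pairs and down for random pairs, which is what pulls the vectors of words used in similar contexts toward each other.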

inference.py

This file provides a most_similar method that returns the words most similar to a given word.

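__init__:
- Initializes the class.
- Stores the weights of the model's word embedding layer and the vocabulary.
most_similar:
- Returns the words most similar to the given word.
- Looks up the word's index and computes cosine similarities.
- Returns the most similar words as a list.
similarity:
- Computes how similar two words are.
- Looks up each word's embedding vector and compares the two with SciPy's cosine function, which measures cosine distance.
_cosine_similarity:
- Computes cosine similarities for the word at the given index.
- Computes the similarity to every word in the vocabulary and returns a flat array.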

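As covered in the theory post, what matters in a distributed word representation is not the neural network itself but its weights. So all we need here are the weights of the Embedding layer named word_embedding. We fetch that layer with the get_layer method and read out its weights with the get_weights method.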

# inference.py

from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

class InferenceAPI:
  def __init__(self, model, vocab):
    self.vocab = vocab 
    self.weights = model.get_layer('word_embedding').get_weights()[0]

  def most_similar(self, word, topn=10):
    word_index = self.vocab.word_index.get(word, 1)
    sim = self._cosine_similarity(word_index)

    # Build a list of (similarity, index) tuples
    pairs = [(s, i) for i, s in enumerate(sim)]
    pairs.sort(reverse=True)
    pairs = pairs[1:topn+1]

    # Use the index to look up the word in vocab.index_word
    res = [(self.vocab.index_word[i], s) for s, i in pairs]
    return res

  def similarity(self, word1, word2):
    word1_index = self.vocab.word_index.get(word1, 1)
    word2_index = self.vocab.word_index.get(word2, 1)
    weight1 = self.weights[word1_index]
    weight2 = self.weights[word2_index]
    # scipy's cosine() returns a distance, so convert it to a similarity.
    return 1 - cosine(weight1, weight2)

  def _cosine_similarity(self, target_idx):
    target_weight = self.weights[target_idx]
    similarity = cosine_similarity(self.weights, [target_weight])
    return similarity.flatten()
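
For readers who want to see what the cosine similarity and ranking steps compute, here is a NumPy-only sketch on a hand-made weight matrix. The 4x2 matrix and the function name rank_by_cosine are made up for illustration. Unlike the class above, this version removes the target by its index rather than by dropping the first sorted entry.

```python
import numpy as np


def rank_by_cosine(weights, target_idx, topn=2):
    """Return [(index, similarity)] for the rows most similar to target_idx."""
    norms = np.linalg.norm(weights, axis=1)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = weights @ weights[target_idx] / (norms * norms[target_idx])
    order = np.argsort(-sims)                   # highest similarity first
    order = order[order != target_idx][:topn]   # drop the target itself
    return [(int(i), float(sims[i])) for i in order]


weights = np.array([
    [1.0, 0.0],   # row 0: the target
    [2.0, 0.0],   # row 1: same direction as row 0, different length
    [1.0, 1.0],   # row 2: 45 degrees away from row 0
    [0.0, 3.0],   # row 3: orthogonal to row 0
])
ranked = rank_by_cosine(weights, target_idx=0)
print(ranked)
```

Row 1 ranks first even though it is twice as long as the target, because cosine similarity looks only at direction. Cosine distance, which is what scipy.spatial.distance.cosine returns, is 1 minus this similarity.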

Training the model

I named the file that trains the model train.py.

1. Mount Google Drive and add the project folder to the import path

from google.colab import drive
drive.mount('/content/drive')

import sys 
sys.path.append('/content/drive/MyDrive/自然言語処理編/chapter8')

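Importing the libraries

These lines import the libraries and modules we need. pprint prints data structures in a readable form. The TensorFlow/Keras modules are used to build and train the model. The custom classes and functions (InferenceAPI, EmbeddingModel, build_vocablary, create_dataset, load_data) come from the Python files prepared above.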

   from pprint import pprint
   from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
   from tensorflow.keras.models import load_model
   from inference import InferenceAPI
   from model import EmbeddingModel
   from preprocessing import build_vocablary, create_dataset
   from utils import load_data

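Setting the hyperparameters

Set the embedding dimension (emb_dim), the number of epochs (epochs), the path where the model is saved (model_path), the amount of negative sampling (negative_samples; in Keras's skipgrams, 1 means as many negative pairs as positive ones), the vocabulary size (num_words), and the context window size (window_size).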

   emb_dim = 50
   epochs = 10
   model_path = '/content/drive/MyDrive/自然言語処理編/chapter8/model.keras'
   negative_samples = 1
   num_words = 10000
   window_size = 1

Loading the corpus

   text = load_data(filepath='/content/drive/MyDrive/自然言語処理編/chapter8/data/ja.text8')

Building the vocabulary

Build the vocabulary from the text data, limited to the most frequent words as set by num_words.

   vocab = build_vocablary(text, num_words)

Creating the dataset

Create the training dataset (x and y) from the text data, the vocabulary, and the parameters set above.

   x, y = create_dataset(text, vocab, num_words, window_size, negative_samples)

Building the model

Instantiate and build the embedding model, then compile it with the Adam optimizer and the binary cross-entropy loss.

   model = EmbeddingModel(num_words, emb_dim)
   model = model.build()
   model.compile(optimizer='adam', loss='binary_crossentropy')
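
As a quick illustration of what the binary cross-entropy loss rewards, here is a small standard-library sketch. The function below is a hand-written stand-in rather than the Keras implementation, and the example probabilities and the clipping constant eps are made-up values.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A real pair scored 0.9 and a random pair scored 0.2: low loss.
good = binary_crossentropy([1, 0], [0.9, 0.2])
# The same pairs scored the wrong way round: high loss.
bad = binary_crossentropy([1, 0], [0.1, 0.8])
print(good, bad)
```

The loss is small when real neighbor pairs (label 1) get probabilities near 1 and random pairs (label 0) get probabilities near 0, so minimizing it is what drives the embeddings to separate the two kinds of pair.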

Setting up the callbacks

Set up two callbacks: early stopping (to prevent overfitting) and a model checkpoint (to save the best model seen during training).

   callbacks = [
       EarlyStopping(patience=1),
       ModelCheckpoint(model_path, save_best_only=True)
   ]
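
To get a feel for what patience=1 means, the sketch below simulates the stopping rule on a made-up list of validation losses. It is a simplified model of the behavior, assuming the monitored value is the validation loss and any decrease counts as an improvement. It is not the Keras source, and stopping_epoch is a hypothetical helper name.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it, reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


print(stopping_epoch([0.50, 0.40, 0.41, 0.39]))
```

In this made-up run, training stops at epoch 3 even though epoch 4 would have been better, which is the trade-off of a small patience. The best loss before stopping was at epoch 2, and that is the model ModelCheckpoint with save_best_only=True would have left on disk.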

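Training the model

Train the model on the dataset with a batch size of 128 for the specified number of epochs, holding out 20% of the data for validation and applying the callbacks defined above.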

   model.fit(
       x,
       y,
       batch_size=128,
       epochs=epochs,
       validation_split=0.2,
       callbacks=callbacks
   )
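
One detail worth knowing about validation_split is that Keras takes the validation portion from the end of the data, before any shuffling. The sketch below imitates that on plain lists. The helper name split_tail and the toy data are assumptions for illustration.

```python
def split_tail(x, y, validation_split=0.2):
    """Hold out the last fraction of the samples, as validation_split does."""
    split_at = int(len(y) * (1.0 - validation_split))
    train = (x[:split_at], y[:split_at])
    val = (x[split_at:], y[split_at:])
    return train, val


samples = list(range(10))
labels = [i % 2 for i in samples]
(train_x, train_y), (val_x, val_y) = split_tail(samples, labels)
print(len(train_x), val_x)
```

Because skipgrams shuffles the generated pairs by default, the tail that gets held out here is still a mix of positive and negative pairs rather than, say, only the negatives.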

Evaluating the model

Finally, we look up the words most similar to the word 日本 (Japan).

model = load_model(model_path)
api = InferenceAPI(model, vocab)
pprint(api.most_similar(word="日本"))

Putting it all together

Combined into one script, the code above looks like this.

from pprint import pprint

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model

from inference import InferenceAPI
from model import EmbeddingModel
from preprocessing import build_vocablary, create_dataset
from utils import load_data

if __name__ == '__main__':
  # Set the hyperparameters
  emb_dim = 50
  epochs = 10
  model_path = '/content/drive/MyDrive/自然言語処理編/chapter8/model.keras'
  negative_samples = 1
  num_words = 10000
  window_size = 1

  # Load the corpus
  text = load_data(filepath='/content/drive/MyDrive/自然言語処理編/chapter8/data/ja.text8')

  # Build the vocabulary
  vocab = build_vocablary(text, num_words)

  # Create the dataset
  x, y = create_dataset(text, vocab, num_words, window_size, negative_samples)

  # Build the model
  model = EmbeddingModel(num_words, emb_dim)
  model = model.build()
  model.compile(optimizer='adam', loss='binary_crossentropy')

  # Prepare the callbacks
  callbacks = [
      EarlyStopping(patience=1),
      ModelCheckpoint(model_path, save_best_only=True)
  ]

  # Train the model
  model.fit(
      x,
      y,
      batch_size=128,
      epochs=epochs,
      validation_split=0.2,
      callbacks=callbacks
  )

  # Evaluate the model
  model = load_model(model_path)
  api = InferenceAPI(model, vocab)
  pprint(api.most_similar(word="日本"))

Running this script produces output like the following.

# Result
[('ブラジル', 0.8584108),
 ('韓国', 0.83080703),
 ('ユネスコ', 0.8248229),
 ('キリスト教', 0.8087141),
 ('エトルリア', 0.8085349),
 ('スウェーデン', 0.80410016),
 ('極東', 0.80402076),
 ('中国', 0.8038982),
 ('アカデミー', 0.8024315),
 ('ヤマハ', 0.8019269)]

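Thoughts on learning distributed word representations

Having worked through the basic mechanism and the implementation, I think whoever first came up with this idea was a genius. And even now that I understand the calculations involved, I still find it impressive that a machine can judge similarity this well.

Next time, we will train distributed word representations using gensim.

This article is based on the book below.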