CNNs Can Handle NLP Tasks Too: A Brief Look at 7 Models for Text Classification

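This article is a follow-up to an earlier post of mine on sentiment analysis of Twitter data (https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html). Back then I built a simple model: a two-layer feed-forward neural network trained with Keras. Each input tweet was represented by a document vector, computed as the weighted average of the embeddings of the words in the tweet.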
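
As a rough illustration of such a document vector, the sketch below averages word vectors weighted by a per-word score. The 3-dimensional vectors, the weights, and the function name `weighted_average` are made up for this example; the text above does not say which weighting the earlier model used.

```python
def weighted_average(tokens, vectors, weights):
    """Return the weighted mean of the vectors of the known tokens."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(tok, 1.0)
        for j, value in enumerate(vectors[tok]):
            total[j] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known word: return the zero vector
    return [value / weight_sum for value in total]


# Hypothetical 3-dimensional embeddings and weights, for illustration only.
toy_vectors = {"happy": [1.0, 0.0, 2.0], "time": [0.0, 1.0, 0.0]}
toy_weights = {"happy": 3.0, "time": 1.0}

doc_vector = weighted_average(["happy", "time", "unknown"], toy_vectors, toy_weights)
print(doc_vector)
```

Words with a larger weight pull the document vector towards their own embedding, which is the whole point of weighting the average.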


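The embeddings came from a word2vec model that I trained from scratch on the corpus with gensim. The task was binary classification, and the model reached 79% accuracy.

The goal of this article is to explore other NLP models trained on the same dataset and to evaluate their performance on a given test set.

We will go through different models, from simple ones that rely on a bag-of-words representation to more complex ones that use convolutional or recurrent networks, and see whether we can beat 79% accuracy.

I will start with simple models and increase the complexity step by step. Part of the point is to show that simple models can also be very effective.

Here is what I will try:

Logistic regression on word-level n-grams
Logistic regression on character-level n-grams
Logistic regression on word-level and character-level n-grams combined
A recurrent neural network (bidirectional GRU) without pre-trained word embeddings
A recurrent neural network (bidirectional GRU) with GloVe pre-trained word embeddings
A multi-channel convolutional neural network
An RNN (bidirectional GRU) + CNN model

Boilerplate code for these NLP techniques is included at the end of the article. It can help you kick-start your own NLP projects and get strong results (some of these models are really powerful).

We will also end up with a comprehensive benchmark that tells us which model is best suited to predicting the sentiment of a tweet.

The accompanying GitHub repository also contains the different models, their predictions, and the test set, so you can try them yourself and check the results.
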
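  import os
  import re
  import warnings

  warnings.simplefilter("ignore", UserWarning)

  from matplotlib import pyplot as plt
  # IPython/Jupyter magic: render plots inside the notebook.
  %matplotlib inline

  import pandas as pd
  pd.options.mode.chained_assignment = None
  import numpy as np
  from string import punctuation

  from nltk.tokenize import word_tokenize

  from sklearn.model_selection import train_test_split
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score, auc, roc_auc_score
  # sklearn.externals.joblib was removed from recent scikit-learn releases;
  # the standalone joblib package provides the same dump/load functions.
  import joblib

  import scipy
  from scipy.sparse import hstack

  # tqdm provides the progress_map and tqdm_notebook helpers used later on.
  from tqdm import tqdm
  from tqdm.notebook import tqdm as tqdm_notebook
  tqdm.pandas()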

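I. Data preprocessing

You can download the dataset from this link (http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/).

We load the data and extract the variables we need (the sentiment label and the tweet text).

The dataset contains 1,578,614 classified tweets; each row is labeled 1 for positive sentiment or 0 for negative sentiment.

The dataset's authors recommend using 1/10 of the data for testing and the rest for training.
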
  data = pd.read_csv('./data/tweets.csv', encoding='latin1', usecols=['Sentiment', 'SentimentText'])
  data.columns = ['sentiment', 'text']
  # Shuffle the rows (fixed seed for reproducibility).
  data = data.sample(frac=1, random_state=42)
  print(data.shape)
  # (1578614, 2)

  for row in data.head(10).iterrows():
      print(row[1]['sentiment'], row[1]['text'])
  # 1 http://www.popsugar.com/2999655 keep voting for robert pattinson in the popsugar100 as well!!
  # 1 @GamrothTaylor I am starting to worry about you, only I have Navy Seal type sleep hours.
  # 0 sunburned...no sunbaked!    ow.  it hurts to sit.
  # 1 Celebrating my 50th birthday by doing exactly the same as I do every other day - working on our websites.  It's just another day.
  # 1 Leah and Aiden Gosselin are the cutest kids on the face of the Earth
  # 1 @MissHell23 Oh. I didn't even notice.
  # 0 WTF is wrong with me?!!! I'm completely miserable. I need to snap out of this
  # 0 Was having the best time in the gym until I got to the car and had messages waiting for me... back to the down stage!
  # 1 @JENTSYY oh what happened??
  # 0 @catawu Ghod forbid he should feel responsible for anything!

Tweets are noisy. We clean them by removing URLs, hashtags, user mentions, and punctuation.

  def tokenize(tweet):
      tweet = re.sub(r'http\S+', '', tweet)    # remove URLs
      tweet = re.sub(r"#(\w+)", '', tweet)     # remove hashtags
      tweet = re.sub(r"@(\w+)", '', tweet)     # remove user mentions
      tweet = re.sub(r'[^\w\s]', '', tweet)    # remove punctuation
      tweet = tweet.strip().lower()
      tokens = word_tokenize(tweet)
      return tokens
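
To see what the cleaning steps do on a single tweet, here is a small self-contained sketch. The helper name `clean_tweet` and the sample URL are made up, and it splits on whitespace instead of calling NLTK's `word_tokenize`, so that it runs without any NLTK data; the regular expressions are the ones used in `tokenize`.

```python
import re


def clean_tweet(tweet):
    """Apply the same regex clean-up as tokenize(), then split on whitespace."""
    tweet = re.sub(r"http\S+", "", tweet)   # drop URLs
    tweet = re.sub(r"#(\w+)", "", tweet)    # drop hashtags
    tweet = re.sub(r"@(\w+)", "", tweet)    # drop user mentions
    tweet = re.sub(r"[^\w\s]", "", tweet)   # drop punctuation
    return tweet.strip().lower().split()


sample = "@JENTSYY oh what happened?? http://example.com #sad"
print(clean_tweet(sample))
```

Only the plain words of the tweet survive; the mention, the link, the hashtag, and the question marks are all gone.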

We save the cleaned data to disk.

  data['tokens'] = data.text.progress_map(tokenize)
  data['cleaned_text'] = data['tokens'].map(lambda tokens: ' '.join(tokens))
  # index=False keeps the row index out of the file, so reading it back yields two columns.
  data[['sentiment', 'cleaned_text']].to_csv('./data/cleaned_text.csv', index=False)

  data = pd.read_csv('./data/cleaned_text.csv')
  print(data.shape)
  # (1575026, 2)
  data.head()

Now that the dataset is clean, we can split it into a training set and a test set and start building models.

Every model in this article uses this same split.

  x_train, x_test, y_train, y_test = train_test_split(data['cleaned_text'],
                                                      data['sentiment'],
                                                      test_size=0.1,
                                                      random_state=42,
                                                      stratify=data['sentiment'])

  print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
  # (1417523,) (157503,) (1417523,) (157503,)
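
The sizes printed above can be checked with a little arithmetic. The sketch assumes that the test size is rounded up to the next integer, which is consistent with the shapes shown; the variable names are only for this example.

```python
import math

n_samples = 1575026          # rows in the cleaned dataset
test_fraction = 0.1          # test_size passed to train_test_split

n_test = math.ceil(test_fraction * n_samples)   # test size rounded up
n_train = n_samples - n_test                    # everything else is training data

print(n_train, n_test)
```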

We store the test labels on disk for later use.

  pd.DataFrame(y_test).to_csv('./predictions/y_true.csv', index=False, encoding='utf-8')

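Now we can apply machine learning methods.

1. Bag-of-words model based on word-level n-grams

So, what is an n-gram?

As the figure shows, n-grams are all the combinations of n adjacent words that can be found in the source text.

Our model will use unigrams (n=1) and bigrams (n=2) as features.

The dataset is represented as a matrix in which each row is a tweet and each column is a feature (a unigram or a bigram) extracted from the tokenized, cleaned tweets. Each cell holds a tf-idf score (simpler values would also work, but tf-idf is widely used and generally performs well). We call this matrix the document-term matrix.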
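
To make the document-term matrix concrete, here is a minimal pure-Python sketch of tf-idf scoring on a toy corpus. It assumes a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is my understanding of scikit-learn's default settings; the function name `tfidf_rows` and the toy documents are invented for the example, and only unigrams are scored.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: score} dict per document (rows of a document-term matrix)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in counts.items()}
        # Scale the row to unit length (guarding against empty documents).
        norm = math.sqrt(sum(v * v for v in row.values())) or 1.0
        rows.append({t: v / norm for t, v in row.items()})
    return rows


toy_docs = ["i like pizza", "i like pasta", "pizza pizza pizza"]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print(row)
```

In the second document, "pasta" appears nowhere else in the corpus, so it gets a higher score than "i" or "like", which occur in two documents each.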

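As you can imagine, the number of distinct unigrams and bigrams in a corpus of 1.5 million tweets is huge. For computational reasons we cap it at a fixed value, which you can choose by cross-validation.

After vectorization, the corpus looks like the figure below.

Suppose we want the model to make a prediction for the following sentence using the features above:

  I like pizza a lot

Since we use unigrams and bigrams, the model extracts the following features:
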
  i, like, pizza, a, lot, i like, like pizza, pizza a, a lot
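
The feature list above can be reproduced with a few lines of Python. This is only a sketch: the function name `word_ngrams` is made up, and it ignores the stop-word removal and token pattern that `TfidfVectorizer` applies on top.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of the text for n_min <= n <= n_max."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("I like pizza a lot"))
```

A sentence of 5 words gives 5 unigrams and 4 bigrams, so 9 features in total.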

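The sentence therefore becomes a vector of size N (the total number of tokens kept as features), filled with zeros except for the tf-idf scores of these n-grams. In practice we are dealing with a large, sparse vector.

Linear models generally handle large, sparse data well, and they are also faster to train than other models.

From past experience, logistic regression works well on sparse tf-idf matrices.
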
  vectorizer_word = TfidfVectorizer(max_features=40000,
                                    min_df=5,
                                    max_df=0.5,
                                    analyzer='word',
                                    stop_words='english',
                                    ngram_range=(1, 2))

  # Wrap the training texts in a progress bar while fitting the vocabulary.
  vectorizer_word.fit(tqdm_notebook(x_train, leave=False))

  tfidf_matrix_word_train = vectorizer_word.transform(x_train)
  tfidf_matrix_word_test = vectorizer_word.transform(x_test)

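With the tf-idf matrices generated for the training and test sets, we can build and test our first model.

The tf-idf matrix provides the features for the logistic regression.
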
  lr_word = LogisticRegression(solver='sag', verbose=2)
  lr_word.fit(tfidf_matrix_word_train, y_train)

Once the model is trained, we apply it to the test data to get predictions, then save both the model and the predictions to disk.

  joblib.dump(lr_word, './models/lr_word_ngram.pkl')

  y_pred_word = lr_word.predict(tfidf_matrix_word_test)
  pd.DataFrame(y_pred_word, columns=['y_pred']).to_csv('./predictions/lr_word_ngram.csv', index=False)

The resulting accuracy:

  y_pred_word = pd.read_csv('./predictions/lr_word_ngram.csv')
  print(accuracy_score(y_test, y_pred_word))
  # 0.782042246814

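The first model reaches 78.2% accuracy. Not bad! Let's move on to the second model.

2. Bag-of-words model based on character-level n-grams

Nobody said n-grams were only for words; they can be applied to characters as well.

As you can see, we apply the same code as before to character-level n-grams (see the figure), this time going up to 4-grams.

This means that a sentence like "I like this movie" yields the following features:
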
  I, l, i, k, e, ..., I li, lik, like, ..., this, ... , is m, s mo, movi, ...
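
A short sketch of how such character n-grams can be enumerated. The function name `char_ngrams` is invented for the example, and unlike `TfidfVectorizer` it does not lowercase the text or filter rare n-grams.

```python
def char_ngrams(text, n_min=1, n_max=4):
    """Return all character n-grams of the text for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


grams = char_ngrams("I like this movie")
print(len(grams))
print([g for g in grams if len(g) == 4][:3])
```

Spaces count as characters, so n-grams such as "s mo" span two words. That is how this representation picks up a little context around word boundaries.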

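Character-level n-grams are very effective and can even outperform word-level tokens in language modeling tasks. Tasks such as spam filtering or language identification rely heavily on character-level n-grams.

Unlike the previous model, which learns combinations of words, this model learns combinations of letters, which lets it capture the morphology of words.

One advantage of character-based representations is that they cope better with misspelled words.

Let's run the same pipeline:
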
  vectorizer_char = TfidfVectorizer(max_features=40000,
                                    min_df=5,
                                    max_df=0.5,
                                    analyzer='char',
                                    ngram_range=(1, 4))

  vectorizer_char.fit(tqdm_notebook(x_train, leave=False));

  tfidf_matrix_char_train = vectorizer_char.transform(x_train)
  tfidf_matrix_char_test = vectorizer_char.transform(x_test)

  lr_char = LogisticRegression(solver='sag', verbose=2)
  lr_char.fit(tfidf_matrix_char_train, y_train)

  y_pred_char = lr_char.predict(tfidf_matrix_char_test)
  joblib.dump(lr_char, './models/lr_char_ngram.pkl')

  pd.DataFrame(y_pred_char, columns=['y_pred']).to_csv('./predictions/lr_char_ngram.csv', index=False)
  y_pred_char = pd.read_csv('./predictions/lr_char_ngram.csv')
  print(accuracy_score(y_test, y_pred_char))
  # 0.80420055491

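80.4% accuracy! The character-level n-gram model performs better than the word-level one.

3. Bag-of-words model based on word-level and character-level n-grams

Character-level n-gram features seem to give better accuracy than word-level ones. So what happens if we combine the two?

We concatenate the two tf-idf matrices into a new, hybrid tf-idf matrix. The resulting model can learn the morphological structure of a word together with that of the words likely to appear next to it.

Let's combine these features.
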
  # Stack the word-level and character-level features side by side.
  tfidf_matrix_word_char_train = hstack((tfidf_matrix_word_train, tfidf_matrix_char_train))
  tfidf_matrix_word_char_test = hstack((tfidf_matrix_word_test, tfidf_matrix_char_test))

  lr_word_char = LogisticRegression(solver='sag', verbose=2)
  lr_word_char.fit(tfidf_matrix_word_char_train, y_train)

  y_pred_word_char = lr_word_char.predict(tfidf_matrix_word_char_test)
  joblib.dump(lr_word_char, './models/lr_word_char_ngram.pkl')

  pd.DataFrame(y_pred_word_char, columns=['y_pred']).to_csv('./predictions/lr_word_char_ngram.csv', index=False)
  y_pred_word_char = pd.read_csv('./predictions/lr_word_char_ngram.csv')
  print(accuracy_score(y_test, y_pred_word_char))
  # 0.81423845895

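We get 81.4% accuracy. Combining the two feature sets gains a full percentage point over the character-level model and beats both previous models.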
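
What the horizontal stacking does can be sketched without SciPy. In this toy version each sparse row is a {column_index: value} dict, and the helper name `hstack_rows` and all the numbers are invented: the character features are simply appended after the word features by shifting their column indices.

```python
def hstack_rows(left_rows, left_width, right_rows):
    """Concatenate two sparse matrices column-wise, row by row.

    Each row is a {column_index: value} dict; columns of the right-hand
    matrix are shifted by the width of the left-hand one.
    """
    stacked = []
    for left, right in zip(left_rows, right_rows):
        row = dict(left)
        row.update({left_width + col: val for col, val in right.items()})
        stacked.append(row)
    return stacked


word_rows = [{0: 0.7, 2: 0.3}, {1: 1.0}]          # 2 tweets x 3 word features
char_rows = [{1: 0.5}, {0: 0.2, 3: 0.9}]          # 2 tweets x 4 char features
combined = hstack_rows(word_rows, 3, char_rows)   # 2 tweets x 7 features
print(combined)
```

With the settings used in this article, each tweet would go from two vectors of at most 40,000 features each to a single vector of at most 80,000 features.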

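About bag-of-words models

Pros: given how simple they are, bag-of-words models are already quite powerful; they are fast to train and easy to understand.
Cons: even though n-grams carry some context between words, bag-of-words models cannot model long-term dependencies between the words in a sequence.

Now we turn to deep learning models. They outperform bag-of-words models because they can capture the sequential dependencies between the words of a sentence. This is largely thanks to a special neural network architecture: the recurrent neural network.

This article does not cover the theory behind RNNs, but the post at this link (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) is well worth reading. It comes from Christopher Olah's blog and describes in detail a particular kind of RNN: the long short-term memory network (LSTM).

Before starting, we need to set up a dedicated deep learning environment to use Keras on top of TensorFlow. Honestly, I tried to run this code on my personal laptop, but given the size of the dataset and the complexity of the RNN architectures, that was not practical. A good alternative is AWS. I usually use the Deep Learning AMI (https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) on an EC2 p2.xlarge instance. An Amazon AMI is a pre-configured VM image with all the packages (TensorFlow, PyTorch, Keras, etc.) installed. I highly recommend it!
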
  # These imports target the Keras 2.x API.
  from keras.preprocessing.text import Tokenizer
  from keras.preprocessing.text import text_to_word_sequence
  from keras.preprocessing.sequence import pad_sequences

  from keras.models import Model
  from keras.models import Sequential

  from keras.layers import Input, Dense, Embedding, Conv1D, Conv2D, MaxPooling1D, MaxPool2D
  from keras.layers import Reshape, Flatten, Dropout, Concatenate
  from keras.layers import SpatialDropout1D, concatenate
  from keras.layers import GRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D

  from keras.callbacks import Callback
  from keras.optimizers import Adam

  from keras.callbacks import ModelCheckpoint, EarlyStopping
  from keras.models import load_model
  # plot_model is exposed directly by keras.utils; the vis_utils path is gone in later releases.
  from keras.utils import plot_model

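4. Recurrent neural network without pre-trained word embeddings

RNNs may look intimidating. They are complex and hard to understand, but they are also very interesting. An RNN encapsulates an elegant design that overcomes the shortcomings of traditional neural networks on sequence data (text, time series, video, DNA sequences, and so on).

An RNN is a series of neural network modules linked to one another like a chain, each passing a message on to its successor. I highly recommend reading Colah's blog for a deeper look at the internals; the figure below comes from there.

The type of sequence we are dealing with is text. Word order matters for meaning, and RNNs take it into account, which lets them capture long-term dependencies.

To use Keras on text data, we first have to preprocess it. We can use the Tokenizer class from Keras. It takes num_words as an argument, which is the maximum number of words to keep after tokenization, ranked by word frequency.
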
  MAX_NB_WORDS = 80000
  tokenizer = Tokenizer(num_words=MAX_NB_WORDS)

  tokenizer.fit_on_texts(data['cleaned_text'])

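Once the tokenizer has been fitted on the data, we can use it to convert text strings into sequences of numbers.

These numbers represent the position of each word in the dictionary (think of it as a mapping).

Here is an example:
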
  x_train[15]
  # 'breakfast time happy time'

Here is how the tokenizer turns it into a sequence of numbers.

  tokenizer.texts_to_sequences([x_train[15]])
  # [[530, 50, 119, 50]]
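
The mapping from words to numbers can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the function names and the toy corpus are made up, and it assumes that words are ranked by frequency starting at 1 and that only ranks below `num_words` are kept.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by their rank, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences


corpus = ["happy time", "breakfast time happy time", "time flies"]
index = fit_word_index(corpus)
print(index)
print(texts_to_sequences(["breakfast time happy time"], index))
```

As in the real example above, the repeated word "time" maps to the same number both times it appears.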

Next, we apply the tokenizer to the training and test texts:

  train_sequences = tokenizer.texts_to_sequences(x_train)
  test_sequences = tokenizer.texts_to_sequences(x_test)

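Each tweet is now mapped to a list of integers. Because the lists have different lengths, they still cannot be stacked into a matrix. Fortunately, Keras can pad sequences with zeros up to a maximum length. We set this length to 35, which is the maximum number of tokens in a tweet.
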
  MAX_LENGTH = 35
  padded_train_sequences = pad_sequences(train_sequences, maxlen=MAX_LENGTH)
  padded_test_sequences = pad_sequences(test_sequences, maxlen=MAX_LENGTH)
  padded_train_sequences
  # array([[    0,     0,     0, ...,  2383,   284,     9],
  #        [    0,     0,     0, ...,    13,    30,    76],
  #        [    0,     0,     0, ...,    19,    37, 45231],
  #        ...,
  #        [    0,     0,     0, ...,    43,   502,  1653],
  #        [    0,     0,     0, ...,     5,  1045,   890],
  #        [    0,     0,     0, ..., 13748, 38750,   154]])
  padded_train_sequences.shape
  # (1417523, 35)
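
The padding shown above puts the zeros in front of each sequence. A minimal sketch of that behaviour for a single sequence follows; the helper name `pad_sequence` is made up, and it assumes that sequences longer than the limit are cut from the front as well, which is my understanding of the Keras defaults.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


print(pad_sequence([530, 50, 119, 50], maxlen=6))
print(pad_sequence([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6))
```

After this step every tweet has exactly the same length, so the whole dataset fits in one integer matrix.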

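The data is now ready to be fed to an RNN.

Here are the elements of the architecture I will use:

An embedding dimension of 300. Each of the 80,000 words we use is mapped to a 300-dimensional dense vector of floats. This mapping is adjusted during training.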
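A spatial dropout layer applied to the embedding layer to reduce overfitting: it looks at batches of 35*300 matrices and, in each matrix, randomly sets entire embedding dimensions (columns) to 0 across all words, rather than dropping individual values. This keeps the model from relying too heavily on any single feature and helps it generalize.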
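A bidirectional gated recurrent unit (GRU): this is the recurrent part of the network. It is a faster variant of the LSTM architecture. Think of it as a combination of two recurrent networks that scan the text sequence in both directions at once, left to right and right to left. When reading a given word, the network can therefore use both the preceding and the following context. The dimension of the output h_t of each GRU block is the number of units, which we set to 100. Since the GRU is bidirectional, the final output of each RNN block has 200 dimensions.

The output of the bidirectional GRU has shape (batch size, time steps, units). With a typical batch size of 256, the shape is (256, 35, 200).

Global average pooling is applied to each batch; it averages the output vectors over the time steps (i.e. the words).
The same operation is applied with max pooling instead of average pooling.
The outputs of the two previous operations are concatenated.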
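
The two pooling operations and the concatenation can be sketched in plain Python on a single toy sequence. The function names and the numbers are made up; the real model works on 35 time steps with 200 units each, so the concatenated vector would have 400 entries per tweet.

```python
def global_average_pool(steps):
    """Average each unit over all time steps."""
    return [sum(column) / len(steps) for column in zip(*steps)]


def global_max_pool(steps):
    """Take the maximum of each unit over all time steps."""
    return [max(column) for column in zip(*steps)]


# One toy sequence: 3 time steps, 2 units per step.
rnn_outputs = [[1.0, 4.0],
               [3.0, 0.0],
               [2.0, 2.0]]

# Concatenating the two pooled vectors doubles the number of features.
features = global_average_pool(rnn_outputs) + global_max_pool(rnn_outputs)
print(features)
```

Both poolings remove the time dimension, which is what lets a final dense layer work on a fixed-size vector whatever the tweet looks like.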

 
 
 
 
  
  
  
  def get_simple_rnn_model():
      embedding_dim = 300
      # Random initial embeddings; they are adjusted during training.
      embedding_matrix = np.random.random((MAX_NB_WORDS, embedding_dim))

      inp = Input(shape=(MAX_LENGTH, ))
      x = Embedding(input_dim=MAX_NB_WORDS, output_dim=embedding_dim, input_length=MAX_LENGTH,
                    weights=[embedding_matrix], trainable=True)(inp)
      x = SpatialDropout1D(0.3)(x)
      x = Bidirectional(GRU(100, return_sequences=True))(x)
      avg_pool = GlobalAveragePooling1D()(x)
      max_pool = GlobalMaxPooling1D()(x)
      conc = concatenate([avg_pool, max_pool])
      outp = Dense(1, activation="sigmoid")(conc)

      model = Model(inputs=inp, outputs=outp)
      model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])
      return model

  rnn_simple_model = get_simple_rnn_model()

The layers of the model are shown below:

  plot_model(rnn_simple_model,
             to_file='./images/article_5/rnn_simple_model.png',
             show_shapes=True,
             show_layer_names=True)

During training, a model checkpoint is used. At the end of each epoch, it automatically saves the best model so far (as measured by accuracy) to disk.

  # Note: newer Keras versions name this metric 'val_accuracy' instead of 'val_acc'.
  filepath="./models/rnn_no_embeddings/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
  checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

  batch_size = 256
  epochs = 2

  history = rnn_simple_model.fit(x=padded_train_sequences,
                      y=y_train,
                      validation_data=(padded_test_sequences, y_test),
                      batch_size=batch_size,
                      callbacks=[checkpoint],
                      epochs=epochs,
                      verbose=1)

  best_rnn_simple_model = load_model('./models/rnn_no_embeddings/weights-improvement-01-0.8262.hdf5')

  y_pred_rnn_simple = best_rnn_simple_model.predict(padded_test_sequences, verbose=1, batch_size=2048)

  # Turn the predicted probabilities into 0/1 labels with a 0.5 threshold.
  y_pred_rnn_simple = pd.DataFrame(y_pred_rnn_simple, columns=['prediction'])
  y_pred_rnn_simple['prediction'] = y_pred_rnn_simple['prediction'].map(lambda p: 1 if p >= 0.5 else 0)
  y_pred_rnn_simple.to_csv('./predictions/y_pred_rnn_simple.csv', index=False)
  y_pred_rnn_simple = pd.read_csv('./predictions/y_pred_rnn_simple.csv')
  print(accuracy_score(y_test, y_pred_rnn_simple))
  # 0.826219183127

Accuracy reaches 82.6%, a very good result! The model now does better than the previous bag-of-words models because it takes the sequential nature of the text into account.
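
The last step of the evaluation, turning sigmoid outputs into labels and scoring them, can be sketched as follows. The function names and the toy values are made up; the 0.5 threshold is the one used in the code above.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


toy_probs = [0.91, 0.12, 0.50, 0.49]   # made-up sigmoid outputs
toy_truth = [1, 0, 0, 0]
toy_pred = to_labels(toy_probs)
print(toy_pred, accuracy(toy_truth, toy_pred))
```

A probability of exactly 0.5 is counted as positive here, which is why the third toy example ends up misclassified.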

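Can we do even better?

5. Recurrent neural network with GloVe pre-trained word embeddings

In the previous model, the embedding matrix was initialized randomly. What if we initialized it with pre-trained word embeddings instead? Take an example: suppose the word "pizza" is in the corpus. With the previous architecture, it is initialized as a 300-dimensional vector of random floats. That is perfectly fine: it is easy to implement, and the embedding gets adjusted during training. But you could also use another model, trained on a very large corpus, to produce the embedding for "pizza" instead of a random vector. This is a special kind of transfer learning.

Using knowledge from external embeddings can improve the accuracy of the RNN, because it brings in new information (lexical and semantic) about the words, information that was learned and distilled from a very large corpus.

The pre-trained embeddings we use are GloVe.

The official description reads: GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The GloVe embeddings used here were trained on a very large web crawl that includes:

840 billion tokens;
a vocabulary of 2.2 million words.

The compressed file to download is 2.03 GB. Note that this file cannot easily be loaded on a standard laptop.

The GloVe embeddings have 300 dimensions.

The GloVe embeddings come as a raw text file in which each line contains a word followed by 300 floats (the corresponding embedding). So the first step is to convert this structure into a Python dictionary.
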
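  def get_coefs(word, *arr):
      # Turn one line of the GloVe file into a (word, vector) pair;
      # lines whose values cannot be parsed as floats map to (None, None).
      try:
          return word, np.asarray(arr, dtype='float32')
      except ValueError:
          return None, None

  embeddings_index = dict(get_coefs(*o.strip().split()) for o in tqdm_notebook(open('./embeddings/glove.840B.300d.txt', encoding='utf-8')))

  embed_size = 300
  # Drop entries whose vector does not have exactly 300 dimensions.
  for k in tqdm_notebook(list(embeddings_index.keys())):
      v = embeddings_index[k]
      try:
          if v.shape != (embed_size, ):
              embeddings_index.pop(k)
      except AttributeError:
          pass

  # Remove the placeholder entry left by unparsable lines, if there is one.
  embeddings_index.pop(None, None)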
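
The parsing logic can be tried on a tiny in-memory file. Everything in this sketch is made up for illustration (3-dimensional vectors, the helper name `parse_glove_line`, the sample lines); it only shows the line format and why malformed lines need to be filtered out.

```python
import io


def parse_glove_line(line, dim):
    """Split 'word v1 v2 ... v_dim' into (word, [floats]); return None if malformed."""
    parts = line.rstrip().split(" ")
    if len(parts) != dim + 1:
        return None  # wrong number of fields
    try:
        return parts[0], [float(x) for x in parts[1:]]
    except ValueError:
        return None  # a field that is not a number


# Made-up 3-dimensional vectors standing in for the 300-dimensional file.
fake_file = io.StringIO("pizza 0.1 0.2 0.3\nbroken 0.1 oops 0.3\nshort 0.5\n")

embeddings = {}
for line in fake_file:
    parsed = parse_glove_line(line, dim=3)
    if parsed is not None:
        word, vector = parsed
        embeddings[word] = vector

print(embeddings)
```

Only the well-formed line survives, which mirrors the two clean-up passes in the code above: one for values that are not floats, one for vectors of the wrong size.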

Once the embedding index is built, we extract all the vectors, stack them together, and compute their mean and standard deviation.

  values = list(embeddings_index.values())
  all_embs = np.stack(values)

  emb_mean, emb_std = all_embs.mean(), all_embs.std()

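Now we generate the embedding matrix. It is initialized from a normal distribution with mean=emb_mean and std=emb_std. We then loop over the 80,000 words of the corpus: if a word exists in GloVe, we use its GloVe embedding; otherwise we skip it and keep the random initialization.
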
  word_index = tokenizer.word_index
  nb_words = MAX_NB_WORDS
  # Start from random vectors drawn with the same mean and std as the GloVe vectors.
  embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

  oov = 0  # number of words that have no GloVe vector
  for word, i in tqdm_notebook(word_index.items()):
      if i >= MAX_NB_WORDS: continue
      embedding_vector = embeddings_index.get(word)
      if embedding_vector is not None:
          embedding_matrix[i] = embedding_vector
      else:
          oov += 1

  print(oov)

  def get_rnn_model_with_glove_embeddings():
      embedding_dim = 300
      inp = Input(shape=(MAX_LENGTH, ))
      x = Embedding(MAX_NB_WORDS, embedding_dim, weights=[embedding_matrix], input_length=MAX_LENGTH, trainable=True)(inp)
      x = SpatialDropout1D(0.3)(x)
      x = Bidirectional(GRU(100, return_sequences=True))(x)
      avg_pool = GlobalAveragePooling1D()(x)
      max_pool = GlobalMaxPooling1D()(x)
      conc = concatenate([avg_pool, max_pool])
      outp = Dense(1, activation="sigmoid")(conc)

      model = Model(inputs=inp, outputs=outp)
      model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])
      return model

  rnn_model_with_embeddings = get_rnn_model_with_glove_embeddings()

  filepath="./models/rnn_with_embeddings/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
  checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

  batch_size = 256
  epochs = 4

  history = rnn_model_with_embeddings.fit(x=padded_train_sequences,
                      y=y_train,
                      validation_data=(padded_test_sequences, y_test),
                      batch_size=batch_size,
                      callbacks=[checkpoint],
                      epochs=epochs,
                      verbose=1)

  best_rnn_model_with_glove_embeddings = load_model('./models/rnn_with_embeddings/weights-improvement-03-0.8372.hdf5')

  y_pred_rnn_with_glove_embeddings = best_rnn_model_with_glove_embeddings.predict(
      padded_test_sequences, verbose=1, batch_size=2048)

  # Turn the predicted probabilities into 0/1 labels with a 0.5 threshold.
  y_pred_rnn_with_glove_embeddings = pd.DataFrame(y_pred_rnn_with_glove_embeddings, columns=['prediction'])
  y_pred_rnn_with_glove_embeddings['prediction'] = y_pred_rnn_with_glove_embeddings['prediction'].map(
      lambda p: 1 if p >= 0.5 else 0)
  y_pred_rnn_with_glove_embeddings.to_csv('./predictions/y_pred_rnn_with_glove_embeddings.csv', index=False)
  y_pred_rnn_with_glove_embeddings = pd.read_csv('./predictions/y_pred_rnn_with_glove_embeddings.csv')
  print(accuracy_score(y_test, y_pred_rnn_with_glove_embeddings))
  # 0.837203100893

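Accuracy reaches 83.7%! Transfer learning from external word embeddings works. For the rest of this tutorial, the embedding matrix will use GloVe embeddings.

6. Multi-channel convolutional neural network

In this part, I experiment with a convolutional neural network architecture I read about here (http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/). CNNs are mostly used for computer vision tasks, but I recently tried applying them to NLP tasks, and the results were promising.

Let's take a quick look at what happens when a convolutional network is applied to text data. To explain it, I borrowed this well-known figure (shown below) from wildml.com, an excellent blog.

Consider the example used there: I like this movie very much! (7 tokens)

Each word has an embedding of dimension 5, so the sentence is represented by a matrix of shape (7, 5). You can think of it as an "image" (a matrix of numbers or floats).
There are 6 filters, two each of shape (2, 5), (3, 5), and (4, 5). These filters are applied to the matrix. What makes them special is that they are not square; their width equals the width of the embedding matrix. Each convolution therefore produces a column vector.
Each column vector produced by a convolution is downsampled with a max pooling operation.
The results of the max pooling operations are concatenated into a final vector, which is passed to a softmax function for classification.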
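
The steps listed above can be followed on a toy example in plain Python. The embedding values and the filter weights are made up (real filters are learned), and the function name `conv_over_words` is only for this sketch; what matters is the shapes: a filter of height h over 7 words gives 7 - h + 1 values, and max pooling keeps one number per filter.

```python
def conv_over_words(sentence, kernel):
    """Slide a (height x width) kernel down a (words x width) matrix.

    The kernel is as wide as the embedding, so each position yields one number.
    """
    height = len(kernel)
    outputs = []
    for start in range(len(sentence) - height + 1):
        window = sentence[start:start + height]
        total = sum(w * k
                    for row_w, row_k in zip(window, kernel)
                    for w, k in zip(row_w, row_k))
        outputs.append(max(total, 0.0))  # ReLU activation
    return outputs


# 7 words x 5 embedding dimensions, filled with made-up values.
sentence = [[float((i + j) % 3) for j in range(5)] for i in range(7)]

pooled = []
for height in (2, 3, 4):
    for scale in (1.0, -1.0):            # two hypothetical filters per size
        kernel = [[scale] * 5 for _ in range(height)]
        feature_map = conv_over_words(sentence, kernel)
        assert len(feature_map) == 7 - height + 1
        pooled.append(max(feature_map))  # max pooling over word positions

print(len(pooled), pooled)
```

The six pooled values form the final vector that would be passed to the classification layer: one entry per filter, whatever the length of the sentence.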

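II. What is the intuition behind this?

The output of each convolution fires when a particular pattern is detected. By varying the kernel sizes and concatenating their outputs, you can detect patterns of several sizes (2, 3, or 5 adjacent words).

Patterns can be expressions such as "I hate" or "very good" (word-level n-grams?), so a CNN can pick them out of a sentence regardless of their position.
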
  def get_cnn_model():
      embedding_dim = 300

      filter_sizes = [2, 3, 5]
      num_filters = 256
      drop = 0.3

      inputs = Input(shape=(MAX_LENGTH,), dtype='int32')
      embedding = Embedding(input_dim=MAX_NB_WORDS,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],
                            input_length=MAX_LENGTH,
                            trainable=True)(inputs)

      # Add a channel axis so that the sentence can be treated like an image.
      reshape = Reshape((MAX_LENGTH, embedding_dim, 1))(embedding)
      conv_0 = Conv2D(num_filters,
                      kernel_size=(filter_sizes[0], embedding_dim),
                      padding='valid', kernel_initializer='normal',
                      activation='relu')(reshape)

      conv_1 = Conv2D(num_filters,
                      kernel_size=(filter_sizes[1], embedding_dim),
                      padding='valid', kernel_initializer='normal',
                      activation='relu')(reshape)
      conv_2 = Conv2D(num_filters,
                      kernel_size=(filter_sizes[2], embedding_dim),
                      padding='valid', kernel_initializer='normal',
                      activation='relu')(reshape)

      maxpool_0 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[0] + 1, 1),
  
  
  