Processing Word Embeddings with an RNN

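Summary

This article covers the following:

  • Word embeddings
  • Using a recurrent neural network (RNN) architecture
  • Using context windows

Code, Citations, References

Code

Code download

References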

  • Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio. Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding. Interspeech, 2013.
  • Gokhan Tur, Dilek Hakkani-Tur and Larry Heck. What is left to be understood in ATIS?
  • Christian Raymond and Giuseppe Riccardi. Generative and discriminative algorithms for spoken language understanding. Interspeech, 2007.
  • Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
  • Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

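Task

This is a classification task: assign a semantic label (slot) to each word of a sentence.

Dataset

ATIS (Airline Travel Information System) dataset collected by DARPA.
The ATIS dataset contains 4978/893 sentences with 56590/9198 words in total (training/test split). The word labels are given in Inside-Outside-Beginning (IOB) format.

RNN Model

Raw input encoding

Each token corresponds to one word. In the ATIS data every word is mapped to its index in a vocabulary, so each sentence is an array of int32 indices. For example: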

>>> sentence
array([383, 189,  13, 193, 208, 307, 195, 502, 260, 539,
         7,  60,  72,   8, 350, 384], dtype=int32)
>>> map(lambda x: index2word[x], sentence)
['please', 'find', 'a', 'flight', 'from', 'miami', 'florida',
 'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT',
 "o'clock", 'pm']

The labels are associated with the input in the same way:

>>> labels
array([126, 126, 126, 126, 126,  48,  50, 126,  78, 123,
        81, 126,  15,  14,  89,  89], dtype=int32)
>>> map(lambda x: index2label[x], labels)
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'B-fromloc.state_name',
 'O', 'B-toloc.city_name', 'I-toloc.city_name', 'B-toloc.state_name',
 'O', 'B-arrive_time.time_relative', 'B-arrive_time.time',
 'I-arrive_time.time', 'I-arrive_time.time']
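To make the IOB labels easier to read, the following minimal sketch groups them back into slots. The helper name iob_to_spans is hypothetical and not part of the original code; it assumes that a B- tag opens a slot, that matching I- tags extend it, and that anything else closes it.

```python
def iob_to_spans(words, tags):
    """Group IOB tags into (slot_name, phrase) pairs."""
    spans = []
    current_slot, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith('I-') and current_slot == tag[2:]:
            # continuation of the slot that is currently open
            current_words.append(word)
            continue
        # any other tag closes the open slot, if there is one
        if current_slot is not None:
            spans.append((current_slot, ' '.join(current_words)))
            current_slot, current_words = None, []
        if tag != 'O':
            # 'B-x' opens a new slot; a stray 'I-x' is treated the same way
            current_slot, current_words = tag[2:], [word]
    if current_slot is not None:
        spans.append((current_slot, ' '.join(current_words)))
    return spans


words = ['please', 'find', 'a', 'flight', 'from', 'miami', 'florida', 'to',
         'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT', "o'clock",
         'pm']
tags = ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name',
        'B-fromloc.state_name', 'O', 'B-toloc.city_name',
        'I-toloc.city_name', 'B-toloc.state_name', 'O',
        'B-arrive_time.time_relative', 'B-arrive_time.time',
        'I-arrive_time.time', 'I-arrive_time.time']
spans = iob_to_spans(words, tags)
for slot, phrase in spans:
    print(slot, '->', phrase)
```

On the example sentence this yields six slots, for instance toloc.city_name for "las vegas".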

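Context window

A context window turns each word of a sentence into a fixed-length sequence of indices: the word itself plus its neighbours on both sides. It is implemented as follows: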

def contextwin(l, win):
    '''
    win :: int corresponding to the size of the window
    given a list of indexes composing a sentence

    l :: array containing the word indexes

    it will return a list of list of indexes corresponding
    to context windows surrounding each word in the sentence
    '''
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out

The index -1 serves as PADDING for window positions that fall outside the sentence. The processed data looks like this:

>>> x
array([0, 1, 2, 3, 4], dtype=int32)
>>> contextwin(x, 3)
[[-1, 0, 1],
 [ 0, 1, 2],
 [ 1, 2, 3],
 [ 2, 3, 4],
 [ 3, 4,-1]]
>>> contextwin(x, 7)
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1,  0, 1, 2, 3, 4],
 [-1,  0,  1, 2, 3, 4,-1],
 [ 0,  1,  2, 3, 4,-1,-1],
 [ 1,  2,  3, 4,-1,-1,-1]]

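Word embeddings

After the context-window step, each sentence is an array of index windows. These indices then have to be mapped to embeddings, which is implemented as follows: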

import theano, numpy
from theano import tensor as T

# nv :: size of our vocabulary
# de :: dimension of the embedding space
# cs :: context window size
nv, de, cs = 1000, 50, 5

# add one row for PADDING at the end
embeddings = theano.shared(
    0.2 * numpy.random.uniform(-1.0, 1.0, (nv+1, de))
    .astype(theano.config.floatX))

# as many columns as words in the context window
# and as many lines as words in the sentence
idxs = T.imatrix()
x = embeddings[idxs].reshape((idxs.shape[0], de*cs))
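The same lookup can be sketched in plain NumPy, which makes the shapes easy to inspect without compiling a Theano graph. This is an illustration under the assumption that NumPy integer-array indexing behaves like the Theano indexing above; the small idxs matrix is made-up example data.

```python
import numpy as np

nv, de, cs = 1000, 50, 5   # vocabulary size, embedding dimension, window size
rng = np.random.RandomState(0)
embeddings = 0.2 * rng.uniform(-1.0, 1.0, (nv + 1, de))

# context windows for a 4-word sentence; -1 marks PADDING
idxs = np.array([[-1, -1, 3, 7, 2],
                 [-1, 3, 7, 2, 9],
                 [3, 7, 2, 9, -1],
                 [7, 2, 9, -1, -1]], dtype=np.int32)

looked_up = embeddings[idxs]                      # shape (4, cs, de)
x = looked_up.reshape((idxs.shape[0], de * cs))   # shape (4, de * cs)

# index -1 selects the last row, i.e. the extra PADDING row
padding_is_last_row = np.array_equal(embeddings[-1], embeddings[nv])
print(x.shape, padding_is_last_row)
```

This also shows why the matrix has nv+1 rows: the PADDING index -1 wraps around to the additional last row instead of colliding with a real word.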

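E-RNN

The preceding steps turn the raw input into a temporal sequence. An Elman recurrent neural network (E-RNN) takes the current input (time t) and the previous hidden state (time t-1) as input and applies this recurrence step by step.
The parameters the E-RNN has to learn are:
- the word embeddings
- the initial hidden state
- the two matrices that linearly project the input at time t and the previous hidden state at time t-1
- the biases (optional)
- the softmax classification layer on top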
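The recurrence these parameters define can be written out as follows. This is a sketch using symbol names chosen here (W_x, W_h, W, b_h, b, h_0) to mirror the variables wx, wh, w, bh, b and h0 in the implementation in this article; the sigmoid activation is taken from that code as well.

```latex
\begin{aligned}
h_t &= \sigma\left(x_t W_x + h_{t-1} W_h + b_h\right), \qquad t = 1, \dots, T \\
s_t &= \operatorname{softmax}\left(h_t W + b\right) \\
\hat{y}_t &= \arg\max_k \, s_{t,k}
\end{aligned}
```

Here x_t is the concatenated embedding of the context window around word t, h_0 is the learned initial hidden state, and s_t is the distribution over labels for word t.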

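The hyperparameters that define the RNN architecture are:
- the dimension of the word embeddings
- the size of the vocabulary
- the number of hidden units
- the number of classes
- the random seed and the way the model is initialized

The implementation is as follows: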

class RNNSLU(object):
    ''' elman neural net model '''
    def __init__(self, nh, nc, ne, de, cs):
        '''
        nh :: dimension of the hidden layer
        nc :: number of classes
        ne :: number of word embeddings in the vocabulary
        de :: dimension of the word embeddings
        cs :: word window context size
        '''
        # parameters of the model
        self.emb = theano.shared(name='embeddings',
                                 value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                 (ne+1, de))
                                 # add one for padding at the end
                                 .astype(theano.config.floatX))
        self.wx = theano.shared(name='wx',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (de * cs, nh))
                                .astype(theano.config.floatX))
        self.wh = theano.shared(name='wh',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (nh, nh))
                                .astype(theano.config.floatX))
        self.w = theano.shared(name='w',
                               value=0.2 * numpy.random.uniform(-1.0, 1.0,
                               (nh, nc))
                               .astype(theano.config.floatX))
        self.bh = theano.shared(name='bh',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))
        self.b = theano.shared(name='b',
                               value=numpy.zeros(nc,
                               dtype=theano.config.floatX))
        self.h0 = theano.shared(name='h0',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))

        # bundle
        self.params = [self.emb, self.wx, self.wh, self.w,
                       self.bh, self.b, self.h0]

Next, the input is built from the embedding matrix:

        idxs = T.imatrix()
        x = self.emb[idxs].reshape((idxs.shape[0], de*cs))
        y_sentence = T.ivector('y_sentence')  # labels

The recurrence is expressed with theano.scan:

        def recurrence(x_t, h_tm1):
            h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                                 + T.dot(h_tm1, self.wh) + self.bh)
            s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
            return [h_t, s_t]

        [h, s], _ = theano.scan(fn=recurrence,
                                sequences=x,
                                outputs_info=[self.h0, None],
                                n_steps=x.shape[0])

        p_y_given_x_sentence = s[:, 0, :]
        y_pred = T.argmax(p_y_given_x_sentence, axis=1)
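For readers without a Theano installation, the loop that theano.scan performs here can be sketched in plain NumPy. This is an illustrative re-implementation of the forward pass only, with randomly initialized toy weights; the function name forward and the toy sizes are assumptions, not part of the original code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()


def forward(x, wx, wh, w, bh, b, h0):
    """Run the Elman recurrence over x of shape (n_words, de * cs)."""
    h_tm1 = h0
    probs = []
    for x_t in x:
        h_t = sigmoid(x_t.dot(wx) + h_tm1.dot(wh) + bh)
        probs.append(softmax(h_t.dot(w) + b))
        h_tm1 = h_t           # the hidden state is carried to the next word
    return np.array(probs)    # shape (n_words, nc)


rng = np.random.RandomState(0)
n_words, de, cs, nh, nc = 6, 4, 3, 5, 7
x = rng.uniform(-1.0, 1.0, (n_words, de * cs))
wx = 0.2 * rng.uniform(-1.0, 1.0, (de * cs, nh))
wh = 0.2 * rng.uniform(-1.0, 1.0, (nh, nh))
w = 0.2 * rng.uniform(-1.0, 1.0, (nh, nc))
bh, b, h0 = np.zeros(nh), np.zeros(nc), np.zeros(nh)

p_y_given_x = forward(x, wx, wh, w, bh, b, h0)
y_pred = p_y_given_x.argmax(axis=1)   # one predicted label per word
```

In the Theano version, s[:, 0, :] drops a singleton axis, presumably because T.nnet.softmax returns a 2-D result at every step; the NumPy sketch works on 1-D vectors and does not need it.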

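Theano automatically computes the gradients with respect to all parameters. They are used to maximize the log-likelihood, which is the same as minimizing the negative log-likelihood (NLL):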

        # requires "from collections import OrderedDict" at the top of the module
        lr = T.scalar('lr')

        sentence_nll = -T.mean(T.log(p_y_given_x_sentence)
                               [T.arange(x.shape[0]), y_sentence])
        sentence_gradients = T.grad(sentence_nll, self.params)
        sentence_updates = OrderedDict((p, p - lr*g)
                                       for p, g in
                                       zip(self.params, sentence_gradients))
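The indexing trick in the loss and the update rule can be checked on a tiny example in NumPy. The probabilities, labels and gradient values below are made up for illustration, and sgd_step is a hypothetical helper that mirrors the p - lr*g update.

```python
import numpy as np

# predicted label distribution for a 3-word sentence with 3 classes
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])   # true label of each word

# p[arange(n), y] picks, for each word, the probability of its true label
picked = p[np.arange(len(y)), y]
nll = -np.mean(np.log(picked))


def sgd_step(params, grads, lr):
    """One gradient descent step: p <- p - lr * g for every parameter."""
    return [param - lr * grad for param, grad in zip(params, grads)]


new_params = sgd_step([np.array([1.0, 2.0])], [np.array([0.5, -0.5])], lr=0.1)
```

The NLL shrinks as the probabilities assigned to the true labels approach 1, which is why descending its gradient increases the log-likelihood.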

Next, these expressions are compiled into Theano functions:

        self.classify = theano.function(inputs=[idxs], outputs=y_pred)
        self.sentence_train = theano.function(inputs=[idxs, y_sentence, lr],
                                              outputs=sentence_nll,
                                              updates=sentence_updates)

After each update, the word embeddings are normalized so that they stay on the unit sphere:

        self.normalize = theano.function(inputs=[],
                                         updates={self.emb:
                                                  self.emb /
                                                  T.sqrt((self.emb**2)
                                                  .sum(axis=1))
                                                  .dimshuffle(0, 'x')})
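The effect of this normalization can be verified with a NumPy sketch. It assumes that dimshuffle(0, 'x') plays the role of adding a trailing axis so that each row is divided by its own norm; the toy matrix is random example data.

```python
import numpy as np

rng = np.random.RandomState(0)
emb = 0.2 * rng.uniform(-1.0, 1.0, (11, 4))   # 10 words + 1 PADDING row

norms = np.sqrt((emb ** 2).sum(axis=1))        # one L2 norm per row
emb_normalized = emb / norms[:, np.newaxis]    # broadcast over the columns

row_norms = np.linalg.norm(emb_normalized, axis=1)
print(row_norms)
```

Every row of emb_normalized has unit length, so only the direction of each embedding is kept.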

Evaluation

Evaluation compares the predicted labels against the ground-truth labels.
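The logs in this article report an F1 score. As a rough illustration, F1 over labelled spans could be computed as below. This is a sketch under the assumption that a predicted slot counts as correct only if its label and both boundaries match; the article does not show its actual scoring code, and the helper name and the span tuples are made up.

```python
def f1_score(gold_spans, predicted_spans):
    """F1 over (label, start, end) spans; a span must match exactly."""
    gold, pred = set(gold_spans), set(predicted_spans)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / float(len(pred))
    recall = correct / float(len(gold))
    return 2 * precision * recall / (precision + recall)


gold = [('fromloc.city_name', 5, 5),
        ('toloc.city_name', 8, 9),
        ('arrive_time.time', 13, 15)]
pred = [('fromloc.city_name', 5, 5),
        ('toloc.city_name', 8, 8),      # boundary error: counted as wrong
        ('arrive_time.time', 13, 15)]
score = f1_score(gold, pred)
```

With two of three spans correct on both sides, precision and recall are both 2/3, and so is the F1 score.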

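Training

Updates

This article uses mini-batch stochastic gradient descent (SGD); the sentence_train function above performs one update per sentence.

Stopping criterion

Part of the data is held out as a validation set, and the model that scores best on it is always kept.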
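The "keep the best model" rule can be sketched as a small loop. The names train_with_validation, train_epoch and validate are hypothetical, and the toy model and scores below only stand in for real training and real validation F1 values.

```python
import copy


def train_with_validation(model, train_epoch, validate, n_epochs):
    """Train for n_epochs and keep a copy of the best model seen so far."""
    best_score, best_model, best_epoch = float('-inf'), None, -1
    for epoch in range(n_epochs):
        train_epoch(model)          # one pass over the training sentences
        score = validate(model)     # score on the held-out validation set
        if score > best_score:
            best_score, best_epoch = score, epoch
            best_model = copy.deepcopy(model)
    return best_model, best_score, best_epoch


# toy stand-ins: the "model" counts epochs, the scores are made up
model = {'epochs_trained': 0}
fake_scores = iter([90.0, 95.0, 94.0, 97.0, 96.0])


def train_epoch(m):
    m['epochs_trained'] += 1


def validate(m):
    return next(fake_scores)


best_model, best_score, best_epoch = train_with_validation(
    model, train_epoch, validate, n_epochs=5)
```

Copying the model at each new best means later, worse epochs cannot overwrite it.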

Hyperparameter selection

  • learning rate : uniform([0.05,0.01])
  • window size : random value from {3,…,19}
  • number of hidden units : random value from {100,200}
  • embedding dimension : random value from {50,100}
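A random search over these ranges could be sketched as follows. The function and key names are hypothetical. Restricting the window size to odd values is an assumption made here because contextwin asserts that win is odd; the list above only gives the range 3 to 19.

```python
import random


def sample_hyperparameters(rng):
    """Draw one random configuration from the ranges listed above."""
    return {
        'lr': rng.uniform(0.01, 0.05),
        # assumption: only odd sizes, since contextwin requires win % 2 == 1
        'win': rng.choice(list(range(3, 20, 2))),
        'nhidden': rng.choice([100, 200]),
        'emb_dimension': rng.choice([50, 100]),
    }


rng = random.Random(0)
configs = [sample_hyperparameters(rng) for _ in range(20)]
```

Each sampled configuration would then be trained and compared on the validation set.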

Running the code

python code/rnnslu.py

('NEW BEST: epoch', 25, 'valid F1', 96.84, 'best test F1', 93.79)
[learning] epoch 26 >> 100.00% completed in 28.76 (sec) <<
[learning] epoch 27 >> 100.00% completed in 28.76 (sec) <<
...
('BEST RESULT: epoch', 57, 'valid F1', 97.23, 'best test F1', 94.2, 'with the model', 'rnnslu')

Timing

On an i7 CPU 950 @ 3.07GHz, one epoch takes less than 40 seconds and uses less than 200 MB of RAM.

Performance

NEW BEST: epoch 28 valid F1 96.61 best test F1 94.19
NEW BEST: epoch 29 valid F1 96.63 best test F1 94.42
[learning] epoch 30 >> 100.00% completed in 35.04 (sec) <<
[learning] epoch 31 >> 100.00% completed in 34.80 (sec) <<
[...]
NEW BEST: epoch 40 valid F1 97.25 best test F1 94.34
[learning] epoch 41 >> 100.00% completed in 35.18 (sec) <<
NEW BEST: epoch 42 valid F1 97.33 best test F1 94.48
[learning] epoch 43 >> 100.00% completed in 35.39 (sec) <<
[learning] epoch 44 >> 100.00% completed in 35.31 (sec) <<
[...]

Published: 2024-01-31 05:42:46

Article link: https://www.4u4v.net/it/170665097025967.html
