This is a series on text classification, in which we implement text classifiers with different methods, progressing from simple to complex.
We use the Stanford Sentiment Treebank movie review dataset (Socher et al. 2013). The dataset can be downloaded here:
Link: dataset
Extraction code: yeqw
For the code, see: Text Classification
We denote a sentence by X = {x_1, x_2, x_3, …, x_n}, where x_t is the t-th word in the sentence. We write emb for the word embedding function; that is, emb(x) returns a d-dimensional word vector.
First, we define a word-averaging sentence encoder:
$$h_{avg} = \frac{1}{|X|} \sum_t \mathrm{emb}(x_t)$$
Then the probability that the sentence expresses positive sentiment is:
$$\mathrm{pos} = \sigma(w^T h_{avg})$$
Here σ is the logistic (sigmoid) function, σ(z) = 1/(1 + e^{-z}), and w is a d-dimensional vector. If pos ≥ 0.5 the classifier returns positive sentiment; otherwise it returns negative sentiment.
We train with the binary log loss. The parameters of the whole model are the embedding function emb and the vector w; note that the word-vector dimension d must match the dimension of w. Some words appear in DEV and TEST but not in TRAIN. For these we can use a randomly initialized vector (a special UNK vector). When initializing word vectors, be careful not to use too large a range: if the norms of the unknown-word vectors are too large, model performance can degrade. So here we initialize the embeddings with random values between -0.1 and 0.1.
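For reference, the binary log loss (binary cross-entropy) for a single example with gold label y ∈ {0, 1} is:

$$L = -\big[\, y \log(\mathrm{pos}) + (1 - y) \log(1 - \mathrm{pos}) \,\big]$$

Averaged over the training set, this is exactly what nn.BCEWithLogitsLoss computes below, starting from the raw score $w^T h_{avg}$ before the sigmoid.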
import random
from collections import Counter
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
USE_CUDA = torch.cuda.is_available()
device = torch.device('cuda' if USE_CUDA else 'cpu')
with open('senti.train.tsv', 'r') as rf:
    lines = rf.readlines()
print(lines[:10])
['hide new secretions from the parental units\t0\n', 'contains no wit , only labored gags\t0\n', 'that loves its characters and communicates something rather beautiful about human nature\t1\n', 'remains utterly satisfied to remain the same throughout\t0\n', 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up\t0\n', "that 's far too tragic to merit such superficial treatment\t0\n", 'demonstrates that the director of such Hollywood blockbusters as Patriot Games can still turn out a small , personal film with an emotional wallop .\t1\n', 'of saucy\t1\n', "a depressed fifteen-year-old 's suicidal poetry\t0\n", "are more deeply thought through than in most ` right-thinking ' films\t1\n"]
Each line has the format sentence\tlabel\n, so we split on the tab character:
def read_corpus(path):
    sentences = []
    labels = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            sentence, label = line.split('\t')
            sentences.append(sentence.lower().split())
            labels.append(label[0])
    return sentences, labels
train_path, dev_path, test_path = 'senti.train.tsv', 'senti.dev.tsv', 'senti.test.tsv'
train_sentences, train_labels = read_corpus(train_path)
dev_sentences, dev_labels = read_corpus(dev_path)
test_sentences, test_labels = read_corpus(test_path)
print(len(train_sentences)), print(len(train_labels))
67349
67349
train_sentences[1], train_labels[1]
(['contains', 'no', 'wit', ',', 'only', 'labored', 'gags'], '0')
def build_vocab(sentences, word_size=20000):
    c = Counter()
    for sent in sentences:
        for word in sent:
            c[word] += 1
    print('Total number of unique words:', len(c))
    words_most_common = c.most_common(word_size)
    ## add pad and unk tokens
    idx2word = ['<pad>', '<unk>'] + [item[0] for item in words_most_common]
    word2dix = {w: i for i, w in enumerate(idx2word)}
    return idx2word, word2dix
WORD_SIZE=20000
idx2word, word2dix = build_vocab(train_sentences, word_size=WORD_SIZE)
Total number of unique words: 14828
idx2word[:10]
['<pad>', '<unk>', 'the', ',', 'a', 'and', 'of', '.', 'to', "'s"]
def numeralization(sentences, labels, word2idx):
    'Convert sentences from lists of words to lists of indices'
    numeral_sent = [[word2idx.get(w, word2idx['<unk>']) for w in s] for s in sentences]
    numeral_label = [int(label) for label in labels]
    return list(zip(numeral_sent, numeral_label))
num_train_data = numeralization(train_sentences, train_labels, word2dix)
num_test_data = numeralization(test_sentences, test_labels, word2dix)
num_dev_data = numeralization(dev_sentences, dev_labels, word2dix)
def convert2tensor(batch_sentences):
    'Convert a batch to a tensor; shorter sentences are padded here'
    lengths = [len(s) for s in batch_sentences]
    max_len = max(lengths)
    batch_size = len(batch_sentences)
    batch = torch.zeros(batch_size, max_len, dtype=torch.long)
    for i, l in enumerate(lengths):
        batch[i, :l] = torch.tensor(batch_sentences[i])
    return batch
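A quick sanity check of the padding behaviour on a toy batch (the index values here are made up):

batch = convert2tensor([[2, 5, 7], [4, 9]])
print(batch)
## tensor([[2, 5, 7],
##         [4, 9, 0]])  ## the shorter sentence is right-padded with 0, i.e. <pad>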
def generate_batch(numeral_sentences_labels, batch_size=32):
    '''Split the index-list data into batches'''
    batches = []
    num_sample = len(numeral_sentences_labels)
    random.shuffle(numeral_sentences_labels)
    numeral_sent = [n[0] for n in numeral_sentences_labels]
    numeral_label = [n[1] for n in numeral_sentences_labels]
    for start in range(0, num_sample, batch_size):
        end = min(start + batch_size, num_sample)
        batch_sentences = numeral_sent[start:end]
        batch_labels = numeral_label[start:end]
        batch_sent_tensor = convert2tensor(batch_sentences)
        batch_label_tensor = torch.tensor(batch_labels, dtype=torch.float)
        batches.append((batch_sent_tensor.to(device), batch_label_tensor.to(device)))
    return batches
train_data = generate_batch(num_train_data)
a = train_data[4]
text,label=a
text
tensor([[    2,  1470,     0,  ...,     0,     0,     0],
        [ 3789,     0,     0,  ...,     0,     0,     0],
        [ 2056,    15,   283,  ...,     0,     0,     0],
        ...,
        [11711,     3, 12789,  ...,    42,  2365,     7],
        [ 1484,   524,     0,  ...,     0,     0,     0],
        [  308,    11,    10,  ...,     0,     0,     0]], device='cuda:0')
class AVGModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, output_size, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc = nn.Linear(embed_dim, output_size)

    def forward(self, text):
        ## [batch_size, seq_len] -> [batch_size, seq_len, embed_dim]
        embed = self.embedding(text)
        ## average over the sequence dimension:
        ## [batch_size, seq_len, embed_dim] -> [batch_size, embed_dim]
        pooled = F.avg_pool2d(embed, (embed.size(1), 1)).squeeze(1)
        ## [batch_size, embed_dim] -> [batch_size, output_size]
        out = self.fc(pooled)
        return out

    def get_embed_weigth(self):
        return self.embedding.weight.data
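One subtlety: F.avg_pool2d divides by the padded length max_len, not by each sentence's true length. Because padding_idx keeps the <pad> embedding at zero, the sum is unaffected, but heavily padded sentences end up with smaller averages. A minimal sketch of a length-aware alternative, assuming the pad index is passed in (a hypothetical variant, not the code used below):

def masked_average(embed, text, pad_idx):
    ## embed: [batch_size, seq_len, embed_dim]; text: [batch_size, seq_len]
    mask = (text != pad_idx).unsqueeze(-1).float()  ## 1 for real tokens, 0 for padding
    lengths = mask.sum(dim=1).clamp(min=1)          ## true length of each sentence
    return (embed * mask).sum(dim=1) / lengths      ## [batch_size, embed_dim]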
VOCAB_SIZE = len(word2dix)
EMBEDDING_DIM = 100
OUTPUT_SIZE = 1
PAD_IDX = word2dix['<pad>']
model = AVGModel(vocab_size=VOCAB_SIZE,embed_dim=EMBEDDING_DIM,output_size=OUTPUT_SIZE, pad_idx=PAD_IDX)
model.to(device)
AVGModel(
(embedding): Embedding(14830, 100, padding_idx=0)
(fc): Linear(in_features=100, out_features=1, bias=True)
)
criterion = nn.BCEWithLogitsLoss()  ## applies the sigmoid internally, so the model outputs raw logits
criterion = criterion.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
def get_accuracy(output, label):
    ## output: [batch_size]
    y_hat = torch.round(torch.sigmoid(output))  ## convert logits to 0/1 predictions
    correct = (y_hat == label).float()
    acc = correct.sum() / len(correct)
    return acc
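A quick sanity check of the rounding on toy logits:

out = torch.tensor([2.3, -1.7, 0.4])
torch.round(torch.sigmoid(out))  ## tensor([1., 0., 1.])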
def evaluate(batch_data, model, criterion, get_accuracy):
    model.eval()
    num_epoch = epoch_loss = epoch_acc = 0
    with torch.no_grad():
        for text, label in batch_data:
            out = model(text).squeeze(1)
            loss = criterion(out, label)
            acc = get_accuracy(out, label)
            num_epoch += 1
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / num_epoch, epoch_acc / num_epoch
def train(batch_data, model, criterion, optimizer, get_accuracy):
    model.train()
    num_epoch = epoch_loss = epoch_acc = 0
    for text, label in batch_data:
        optimizer.zero_grad()
        out = model(text).squeeze(1)  ## [batch_size, 1] -> [batch_size], to match label
        loss = criterion(out, label)
        acc = get_accuracy(out, label)
        loss.backward()
        optimizer.step()
        num_epoch += 1
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / num_epoch, epoch_acc / num_epoch
NUM_EPOCH = 30
best_valid_acc = -1
dev_data = generate_batch(num_dev_data)
for epoch in range(NUM_EPOCH):
    train_data = generate_batch(num_train_data)
    train_loss, train_acc = train(train_data, model, criterion, optimizer, get_accuracy)
    valid_loss, valid_acc = evaluate(dev_data, model, criterion, get_accuracy)
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'avg-model.pt')
    print(f'Epoch: {epoch+1:02} :')
    print(f'\t Train Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Valid Loss: {valid_loss:.4f} | Valid Acc: {valid_acc*100:.2f}%')
Epoch: 01 :
Train Loss: 0.1558 | Train Acc: 94.39%
Valid Loss: 0.6171 | Valid Acc: 82.25%
Epoch: 02 :
Train Loss: 0.1550 | Train Acc: 94.45%
Valid Loss: 0.6319 | Valid Acc: 81.47%
Epoch: 03 :
Train Loss: 0.1526 | Train Acc: 94.53%
Valid Loss: 0.6300 | Valid Acc: 82.59%
Epoch: 04 :
Train Loss: 0.1510 | Train Acc: 94.60%
Valid Loss: 0.6502 | Valid Acc: 81.25%
Epoch: 05 :
Train Loss: 0.1495 | Train Acc: 94.64%
Valid Loss: 0.6515 | Valid Acc: 82.37%
model.load_state_dict(torch.load('avg-model.pt'))
<All keys matched successfully>
test_data = generate_batch(num_test_data)
test_loss, test_acc = evaluate(test_data, model, criterion, get_accuracy)
print(f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc*100:.2f}%')
Test Loss: 0.5369 | Test Acc: 81.23%
embed = model.get_embed_weigth()
embed_norm = torch.norm(embed, p=None, dim=1)
sort_embed_norm, sort_embed_norm_idx = embed_norm.sort()
print('30 words with the smallest norm:')
for idx in sort_embed_norm_idx[:30].tolist():
    print(idx2word[idx], end=' / ')
30 words with the smallest norm:
par / holiday / pastiche / seedy / e-graveyard / quieter / home / captain / keeps / possibly / urge / aching / career / album / code / elegy / peculiar / squint / handheld / blown / quite / cops / miss / the / blush / judd / trip / appointed / make / themselves /
print('30 words with the largest norm:')
for idx in sort_embed_norm_idx[-30:].tolist():
    print(idx2word[idx], end=' / ')
30 words with the largest norm:
wonderfully / lousy / unlikable / choppy / badly / splendid / worst / dazzling / outstanding / inept / listless / lacking / playful / mesmerizing / unnecessary / amazing / stunning / irritating / unimaginative / refreshingly / heartwarming / devoid / riveting / suffers / tiresome / pointless / thought-provoking / poorly / mess / unfunny /
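Since the classifier is linear on top of the averaged embeddings, each word's standalone sentiment score is simply σ(wᵀ emb(word)): words the model treats as positive score near 1, negative ones near 0. A small sketch of ranking the whole vocabulary this way, reusing model, embed, and idx2word from above (an extra illustration, not part of the original pipeline):

with torch.no_grad():
    word_scores = torch.sigmoid(model.fc(embed)).squeeze(1)  ## [vocab_size]
sorted_scores, sorted_idx = word_scores.sort()
print('Most negative words:', [idx2word[i] for i in sorted_idx[:10].tolist()])
print('Most positive words:', [idx2word[i] for i in sorted_idx[-10:].tolist()])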