Learning objectives
We will use torchtext to build the vocabulary and then read the data in batches. Please read the torchtext README on your own to learn how the library works.
In [1]:
import torchtext
from torchtext.vocab import Vectors
import torch
import numpy as np
import random

USE_CUDA = torch.cuda.is_available()

# To make experimental results reproducible, we usually fix every random seed to a constant value
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if USE_CUDA:
    torch.cuda.manual_seed(53113)

BATCH_SIZE = 32
EMBEDDING_SIZE = 650
MAX_VOCAB_SIZE = 50000
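As a quick check of why the seeds are fixed, here is a minimal sketch that uses only Python's standard random module. The helper name draw is hypothetical and not part of the original notebook. Reseeding with the same value should reproduce the same sequence of draws, while a different seed gives a different one.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random floats drawn right after seeding the generator."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(53113)
second = draw(53113)   # same seed as above
other = draw(53114)    # a different seed, chosen only for contrast

print(first == second)  # identical sequences when the seed matches
print(first == other)   # sequences differ when the seed changes
```

The same idea applies to numpy and torch: each library keeps its own generator, which is why the notebook seeds all of them separately.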
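A Field determines how your data is processed. We use the TEXT field to handle text data. Our TEXT field sets lower=True, so every word is lowercased. build_vocab builds a vocabulary of the most frequent words from the training dataset we provide, and max_size caps the total number of words.
In [2]: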
TEXT = torchtext.data.Field(lower=True)
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".", train="", validation="", test="", text_field=TEXT)
TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
print("vocabulary size: {}".format(len(TEXT.vocab)))

VOCAB_SIZE = len(TEXT.vocab)
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=BATCH_SIZE, device=-1, bptt_len=32, repeat=False, shuffle=True)
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
vocabulary size: 50002
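To make the vocabulary step concrete, here is a minimal sketch in plain Python. It is not torchtext's implementation: the function build_vocab below is a hypothetical stand-in that only mimics the idea of lowercasing tokens, keeping the max_size most frequent words, and reserving two special tokens in front of them.

```python
from collections import Counter


def build_vocab(tokens, max_size, specials=("<unk>", "<pad>")):
    """Build (itos, stoi) from the max_size most frequent lowercased tokens.

    Hypothetical helper for illustration only; tie-breaking and ordering
    are assumptions and may differ from torchtext.
    """
    counts = Counter(tok.lower() for tok in tokens)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [word for word, _ in ordered[:max_size]]
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi


tokens = "The cat sat on the mat and the dog sat too".split()
itos, stoi = build_vocab(tokens, max_size=3)

print(len(itos))  # 3 kept words + 2 special tokens = 5
unk = stoi["<unk>"]
ids = [stoi.get(tok.lower(), unk) for tok in ["The", "dog", "sat"]]
print(ids)        # "dog" falls outside the top 3, so it maps to <unk>
```

The same arithmetic explains the output above: a cap of 50000 words plus two special tokens gives 50002 entries.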
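The vocabulary also contains two special tokens: <unk> represents unknown words and <pad> represents padding, which is why its size is 50002 rather than 50000.
In [5]: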
it = iter(train_iter)
batch = next(it)
print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,1].data]))
print(" ".join([TEXT.vocab.itos[i] for i in b
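The [:,1] indexing in the cell above selects a single column of a batch. The sketch below shows, in plain Python, one way such batches can be cut from a long token stream. It assumes the usual language-modeling layout: each batch has a text block of shape (bptt_len, batch_size) and a target block holding the same data shifted by one token. The helper bptt_batches is hypothetical and simplified. Leftover tokens are dropped here for simplicity, which may differ from what torchtext does.

```python
def bptt_batches(ids, batch_size, bptt_len):
    """Yield (text, target) pairs, each a list of rows of length batch_size.

    Hypothetical helper for illustration only. Tokens that do not fill a
    full column are dropped (a simplification).
    """
    col_len = len(ids) // batch_size
    # columns[b] is the b-th contiguous stretch of the token stream.
    columns = [ids[b * col_len:(b + 1) * col_len] for b in range(batch_size)]
    for start in range(0, col_len - 1, bptt_len):
        seq_len = min(bptt_len, col_len - 1 - start)
        text = [[col[start + t] for col in columns] for t in range(seq_len)]
        target = [[col[start + t + 1] for col in columns] for t in range(seq_len)]
        yield text, target


ids = list(range(20))  # stand-in for a stream of word indices
batches = list(bptt_batches(ids, batch_size=2, bptt_len=4))
text, target = batches[0]

print([row[1] for row in text])    # column 1 of the text block
print([row[1] for row in target])  # the same column, shifted by one token
```

Reading down a column gives consecutive tokens from the corpus, and the target at each position is simply the next token, which is exactly what a language model is trained to predict.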
Published on 2024-02-02 14:10:58
Article link: https://www.4u4v.net/it/170685425744324.html