This is day three of my attempt to run the MNIST handwritten-digit dataset with the PyTorch framework; the main topic today is training the network. This blog mainly records my learning path and collects the learning resources I used.
Note: the code below is written for Python 2.7.
Day 1 (building the LeNet network):
Day 2 (loading the MNIST dataset):
Day 3 (training the model):
Day 4 (testing on single samples):
Thanks to 凯神 for providing the code and for the patient guidance!
from lenet import Net
import torch
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from mnist_load import testset_loader, trainset_loader

LEARNING_RATE = 0.001
MOMENTUM = 0.9
EPOCH = 5

if torch.cuda.is_available():
    device = torch.device('cuda')
    print 'cuda'
else:
    device = torch.device('cpu')
    print 'cpu'

mnist_model = Net().to(device)
optimizer = optim.SGD(mnist_model.parameters(),
                      lr=LEARNING_RATE,
                      momentum=MOMENTUM)

# save_model
def save_checkpoint(checkpoint_path, model, optimizer):
    # state_dict: a Python dictionary object that:
    #   - for a model, maps each layer to its parameter tensor;
    #   - for an optimizer, contains info about the optimizer's states and hyperparameters used.
    state = {'model': model.state_dict(),
             'optimizer': optimizer.state_dict()}
    torch.save(state, checkpoint_path)
    print 'model saved to ', checkpoint_path

# train
def mnist_train(epoch, save_interval):
    mnist_model.train()  # set training mode
    iteration = 0
    loss_plt = []
    for ep in range(epoch):
        for batch_idx, batch_data in enumerate(trainset_loader):
            images, labels = batch_data
            images = images.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()  # clear the gradients accumulated in the previous iteration
            output = mnist_model(images)
            loss = F.cross_entropy(output, labels)
            loss_plt.append(loss.item())  # store the scalar value, not the tensor
            loss.backward()
            optimizer.step()
            print 'Train Epoch:', ep + 1, '\tBatch: ', batch_idx + 1, '/', len(trainset_loader), '\tLoss: ', loss.item()
            # different from before: saving model checkpoints
            if iteration % save_interval == 0 and iteration > 0:
                save_checkpoint('module/pytorch-mnist-batchsize-128-%i.pth' % iteration, mnist_model, optimizer)
            iteration += 1
        mnist_test()
        mnist_model.train()  # mnist_test() switched the model to eval mode; switch back
    # save the final model
    save_checkpoint('module/pytorch-mnist-batch-128-%i.pth' % iteration, mnist_model, optimizer)
    plt.plot(loss_plt, label='loss')
    plt.legend()
    plt.show()

# test
def mnist_test():
    mnist_model.eval()  # set evaluation mode
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for images, labels in testset_loader:
            images = images.to(device)
            labels = labels.to(device)
            output = mnist_model(images)
            test_loss += F.cross_entropy(output, labels).item()
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(labels.view_as(pred)).sum().item()
    test_loss /= len(testset_loader.dataset)
    print '\nTest set: Average loss:', test_loss, '\tAccuracy:', (100. * correct / len(testset_loader.dataset)), '%\n'

if __name__ == '__main__':
    mnist_train(EPOCH, save_interval=1000)
At first I thought my machine was simply too low-spec (not enough memory), so each batch couldn't load too many images, and I kept tweaking the BATCH_SIZE hyperparameter. When lowering BATCH_SIZE over and over still threw errors, I realized it probably wasn't a memory-capacity problem.
After looking it up, it turned out to be an issue with the number of worker threads used to load the data (batches); see the sketch below.
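The relevant knob is the num_workers argument of the DataLoader. A minimal sketch, assuming the loaders are built in mnist_load.py from torchvision's built-in MNIST dataset (the 'data' root path and BATCH_SIZE = 128 are my assumptions, the 128 guessed from the checkpoint file names):

import torch
from torchvision import datasets, transforms

BATCH_SIZE = 128  # assumed value, taken from the checkpoint file names above

trainset = datasets.MNIST('data', train=True, download=True,
                          transform=transforms.ToTensor())
trainset_loader = torch.utils.data.DataLoader(
    trainset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0)  # 0 = load batches in the main process

With num_workers=0 everything runs in the main process, which sidesteps the worker-related errors; larger values spawn that many loader processes.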
OK, so it turns out that when Python writes a file, it does not automatically create folders that are missing from the path. Noted!
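In other words, the 'module/' folder that save_checkpoint() writes into has to exist beforehand. A small sketch of the workaround, using only the standard library:

import os

# create the checkpoint folder if it does not exist yet;
# otherwise torch.save('module/...') fails because the folder is missing
if not os.path.isdir('module'):
    os.makedirs('module')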
凯神's explanation: MOMENTUM (momentum) is a parameter used in stochastic gradient descent when updating the model weights.
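Roughly speaking, with momentum the optimizer keeps a running "velocity" built from past gradients instead of using only the current one. A simplified, hand-written sketch of the update optim.SGD performs for a single weight (ignoring dampening, weight decay and Nesterov; the numbers are made up):

LEARNING_RATE = 0.001
MOMENTUM = 0.9

w = 1.0                        # a single weight
v = 0.0                        # momentum buffer ("velocity")
gradients = [0.5, 0.4, 0.45]   # made-up gradients from three iterations

for g in gradients:
    v = MOMENTUM * v + g           # accumulate a running mix of past gradients
    w = w - LEARNING_RATE * v      # step along the smoothed direction
    print 'w =', w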
This refers to images.to(device) / labels.to(device): it copies the tensors produced when the data is first read onto the specified device, so that all subsequent computation happens on that device.
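One detail about the "copy": for a tensor, .to(device) returns a new tensor on the target device and leaves the original unchanged, which is why the result must be assigned back; for a module such as Net(), .to(device) moves the parameters in place. A small standalone check:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.zeros(2, 3)   # created on the CPU
y = x.to(device)        # y is a copy on `device`; x itself is unchanged
print x.device, y.device

# this is why the training loop writes: images = images.to(device)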
if __name__ == "__main__"
A checkpoint is a way to save the current state of your experiment so that you can pick up from where you left off.
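The code above only writes checkpoints; here is a minimal sketch of the matching load side, assuming the same {'model', 'optimizer'} dictionary layout used in save_checkpoint() (the name load_checkpoint is mine, not from the original code):

import torch

def load_checkpoint(checkpoint_path, model, optimizer):
    # restore the state saved by save_checkpoint() so training can resume
    state = torch.load(checkpoint_path)
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    print 'model loaded from ', checkpoint_path

# usage (the file name depends on which checkpoint was saved):
# load_checkpoint('module/pytorch-mnist-batchsize-128-1000.pth', mnist_model, optimizer)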
This is about optimizer.zero_grad(): the gradients are computed automatically during the later backward pass and are accumulated, so they have to be cleared first, otherwise the gradients from the previous iteration would contaminate the gradients of the current iteration.
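A tiny standalone illustration (not part of the MNIST code): backward() adds into .grad, so without zeroing, the second call sees the sum of both iterations:

import torch

w = torch.ones(1, requires_grad=True)

loss = (2 * w).sum()
loss.backward()
print w.grad        # tensor([2.])

loss = (2 * w).sum()
loss.backward()
print w.grad        # tensor([4.])  <- accumulated on top of the old gradient

w.grad.zero_()      # this is what optimizer.zero_grad() does for every parameter
print w.grad        # tensor([0.])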
At first I couldn't quite figure out what these two functions, loss.backward() and optimizer.step(), each do. Later I watched a video that explained it with an analogy, and then it clicked.
In linear regression, the weight update formula is: w_new = w_old - lr * gradient
loss.backward() is the part that computes the gradient.
optimizer.step() is the part that uses the gradient to compute w_new = w_old - lr * gradient; see the toy example below.
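A toy example (mine, not from the original post) that makes the division of labour concrete: backward() fills w.grad, and step() then applies the update rule above:

import torch
import torch.optim as optim

# one weight, one data point: prediction = w * x, target t
w = torch.tensor([1.0], requires_grad=True)
x, t = torch.tensor([2.0]), torch.tensor([6.0])
opt = optim.SGD([w], lr=0.1)

opt.zero_grad()
loss = ((w * x - t) ** 2).mean()
loss.backward()      # computes the gradient: w.grad = 2 * x * (w*x - t) = -16
print w.grad         # tensor([-16.])
opt.step()           # applies w_new = w_old - lr * gradient = 1 - 0.1 * (-16)
print w              # tensor([2.6000], requires_grad=True)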
Use both (model.eval() and torch.no_grad()). They do different things, and have different scopes.
with torch.no_grad(): disables tracking of gradients in autograd.
model.eval(): changes the forward() behaviour of the module it is called upon, e.g. it disables dropout and has batch norm use the entire population statistics.
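A small standalone illustration of the two scopes (the Dropout layer is only there for demonstration; the LeNet used in this series may not have one):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.eval()                   # changes forward(): dropout becomes a no-op
y = net(x)
print y.requires_grad        # True  -> autograd still tracks the operations

with torch.no_grad():        # disables gradient tracking only
    y = net(x)
    print y.requires_grad    # False -> no computation graph is built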