Example code for PyTorch training with single precision, half precision, mixed precision, a single GPU, multiple GPUs (DP / DDP), FSDP, and DeepSpeed, plus model saving, model inference, ONNX export, and onnxruntime inference, with comparisons of training speed and GPU memory usage across the different methods.
Code: pytorch_model_train
Before going through the individual training approaches, take a look at the decision flow FairScale provides for choosing a training strategy; the approach that fits your own situation is the best one.
Note: single-GPU half-precision training only reaches about 75% accuracy, while single-precision training reaches about 85%.
AUTOMATIC MIXED PRECISION PACKAGE - TORCH.AMP
CUDA AUTOMATIC MIXED PRECISION EXAMPLES
PyTorch source-code walkthrough of torch.cuda.amp: automatic mixed precision explained
How to do half-precision and mixed-precision training with PyTorch
How to do half-precision training with PyTorch
PyTorch model training: fp16, AMP, multi-GPU models, gradient checkpointing, and other GPU memory optimizations
Working with Multiple GPUs
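Before the mixed-precision example below, here is a minimal sketch of pure half-precision training (which, per the note above, only reaches about 75% accuracy) for contrast; Net, train_loader, and the hyperparameters are placeholder assumptions, not code from this repository.

import torch
import torch.optim as optim

# Pure fp16 sketch: the whole model and its inputs are cast to float16.
# `Net` and `train_loader` are placeholders, not the actual training code.
model = Net().cuda().half()
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for input, target in train_loader:
    input, target = input.cuda().half(), target.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    loss.backward()      # gradients are float16 and can underflow, which hurts accuracy
    optimizer.step()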
# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
# Gradient clipping with AMP: unscale the gradients before clipping.
scaler.scale(loss).backward()

# Unscales the gradients of the optimizer's assigned params in-place.
scaler.unscale_(optimizer)

# Since the gradients are now unscaled, clip as usual.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# scaler.step() knows the gradients are already unscaled and will not unscale them again,
# but it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)
scaler.update()
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, calls f_float16.float() to use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    e_float16 = torch.mm(a_float32, b_float32)

    with torch.autocast(device_type="cuda", enabled=False):
        # Calls e_float16.float() to ensure float32 execution
        # (necessary because e_float16 was created in an autocasted region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are required when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)
pytorch-multi-gpu-training
/ddp_train.py
DISTRIBUTED COMMUNICATION PACKAGE - TORCH.DISTRIBUTED
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 pytorch_DDP.py
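The DDP script launched above is not reproduced here; the following is a minimal sketch of the typical DDP setup inside such a script. Net and train_dataset are placeholder assumptions, and recent launchers export LOCAL_RANK, while older torch.distributed.launch versions pass --local_rank as a command-line argument instead.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Minimal DDP setup sketch; `Net` and `train_dataset` are placeholders rather than
# the actual objects in pytorch_DDP.py.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = Net().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# DistributedSampler gives every process its own shard of the dataset.
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=128, sampler=sampler)

for epoch in range(5):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for input, target in loader:
        input, target = input.cuda(local_rank), target.cuda(local_rank)
        ...  # forward / backward / optimizer.step() as in single-GPU training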
huggingface/accelerate
Hugging Face's open-source library accelerate explained
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
accelerate launch --config_file ./config/l accelerate_DDP.py
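accelerate_DDP.py itself is not shown here; below is a minimal sketch of how a training loop is usually adapted to Accelerate. Net, train_dataset, and the hyperparameters are placeholder assumptions.

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Minimal Accelerate sketch; `Net` and `train_dataset` are placeholders.
accelerator = Accelerator()  # picks up the settings from the `accelerate config` file

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
loss_fn = torch.nn.CrossEntropyLoss()

# prepare() moves everything to the right device and wraps the model for DDP/FSDP.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(5):
    for input, target in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(input), target)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()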
A first look at PyTorch Fully Sharded Data Parallel (FSDP)
In 2023, should large-model training still use PyTorch's FSDP?
GETTING STARTED WITH FULLY SHARDED DATA PARALLEL(FSDP)
batch_size == 1
batch_size == 128
Code file: pytorch_FSDP.py
Training time (5 epochs): 581 s
Training result: about 85% accuracy
Note: in PyTorch FSDP, batch_size refers to the batch size on a single GPU.
Note: to save the FSDP model, we need to call state_dict on each rank and then save the overall state on rank 0. In other words, save the FSDP model with code of the following form (otherwise saving the model will hang):
states = model.state_dict()
if rank == 0:
    torch.save(states, "model.pt")
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 pytorch_FSDP.py
The code applies FSDP to the Linear and Conv2d layers of the ResNet50 model.
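One way to express that wrapping choice is sketched below; the policy shown (lambda_auto_wrap_policy keyed on Linear/Conv2d) is an assumption for illustration and may differ from the policy actually used in pytorch_FSDP.py.

import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy
from torchvision.models import resnet50

# Sketch: wrap every Linear and Conv2d submodule of ResNet50 in its own FSDP unit.
# The real pytorch_FSDP.py may use a different policy (e.g. size-based wrapping).
wrap_policy = functools.partial(
    lambda_auto_wrap_policy,
    lambda_fn=lambda m: isinstance(m, (nn.Linear, nn.Conv2d)),
)

model = resnet50().cuda()
model = FSDP(model, auto_wrap_policy=wrap_policy)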
batch_size == 1
batch_size == 128
Code file: accelerate_FSDP.py
Training time (5 epochs): 576 s; for this small model the speed is on par with DDP
Training result: about 85% accuracy
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_min_num_params: 1000000
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
accelerate launch --config_file ./config/l accelerate_FSDP.py
[BUG] error: unrecognized arguments: --deepspeed ./ds_config.json #3961
fused_adam.so: cannot open shared object file: No such file or directory #119
DeepSpeedExamples/training/cifar/
Getting Started
Code file: pytorch_DeepSpeed.py
Single-GPU memory usage:
Peak single-GPU utilization:
Training time (5 epochs):
Training result:
Launch command (single node, 4 GPUs)
deepspeed pytorch_DeepSpeed.py --deepspeed_config ./config/zero_stage2_config.json
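pytorch_DeepSpeed.py is not reproduced here; this is a minimal sketch of the usual DeepSpeed integration pattern. Net and train_dataset are placeholder assumptions, and the ZeRO stage-2 settings live in the zero_stage2_config.json file passed on the command line above.

import argparse
import deepspeed
import torch

# Minimal DeepSpeed sketch; `Net` and `train_dataset` are placeholders, and the
# ZeRO stage-2 settings come from the --deepspeed_config file passed at launch.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # filled in by the deepspeed launcher
parser = deepspeed.add_config_arguments(parser)            # adds --deepspeed / --deepspeed_config
args = parser.parse_args()

model = Net()
model_engine, optimizer, loader, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    training_data=train_dataset,
)

loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(5):
    for input, target in loader:
        input = input.to(model_engine.device)
        target = target.to(model_engine.device)
        loss = loss_fn(model_engine(input), target)
        model_engine.backward(loss)   # replaces loss.backward()
        model_engine.step()           # optimizer step + zero_grad handled by the engine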
Introduction to DeepSpeed
In-depth analysis: how to use DeepSpeed to accelerate PyTorch model training
DeepSpeed
See the training code file for each method for details.
# Single-GPU model: save the state_dict directly.
torch.save(model.state_dict(), model_name)

# DP / DDP model: the real model is wrapped in .module.
torch.save(model.module.state_dict(), model_name)

# FSDP model: gather the state_dict on every rank, then save only on rank 0.
states = model.state_dict()
if rank == 0:
    torch.save(states, model_name)
Model inference, ONNX export, and onnxruntime inference: see the model_inference.py code file for details.
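model_inference.py itself is not reproduced here; the sketch below shows a typical ONNX export plus onnxruntime inference flow. The file name resnet50.onnx, the input shape, and the input/output tensor names are illustrative assumptions, not taken from that file.

import numpy as np
import onnxruntime as ort
import torch
from torchvision.models import resnet50

# Export sketch: the file name, input shape, and tensor names are illustrative.
model = resnet50()
# In practice, load the trained weights here before exporting.
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # variable batch size
)

# Run the exported graph with onnxruntime.
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
batch = np.random.randn(4, 3, 224, 224).astype(np.float32)
logits = session.run(["output"], {"input": batch})[0]
print(logits.shape)  # (4, 1000) for the default 1000-class resnet50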