While recently experimenting with the end-to-end autonomous driving model UniAD, I found that running the tests with the default configuration in my own environment (where CUDA, PyTorch, and the other supporting software are newer than what the authors list on GitHub) always failed with TypeError: cannot pickle 'dict_keys' object:
File "./tools/test.py", line 261, in <module>main()File "./tools/test.py", line 231, in mainoutputs = custom_multi_gpu_test(model, data_loader, pdir,File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_testfor i, data in enumerate(data_loader):File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__return self._get_iterator()File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iteratorreturn _MultiProcessingDataLoaderIter(self)File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__w.start()File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in startself._popen = self._Popen(self)File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popenreturn __context().Process._Popen(process_obj)File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popenreturn Popen(process_obj)File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__super().__init__(process_obj)File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__self._launch(process_obj)File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launchreduction.dump(process_obj, fp)File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dumpForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'dict_keys' object
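The message is unambiguous about the "what", just not the "where": in Python 3, dict.keys() returns a dict_keys view object, and such views simply cannot be pickled. The hard part is finding which object in the process graph carries one. A minimal demonstration, independent of UniAD:

import pickle

d = {'car': 50, 'truck': 50}
pickle.dumps(list(d.keys()))  # fine: a plain list pickles
pickle.dumps(d.keys())        # TypeError: cannot pickle 'dict_keys' object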
I started to get a bit frantic. I knew that when Python multiprocessing passes objects between processes, it serializes and deserializes them with the pickle module, but I had no idea which piece of code was written in a way that left some object unpicklable. On the surface the error is triggered as the dataloader begins iterating, but the dataloader is PyTorch's standard DataLoader, so my guess was that it had to do with the dataset. Yet reading the NuScenesE2EDataset class in UniAD's projects/mmdet3d_plugin/datasets/nuscenes_e2e_dataset.py turned up nothing unusual; at least I saw no place that stored dict_keys data in the dataset. NuScenesE2EDataset inherits from NuScenesDataset in mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py, but mmdetection3d has been in use for so long that surely someone would have reported this online, and I found nothing. On top of that, NuScenesDataset calls into base code from the nuscenes package, so too many pieces were involved to see at a glance which code was putting dict_keys into the object being serialized. So I went back to carefully reading the code at the error site, namely this part of python3.8/multiprocessing/reduction.py:
class ForkingPickler(pickle.Pickler):
    '''Pickler subclass used by multiprocessing.'''
    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args):
        super().__init__(*args)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        '''Register a reduce function for a type.'''
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = pickle.loads

register = ForkingPickler.register

def dump(obj, file, protocol=None):
    '''Replacement for pickle.dump() using ForkingPickler.'''
    ForkingPickler(file, protocol).dump(obj)
The exception is raised at ForkingPickler(file, protocol).dump(obj) inside the dump() function, and the stack goes no deeper, so presumably the call descends from here into a library implemented in C. I wanted to read the implementation of pickle.Pickler next, but in python3.8/pickle.py, Pickler is a public, exported class with no definition in the file; there is only an internal _Pickler class:
__all__ = ["PickleError", "PicklingError", "UnpicklingError", "Pickler","Unpickler", "dump", "dumps", "load", "loads"]...# Pickling machineryclass _Pickler:def __init__(self, file, protocol=None, *, fix_imports=True,buffer_callback=None):"""This takes a binary file for writing a pickle data stream.
...
Searching more carefully, I found this snippet:
# Use the faster _pickle if possible
try:
    from _pickle import (
        PickleError,
        PicklingError,
        UnpicklingError,
        Pickler,
        Unpickler,
        dump,
        dumps,
        load,
        loads
    )
except ImportError:
    Pickler, Unpickler = _Pickler, _Unpickler
    dump, dumps, load, loads = _dump, _dumps, _load, _loads
Now it made sense: the Pickler class is imported from a shared library named _pickle (which turns out to be python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so), and when that import fails, the internal _Pickler class is used in its place. Per the comment, Pickler is the fast version implemented in C, while the internal, pure-Python _Pickler is the slow implementation.
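You can confirm which implementation is actually in use from an interactive session:

import pickle

print(pickle.Pickler.__module__)          # '_pickle': the C accelerated version
print(pickle._Pickler.__module__)         # 'pickle': the pure-Python fallback
print(pickle.Pickler is pickle._Pickler)  # False when the C version was imported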
The error message above doesn't say which object the dict_keys lives in; if we could print that object, it would give us a lead. But downloading the C source of Pickler, adding prints, compiling it, and replacing the stock python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so is not only a lot of work but also risks new version-mismatch errors. Then it occurred to me: pickle.py ships the pure-Python, slow _Pickler class, whose behavior should be correct, just slower. Swapping it in for the investigation would be ideal, because being pure Python it would also report the full exception stack. So I edited python3.8/multiprocessing/reduction.py, changing

class ForkingPickler(pickle.Pickler):
to
class ForkingPickler(pickle._Pickler):
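As a side note, editing a file inside the interpreter's stdlib is easy to forget to revert. An alternative that should achieve the same effect (an untested sketch of mine, not what I actually did) is to monkey-patch multiprocessing.reduction from the entry script before any worker is spawned, since reduction.dump() looks up ForkingPickler at call time:

import copyreg
import pickle
from multiprocessing import reduction

# Hedged sketch: route multiprocessing's pickling through the pure-Python
# pickler for debugging, without touching reduction.py on disk.
class DebugForkingPickler(pickle._Pickler):
    # Share the original class's reducer registry so reducers already
    # registered by torch (e.g. for tensors) keep working.
    _extra_reducers = reduction.ForkingPickler._extra_reducers

    def __init__(self, *args):
        super().__init__(*args)
        self.dispatch_table = copyreg.dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

# reduction.dump() resolves this module-level name at call time, so
# rebinding it makes process serialization use the slow pickler.
reduction.ForkingPickler = DebugForkingPickler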
With the swap in place, I ran again, and when the exception was thrown, the stack trace showed exactly where prints could usefully be added:
Traceback (most recent call last):
  File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 231, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_test
    for i, data in enumerate(data_loader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 61, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/conda/lib/python3.8/pickle.py", line 489, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
    save(state)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 903, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
    save(state)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.8/pickle.py", line 605, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.8/pickle.py", line 719, in save_reduce
    save(state)
  File "/opt/conda/lib/python3.8/pickle.py", line 562, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.8/pickle.py", line 973, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.8/pickle.py", line 999, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.8/pickle.py", line 580, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'dict_keys' object
Reading the code against this stack trace, and bearing in mind that objects can be nested inside other objects (a dict inside a dict, say), the save_dict() method of the _Pickler class looked like the right place for a print:
def save_dict(self, obj):
    if self.bin:
        self.write(EMPTY_DICT)
    else:   # proto 0 -- can't use EMPTY_DICT
        self.write(MARK + DICT)

    self.memoize(obj)
    strobj = str(obj)
    if "dict_keys" in strobj:
        print("@@@@@@@@@@@@@@@obj:", obj, "@@@@@@@#####\n")
    self._batch_setitems(obj.items())
I also added prints just before the failing statement in the save() method, to show the failing object itself plus the type of each neighboring object saved successfully before it (don't print every object unconditionally: the output would be enormous, reproducing the problem would slow to a crawl, and the concurrency might hit new problems):
# Check for a __reduce_ex__ method, fall back to __reduce__
reduce = getattr(obj, "__reduce_ex__", None)
print("!!!!!!######self.proto:", self.proto, type(obj), "!!!!%%")
if reduce is not None:
    if 'dict_keys' in str(type(obj)):  # Arnold
        print("obj is dict_keys, obj:", obj)
    rv = reduce(self.proto)
Then I ran the program until the exception was thrown, and it printed exactly the information I wanted!
@@@@@@@@@@@@@@@obj: {'class_range': {'car': 50, 'truck': 50, 'bus': 50, 'trailer': 50, 'construction_vehicle': 50, 'pedestrian': 40, 'motorcycle': 40, 'bicycle': 40, 'traffic_cone': 30, 'barrier': 30}, 'dist_fcn': 'center_distance', 'dist_ths': [0.5, 1.0, 2.0, 4.0], 'dist_th_tp': 2.0, 'min_recall': 0.1, 'min_precision': 0.1, 'max_boxes_per_sample': 500, 'mean_ap_weight': 5, 'class_names': dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])} @@@@@@@#####

...
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.compose.Compose'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.loading.LoadMultiViewImageFromFilesInCeph'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmcv.utils.config.ConfigDict'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.NormalizeMultiviewImage'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.dtype[float32]'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.PadMultiViewImage'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.loading.LoadAnnotations3D_E2E'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.occflow_label.GenerateOccFlowLabels'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmcv.utils.config.ConfigDict'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.dtype[int64]'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'numpy.ndarray'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.test_time_aug.MultiScaleFlipAug3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.compose.Compose'> !!!!%%
!!!!!!######self.proto: 4 <class 'mmdet3d.datasets.pipelines.formating.DefaultFormatBundle3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'projects.mmdet3d_plugin.datasets.pipelines.transform_3d.CustomCollect3D'> !!!!%%
!!!!!!######self.proto: 4 <class 'nuscenes.eval.detection.data_classes.DetectionConfig'> !!!!%%
!!!!!!######self.proto: 4 <class 'dict_keys'> !!!!%%
obj is dict_keys, obj: dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])
From the output above it is clear that the data that cannot be serialized, and hence raises the exception, is:
'class_names': dict_keys(['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier'])
It is part of a dict. We don't yet know who owns that dict, but class_names and the other entries in it (class_range, dist_fcn, and so on) are good strings to grep for. Searching the UniAD and nuscenes code shows these values are produced by the nuscenes.eval.detection.data_classes.DetectionConfig class, in python3.8/site-packages/nuscenes/eval/detection/data_classes.py:
class DetectionConfig:
    """ Data class that specifies the detection evaluation settings. """

    def __init__(self,
                 class_range: Dict[str, int],
                 dist_fcn: str,
                 dist_ths: List[float],
                 dist_th_tp: float,
                 min_recall: float,
                 min_precision: float,
                 max_boxes_per_sample: int,
                 mean_ap_weight: int):

        assert set(class_range.keys()) == set(DETECTION_NAMES), "Class count mismatch."  ### DETECTION_NAMES
        assert dist_th_tp in dist_ths, "dist_th_tp must be in set of dist_ths."

        self.class_range = class_range
        self.dist_fcn = dist_fcn
        self.dist_ths = dist_ths
        self.dist_th_tp = dist_th_tp
        self.min_recall = min_recall
        self.min_precision = min_precision
        self.max_boxes_per_sample = max_boxes_per_sample
        self.mean_ap_weight = mean_ap_weight

        self.class_names = self.class_range.keys()
    ...
The class_range values are defined in python3.8/site-packages/nuscenes/eval/detection/configs/detection_cvpr_2019.json:
"class_range": {"car": 50,"truck": 50,"bus": 50,"trailer": 50,"construction_vehicle": 50,"pedestrian": 40,"motorcycle": 40,"bicycle": 40,"traffic_cone": 30,"barrier": 30},
and are passed to the constructor when the DetectionConfig instance is created, while class_names is derived from the keys of class_range:
self.class_names = self.class_range.keys()
This line is the root cause! In Python 3, dict.keys() returns a dict_keys view rather than a list, which is what makes class_names unserializable by the Pickler. Changing the line to:
self.class_names = list(self.class_range.keys())
makes the program run without the error.
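Note that an edit under site-packages is lost every time nuscenes-devkit is reinstalled or upgraded. An alternative (a sketch of my own, using the attribute names found above) is to normalize the attribute at runtime, after the dataset is built but before the DataLoader spawns workers:

# Hedged sketch: fix up the unpicklable attribute on the live instance
# instead of patching nuscenes-devkit on disk. 'dataset' stands for whatever
# NuScenesE2EDataset / NuScenesDataset instance your config builds.
eval_cfg = dataset.eval_detection_configs
if not isinstance(eval_cfg.class_names, list):
    eval_cfg.class_names = list(eval_cfg.class_names)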
Looking further at who creates the DetectionConfig instance, it really is tied to the Dataset: the constructor of NuScenesDataset in mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py reads:
class NuScenesDataset(Custom3DDataset):
    ...
    def __init__(self,
                 ann_file,
                 pipeline=None,
                 data_root=None,
                 classes=None,
                 load_interval=1,
                 with_velocity=True,
                 modality=None,
                 box_type_3d='LiDAR',
                 filter_empty_gt=True,
                 test_mode=False,
                 eval_version='detection_cvpr_2019',
                 use_valid_flag=False):
        self.load_interval = load_interval
        self.use_valid_flag = use_valid_flag
        super().__init__(
            data_root=data_root,
            ann_file=ann_file,
            pipeline=pipeline,
            classes=classes,
            modality=modality,
            box_type_3d=box_type_3d,
            filter_empty_gt=filter_empty_gt,
            test_mode=test_mode)

        self.with_velocity = with_velocity
        self.eval_version = eval_version
        from nuscenes.eval.detection.config import config_factory
        self.eval_detection_configs = config_factory(self.eval_version)
The last statement, self.eval_detection_configs = config_factory(self.eval_version), reads the python3.8/site-packages/nuscenes/eval/detection/configs/detection_cvpr_2019.json config file and creates the DetectionConfig instance, which is where class_range and the other settings, including the class_names value, come from.
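That also means the problem can be reproduced without UniAD or mmdetection3d at all, straight from the devkit (at least with the nuscenes-devkit version I had installed; later releases may behave differently):

import pickle
from nuscenes.eval.detection.config import config_factory

cfg = config_factory('detection_cvpr_2019')  # builds the DetectionConfig
print(type(cfg.class_names))                 # <class 'dict_keys'>
pickle.dumps(cfg)                            # TypeError: cannot pickle 'dict_keys' object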
UniAD's NuScenesE2EDataset inherits from NuScenesDataset, and that is how the eval_detection_configs data in its instances is produced: the class_names inside it is obtained from dict.keys() by default and never converted to a type the Pickler supports, which is what causes TypeError: cannot pickle 'dict_keys' object.
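In hindsight, the hunt can be automated. A rough helper (my own sketch, not part of UniAD) recursively pickles an object's members and reports the path of each leaf that fails; running it on the dataset before handing it to a DataLoader with num_workers > 0 would have pointed straight at eval_detection_configs.class_names:

import pickle

def find_unpicklable(obj, path="root", seen=None):
    """Print the attribute paths of members of obj that fail to pickle."""
    seen = set() if seen is None else seen
    if id(obj) in seen:                 # guard against reference cycles
        return
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return                          # this whole subtree pickles fine
    except Exception as exc:
        err = exc
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, (list, tuple, set)):
        children = enumerate(obj)
    elif hasattr(obj, "__dict__"):
        children = vars(obj).items()
    else:
        print(f"{path}: {type(obj)} -> {err}")   # unpicklable leaf: a culprit
        return
    for key, child in children:
        find_unpicklable(child, f"{path}.{key}", seen)

# e.g. find_unpicklable(dataset) should report something like:
#   root.eval_detection_configs.class_names: <class 'dict_keys'> -> cannot pickle 'dict_keys' object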
Incidentally, if we add prints at this spot in python3.8/multiprocessing/popen_spawn_posix.py, we can see that the error is tied to serializing the process object: every call to reduction.dump(prep_data, fp) succeeds, and it is reduction.dump(process_obj, fp) that throws the exception above:
class Popen(popen_fork.Popen):
    method = 'spawn'
    DupFd = _DupFd

    def __init__(self, process_obj):
        self._fds = []
        super().__init__(process_obj)

    def duplicate_for_child(self, fd):
        self._fds.append(fd)
        return fd

    def _launch(self, process_obj):
        from . import resource_tracker
        tracker_fd = resource_tracker.getfd()
        self._fds.append(tracker_fd)
        prep_data = spawn.get_preparation_data(process_obj._name)
        fp = io.BytesIO()
        set_spawning_popen(self)
        try:
            reduction.dump(prep_data, fp)
            reduction.dump(process_obj, fp)
        finally:
            set_spawning_popen(None)
Finally, others on the web have run into this kind of error. Their cases are dataset-related too but differ somewhat from mine, e.g., an opened lmdb file that cannot be serialized, or the dataset's own data keys() yielding dict_keys. For reference if you hit something similar:
DataLoader Multiprocessing error: can't pickle odict_keys objects when num_workers > 0 - PyTorch Forums
TypeError: can't pickle Environment objects when num_workers > 0 for LSUN · Issue #689 · pytorch/vision · GitHub