Summary of issues when Mellanox NICs use the DPDK driver in k8s


This article summarizes the problems you may run into when a Mellanox NIC uses a DPDK driver in a k8s environment, and how to solve them.

1. The /sys directory must not be mounted into the pod
For NICs from other vendors, such as the Intel X710, the /sys directory must be mounted into the pod when using a DPDK driver in k8s, because DPDK reads files under that directory during startup. Mellanox NICs are special: even when driven by DPDK, the device must stay bound to the kernel driver mlx5_core (the bifurcated driver model).
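
As a rough illustration of why /sys matters for those other PMDs, a PCI scan walks /sys/bus/pci/devices and reads per-device attribute files. The standalone C sketch below is not DPDK code, only an assumption-level illustration of that kind of sysfs access, and can be run inside a pod to see which PCI devices are visible:

/* Sketch: list /sys/bus/pci/devices and read each device's "vendor"
 * attribute, roughly the kind of sysfs access a PCI scan depends on. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/sys/bus/pci/devices");
    struct dirent *e;

    if (dir == NULL) {
        perror("/sys/bus/pci/devices");
        return 1;
    }
    while ((e = readdir(dir)) != NULL) {
        char path[512], vendor[16];
        FILE *f;

        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/vendor", e->d_name);
        f = fopen(path, "r");
        if (f != NULL && fgets(vendor, sizeof(vendor), f) != NULL)
            printf("%s vendor=%s", e->d_name, vendor); /* vendor ends with '\n' */
        if (f != NULL)
            fclose(f);
    }
    closedir(dir);
    return 0;
}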

If the /sys directory is mounted into the pod, the following error is reported:

net_mlx5: port 0 cannot get MAC address, is mlx5_en loaded? (errno: No such file or directory)
net_mlx5: probe of PCI device 0000:00:09.0 aborted after encountering an error: No such device
EAL: Requested device 0000:00:09.0 cannot be used

The reason is that the host's /sys/ overlays the pod's /sys/ content. The mlx NIC reads directories such as /sys/devices/pci0000:00/0000:00:09.0/net/, and once they are overlaid the lookup fails. Let's look at the code.

mlx5_pci_probe
    mlx5_dev_spawn
        /* Configure the first MAC address by default. */
        if (mlx5_get_mac(eth_dev, &mac.addr_bytes)) {
            DRV_LOG(ERR,
                "port %u cannot get MAC address, is mlx5_en"
                " loaded? (errno: %s)",
                eth_dev->data->port_id, strerror(rte_errno));
            err = ENODEV;
            goto error;
        }
        /* If the host's /sys/ overlays the pod's /sys/, this check fails
         * (exactly which line fails is still to be investigated). */
int
mlx5_get_mac(struct rte_eth_dev *dev, uint8_t (*mac)[ETHER_ADDR_LEN])
{
    struct ifreq request;
    int ret;

    ret = mlx5_ifreq(dev, SIOCGIFHWADDR, &request);
        /* mlx5_ifreq() roughly does:
         *   int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
         *   mlx5_get_ifname(dev, &ifr->ifr_name);
         *   ioctl(sock, req, ifr);
         */
    if (ret)
        return ret;
    memcpy(mac, request.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
    return 0;
}
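
To confirm what the pod actually sees, a small check like the one below can be run inside it. This is only a sketch, assuming the PCI address 0000:00:09.0 from the error log above; if the net/ directory cannot be opened, the MAC lookup via mlx5_get_ifname() has nothing to work with:

/* Sketch: check whether the pod can see the NIC's net/ sysfs directory
 * (PCI address 0000:00:09.0 is taken from the error log above). */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/pci0000:00/0000:00:09.0/net";
    DIR *dir = opendir(path);
    struct dirent *e;

    if (dir == NULL) {
        /* This is the situation behind "cannot get MAC address". */
        perror(path);
        return 1;
    }
    while ((e = readdir(dir)) != NULL)
        if (e->d_name[0] != '.')
            printf("netdev: %s\n", e->d_name);
    closedir(dir);
    return 0;
}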

2. DPDK cannot find the Mellanox NIC at startup
Sometimes the following error appears: the Mellanox NIC cannot be found, and the message asks whether the kernel driver is loaded. Outside a k8s environment this hint is useful, since the NIC may indeed not be bound to mlx5_core.

EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 15b3:101a net_mlx5
net_mlx5: no Verbs device matches PCI device 0000:00:06.0, are kernel drivers loaded?
EAL: Requested device 0000:00:06.0 cannot be used

In a k8s environment, however, there is another cause. The symptom: if the pod is started with privileged permissions, DPDK recognizes the NIC and starts successfully; without privileged permissions, the error above is reported.
Let's first see why this error is printed. The relevant DPDK code is:

mlx5_pci_probe
    unsigned int n = 0;
    /* Call ibv_get_device_list() from libibverbs to get the ibv devices. */
    ibv_list = mlx5_glue->get_device_list(&ret);
    while (ret-- > 0) {
        ibv_match[n++] = ibv_list[ret];
    }
    /* The error is reported here, which means n is 0; n is assigned while
     * iterating over ret above, so ret must be 0 as well. */
    if (!n) {
        DRV_LOG(WARNING,
            "no Verbs device matches PCI device " PCI_PRI_FMT ","
            " are kernel drivers loaded?",
            pci_dev->addr.domain, pci_dev->addr.bus,
            pci_dev->addr.devid, pci_dev->addr.function);
        rte_errno = ENOENT;
        ret = -rte_errno;
    }

To understand why ret is 0 we have to look at the libibverbs source. It is shipped in the OFED package, at the following path: MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu16.04-x86_64/src/MLNX_OFED_SRC-5.2-2.2.0.0/SOURCES/rdma-core-52mlnx1/libibverbs

LATEST_SYMVER_FUNC(ibv_get_device_list, 1_1, "IBVERBS_1.1",
                   struct ibv_device **,
                   int *num)
    ibverbs_get_device_list(&device_list);
        find_sysfs_devs(&sysfs_list);
            setup_sysfs_dev
                try_access_device(sysfs_dev)
                    struct stat cdev_stat;
                    char *devpath;
                    int ret;

                    /* Check whether the file exists: /dev/infiniband/uverbs0 */
                    if (asprintf(&devpath, RDMA_CDEV_DIR"/%s",
                                 sysfs_dev->sysfs_name) < 0)
                        return ENOMEM;
                    ret = stat(devpath, &cdev_stat);
                    free(devpath);
                    return ret;

The code above shows that ibverbs checks whether /dev/infiniband/uverbs0 exists (there is one uverbs device per NIC); if it does not, it concludes that no NIC was found.
If the pod has privileged permissions, it can read /dev/infiniband/uverbs0 and the related files; without privileged permissions, the /dev/infiniband directory does not exist inside the pod at all.

# Without privileged mode:
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /dev
core  fd  full  mqueue  null  ptmx  pts  random  shm  stderr  stdin  stdout  termination-log  tty  urandom  zero

# With privileged mode, infiniband is visible:
root@pod-dpdk:~/# ls /dev/
autofs           infiniband        mqueue              sda2             tty12  tty28  tty43  tty59   ttyS16  ttyS31     vcs4   vfio
...
root@pod-dpdk:~# ls /dev/infiniband/
uverbs0

So the key is whether /dev/infiniband/uverbs0 exists inside the pod: if the /dev/infiniband files are missing, the error above is reported.
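
The check that try_access_device() performs can be reproduced with a few lines of C and run inside the pod. This is only a sketch; uverbs0 is simply the first instance, and the index depends on which device was allocated to the pod:

/* Sketch: stat the uverbs character device, mirroring try_access_device(). */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *devpath = "/dev/infiniband/uverbs0";
    struct stat cdev_stat;

    if (stat(devpath, &cdev_stat) != 0) {
        /* Same condition that makes ibv_get_device_list() return an empty
         * list and mlx5 print "no Verbs device matches ...". */
        perror(devpath);
        return 1;
    }
    printf("%s is visible inside the pod\n", devpath);
    return 0;
}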

How do these files get into the pod? There are two ways:
a. Give the pod privileged permissions. In this case /dev/ does not need to be mounted into the pod, because a privileged pod can access these host files directly.
b. Use the k8s sriov-network-device-plugin (mind the version: only newer releases avoid this problem). For mlx NICs it passes the required devices to the container via docker's --device option. In the environment below, for example, /dev/infiniband/uverbs2 and the other files are mounted into the container:

root# docker inspect 1dfe96c8eff4
[{
    "Id": "1dfe96c8eff4c8ede0d8eb4e480fec9f002f68c4da1bb5265580ee968c6d7502",
    "Created": "2021-04-12T03:24:22.598030845Z",
    ...
    "HostConfig": {
        ...
        "CapAdd": ["NET_RAW", "NET_ADMIN", "IPC_LOCK"],
        "Privileged": false,
        "Devices": [
            {"PathOnHost": "/dev/infiniband/ucm2",    "PathInContainer": "/dev/infiniband/ucm2",    "CgroupPermissions": "rwm"},
            {"PathOnHost": "/dev/infiniband/issm2",   "PathInContainer": "/dev/infiniband/issm2",   "CgroupPermissions": "rwm"},
            {"PathOnHost": "/dev/infiniband/umad2",   "PathInContainer": "/dev/infiniband/umad2",   "CgroupPermissions": "rwm"},
            {"PathOnHost": "/dev/infiniband/uverbs2", "PathInContainer": "/dev/infiniband/uverbs2", "CgroupPermissions": "rwm"},
            {"PathOnHost": "/dev/infiniband/rdma_cm", "PathInContainer": "/dev/infiniband/rdma_cm", "CgroupPermissions": "rwm"}
        ],
        ...
    },

The sriov-network-device-plugin code that passes the host device paths into the pod:

// NewRdmaSpec returns the RdmaSpec
func NewRdmaSpec(pciAddrs string) types.RdmaSpec {
    deviceSpec := make([]*pluginapi.DeviceSpec, 0)
    isSupportRdma := false
    rdmaResources := rdmamap.GetRdmaDevicesForPcidev(pciAddrs)
    if len(rdmaResources) > 0 {
        isSupportRdma = true
        for _, res := range rdmaResources {
            resRdmaDevices := rdmamap.GetRdmaCharDevices(res)
            for _, rdmaDevice := range resRdmaDevices {
                deviceSpec = append(deviceSpec, &pluginapi.DeviceSpec{
                    HostPath:      rdmaDevice,
                    ContainerPath: rdmaDevice,
                    Permissions:   "rwm",
                })
            }
        }
    }
    return &rdmaSpec{isSupportRdma: isSupportRdma, deviceSpec: deviceSpec}
}

The sriov plugin log, showing that it passes several files under /dev/infiniband to the pod:

###/var/log/sriovdp/sriovdp.INFO
I0412 03:17:57.120886    :123] AllocateResponse send: &AllocateResponse{ContainerResponses:[]
*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_INTEL_COM_DP_SRIOV_MLX5: 0000:00:0a.0,},
Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/ucm2,HostPath:/dev/infiniband/ucm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/issm2,HostPath:/dev/infiniband/issm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/umad2,HostPath:/dev/infiniband/umad2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/uverbs2,HostPath:/dev/infiniband/uverbs2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rwm,},},
Annotations:map[string]string{},},},}

3. DPDK fails to start without privileges
For security reasons pods are not given privileges, but then starting DPDK inside the pod fails with the following error:

root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4  -w 00:09.0 -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing 
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
error allocating rte services array

Add --iova-mode=va to the DPDK startup parameters; privileges are then no longer required:

root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4  -w 00:09.0 --iova-mode=va -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing 
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:09.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 15b3:1016 net_mlx5
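
If you control the application code, the same flag can also be passed to the EAL programmatically instead of on the command line. The sketch below builds an argument list equivalent to the command above; the argument values are taken from the examples in this article, not mandatory:

/* Sketch: pass --iova-mode=va to the EAL from application code. */
#include <stdio.h>
#include <rte_eal.h>

int main(void)
{
    char *eal_argv[] = {
        "l2fwd",           /* program name */
        "-c", "0xf",       /* same core mask as -cf above */
        "-n", "4",
        "-w", "00:09.0",   /* the Mellanox device used in the examples */
        "--iova-mode=va",  /* use virtual addresses, no privilege required */
    };
    int eal_argc = sizeof(eal_argv) / sizeof(eal_argv[0]);

    if (rte_eal_init(eal_argc, eal_argv) < 0) {
        fprintf(stderr, "rte_eal_init failed\n");
        return 1;
    }
    /* ... the rest of the application (ports, queues, main loop) ... */
    return 0;
}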

4. Reading the NIC statistics counters fails

DPDK provides two functions for reading NIC counters, rte_eth_stats_get and rte_eth_xstats_get. The former returns a fixed set of counters, such as received/transmitted packet and byte counts; the latter returns extended counters, which are specific to each NIC type.
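
A short sketch of how the two APIs are typically called from an application (port 0 is assumed; EAL and port initialization are omitted):

/* Sketch: read basic and extended counters for one port. */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <rte_ethdev.h>

static void dump_port_stats(uint16_t port_id)
{
    struct rte_eth_stats stats;
    struct rte_eth_xstat *xstats;
    struct rte_eth_xstat_name *names;
    int n, i;

    /* Fixed counters: packet/byte/error counts. */
    if (rte_eth_stats_get(port_id, &stats) == 0)
        printf("ipackets=%" PRIu64 " opackets=%" PRIu64 "\n",
               stats.ipackets, stats.opackets);

    /* Extended, per-driver counters: query the count first. */
    n = rte_eth_xstats_get(port_id, NULL, 0);
    if (n <= 0)
        return;
    xstats = calloc(n, sizeof(*xstats));
    names = calloc(n, sizeof(*names));
    if (xstats != NULL && names != NULL &&
        rte_eth_xstats_get(port_id, xstats, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (i = 0; i < n; i++)
            printf("%s = %" PRIu64 "\n", names[i].name, xstats[i].value);
    }
    free(xstats);
    free(names);
}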

For Mellanox NICs, the DPDK driver mlx5 implements the corresponding callbacks:

rte_eth_stats_get -> stats_get -> mlx5_stats_get 
rte_eth_xstats_get -> xstats_get -> mlx5_xstats_get

mlx5_stats_get reads its counters inside the pod without issues, but mlx5_xstats_get runs into a problem.
mlx5_xstats_get -> mlx5_read_dev_counters reads the counters from files under the following path (a short sketch of such a read follows the listing):

root# ls /sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/hw_counters/
duplicate_request      out_of_buffer    req_cqe_flush_error         resp_cqe_flush_error       rx_atomic_requests
implied_nak_seq_err    out_of_sequence  req_remote_access_errors    resp_local_length_error    rx_read_requests
lifespan               packet_seq_err   req_remote_invalid_request  resp_remote_access_errors  rx_write_requests
local_ack_timeout_err  req_cqe_error    resp_cqe_error              rnr_nak_retry_err
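
Each of these counters is a small text file, and a read like the one below is essentially what mlx5_read_dev_counters() relies on. The path and counter name are taken from the listing above, only as an example:

/* Sketch: read one extended counter from the hw_counters sysfs directory. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/pci0000:00/0000:00:09.0/infiniband/"
                       "mlx5_0/ports/1/hw_counters/out_of_buffer";
    unsigned long long value;
    FILE *f = fopen(path, "r");

    if (f == NULL) {
        /* Inside the pod hw_counters/ is missing, so this is where
         * mlx5_xstats_get() runs into trouble. */
        perror(path);
        return 1;
    }
    if (fscanf(f, "%llu", &value) == 1)
        printf("out_of_buffer = %llu\n", value);
    fclose(f);
    return 0;
}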

Inside the pod, however, the hw_counters directory is missing under the same path: it is created when the kernel driver is loaded and is not visible from the pod.

root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/
cap_mask  gid_attrs  gids  lid  lid_mask_count  link_layer  phys_state  pkeys  rate  sm_lid  sm_sl  state

Possible solutions:
a. Manually mount /sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/hw_counters/ into the pod.
b. Modify the sriov-network-device-plugin code to mount the directory above automatically.

Test manifest:

root# cat dpdk-mlx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-dpdk
  annotations:
    k8s.v1.cni.cncf.io/networks: host-device1
spec:
  nodeName: node1
  containers:
  - name: appcntr3
    image: l2fwd:v3
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    securityContext:
      privileged: true
    resources:
      requests:
        memory: 100Mi
        hugepages-2Mi: 500Mi
        cpu: '3'
      limits:
        hugepages-2Mi: 500Mi
        cpu: '3'
        memory: 100Mi
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepage
      readOnly: False
    - mountPath: /var/run
      name: var
      readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: var
    hostPath:
      path: /var/run/

References

.08/linux_gsg/linux_drivers.html?highlight=bifurcation#bifurcated-driver

See also: k8s mellanox网卡使用dpdk驱动问题总结 - 简书 (jianshu)
