【GPU】Nvidia CUDA 编程高级教程——利用蒙特卡罗法求解的近似值

阅读：评论：0

博主未授权任何人或组织机构转载博主任何原创文章，感谢各位对原创的支持！
博主链接

本人就职于国际知名终端厂商，负责modem芯片研发。
在5G早期负责终端数据业务层、核心网相关的开发工作，目前牵头6G算力网络技术标准研究。

博客内容主要围绕：
       5G/6G协议讲解
       算力网络讲解（云计算，边缘计算，端计算）
       高级C语言讲解
       Rust语言讲解

利用蒙特卡罗法求解 𝜋 的近似值

算法简介

估算 𝜋 有一个著名的技巧，那就是在单位面积内随机选择大量点，并计算落在单位圆内的点数。因为正方形的面积是 1，圆的面积是 𝜋/4 ，所以落在圆上的点的点数乘以 4，就是一个 𝜋 的良好近似值。

高度可并行

从并行编程的角度来看，该算法的良好特征之一是每个随机点都可以独立计算。我们只需要知道一个点的坐标，即可评估其是否落在圆内，对于点坐标 (𝑥,𝑦) 而言，如果 𝑥 2 + 𝑦 2 < = 1 𝑥^{2}+𝑦^{2}<=1 x2+y2<=1，那么点落在圆内，只要我们能处理好与计数器相关的任何竞态条件，表示圆内点数的计数器就可以递增。

单一 GPU 实现

我们来看看在单 GPU 上的 CUDA 中的实现情况。我们已提供实现情况的示例，如下所示：

#include <iostream>
#include <curand_kernel.h>#define N 1024*1024__global__ void calculate_pi(int* hits) {int idx = threadIdx.x + blockIdx.x * blockDim.x;// 初始化随机数状态（网格中的每个线程不得重复）int seed = 0;int offset = 0;curandState_t curand_state;curand_init(seed, idx, offset, &curand_state);// 在 (0.0, 1.0] 内生成随机坐标float x = curand_uniform(&curand_state);float y = curand_uniform(&curand_state);// 如果这一点在圈内，增加点击计数器if (x * x + y * y <= 1.0f) {atomicAdd(hits, 1);}
}int main(int argc, char** argv) {// 分配主机和设备值int* hits;hits = (int*) malloc(sizeof(int));int* d_hits;cudaMalloc((void**) &d_hits, sizeof(int));// 初始化点击次数并复制到设备*hits = 0;cudaMemcpy(d_hits, hits, sizeof(int), cudaMemcpyHostToDevice);// 启动核函数进行计算int threads_per_block = 256;int blocks = (N + threads_per_block - 1) / threads_per_block;calculate_pi<<<blocks, threads_per_block>>>(d_hits);cudaDeviceSynchronize();// 将最终结果复制回主机cudaMemcpy(hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);// 计算 pi 的最终值float pi_est = (float) *hits / (float) (N) * 4.0f;// 打印结果std::cout << "Estimated value of pi = " << pi_est << std::endl;std::cout << "Error = " << std::abs((M_PI - pi_est) / pi_est) << std::endl;// 清理free(hits);cudaFree(d_hits);return 0;
}

请注意，此代码仅用于指导目的，并不代表具有特别高的性能。具体原因如下：

我们将使用设备侧 API（属于cuRAND），直接在核函数中生成随机数。即使您不熟悉 cuRAND 也无妨，只需知道每个 CUDA 线程都有各自唯一的随机数即可。
我们让每个线程只计算一个值，所以计算强度很低。
在更改hits（“命中”）计数器时，我们将遇到许多原子操作的冲突。

即便如此，我们仍可用 100 万个样本点快速估算 𝜋 。与正确值相比，我们的计算误差应该仅约为 0.05%。

运行结果如下：

Estimated value of pi = 3.14319
Error = 0.000507708
CPU times: user 51.4 ms, sys: 16.5 ms, total: 67.9 ms
Wall time: 3.17 s

扩展到多个 GPU

有一个简单的方法可以将我们的示例扩展到多个 GPU，那就是使用管理多个 GPU 的单一主机进程。如果我们利用 M 个 GPU 对 N 个采样点进行计算，则可以将N/M采样点分配给每个 GPU，原则上可以M 倍地加快计算。

为了实施这一方法，我们要：

使用cudaGetDeviceCount确定可用 GPU 的数量。
以GPU数量为循环次数，在每次循环中使用cudaSetDevice指定执行代码的是哪个GPU。

在指定的 GPU 上执行分配给它的那部分工作。

int device_count;
cudaGetDeviceCount(&device_count);for (int i = 0; i < device_count; ++i) {cudaSetDevice(i);# Do single GPU worth of work.
}

代码实现

请注意，在此示例中，我们会给每个 GPU 一个不同的随机数生成器种子，以便每个 GPU 进行不同的工作。因此，我们的答案会有所不同。

#include <iostream>
#include <curand_kernel.h>#define N 1024*1024__global__ void calculate_pi(int* hits, int device) {int idx = threadIdx.x + blockIdx.x * blockDim.x;// 初始化随机数状态（网格中的每个线程不得重复）int seed = device;int offset = 0;curandState_t curand_state;curand_init(seed, idx, offset, &curand_state);// 在 (0.0, 1.0] 内生成随机坐标float x = curand_uniform(&curand_state);float y = curand_uniform(&curand_state);// 如果这一点在圈内，增加点击计数器if (x * x + y * y <= 1.0f) {atomicAdd(hits, 1);}
}int main(int argc, char** argv) {// 确定 GPU 数量int device_count;cudaGetDeviceCount(&device_count);std::cout << "Using " << device_count << " GPUs" << std::endl;// 分配主机和设备值（每个 GPU 一个）int** hits = (int**) malloc(device_count * sizeof(int*));for (int i = 0; i < device_count; ++i) {hits[i] = (int*) malloc(sizeof(int));}int** d_hits = (int**) malloc(device_count * sizeof(int*));for (int i = 0; i < device_count; ++i) {cudaSetDevice(i);cudaMalloc((void**) &d_hits[i], sizeof(int));}// 初始化点击次数并复制到设备for (int i = 0; i < device_count; ++i) {*hits[i] = 0;cudaSetDevice(i);cudaMemcpy(d_hits[i], hits[i], sizeof(int), cudaMemcpyHostToDevice);}// 启动核函数进行计算int threads_per_block = 256;int blocks = (N / device_count + threads_per_block - 1) / threads_per_block;// 先启动所有核函数，以支持异步执行// 然后在所有设备上同步。for (int i = 0; i < device_count; ++i) {cudaSetDevice(i);calculate_pi<<<blocks, threads_per_block>>>(d_hits[i], i);}for (int i = 0; i < device_count; ++i) {cudaSetDevice(i);cudaDeviceSynchronize();}// 将最终结果复制回主机for (int i = 0; i < device_count; ++i) {cudaSetDevice(i);cudaMemcpy(hits[i], d_hits[i], sizeof(int), cudaMemcpyDeviceToHost);}// 计算所有设备的点击总数int hits_total = 0;for (int i = 0; i < device_count; ++i) {hits_total += *hits[i];}// 计算 pi 的最终值float pi_est = (float) hits_total / (float) (N) * 4.0f;// 打印结果std::cout << "Estimated value of pi = " << pi_est << std::endl;std::cout << "Error = " << std::abs((M_PI - pi_est) / pi_est) << std::endl;// 清理for (int i = 0; i < device_count; ++i) {free(hits[i]);cudaFree(d_hits[i]);}free(hits);free(d_hits);return 0;
}

运行结果

我这边使用了4个GPU，根据测试环境不同，我们显示的结果也是不同的。

Using 4 GPUs
Estimated value of pi = 3.14072
Error = 0.000277734
CPU times: user 27.2 ms, sys: 16.1 ms, total: 43.3 ms
Wall time: 2.46 s

本文发布于:2024-01-28 09:35:43，感谢您对本站的认可！

本文链接：https://www.4u4v.net/it/17064057476486.html

上一篇：通过修改SPECTRE网表进行后仿

下一篇：【python库mpi4py的蒙特卡洛法求定积分的并行与串行实现】

标签：近似值蒙特卡罗高级教程

留言与评论（共有 0 条评论）

【GPU】Nvidia CUDA 编程高级教程——利用蒙特卡罗法求解 的近似值