Made a mistake: calling the CUDA API from inside a kernel
http://stackoverflow.com/questions/16780258/running-fftw-on-gpu-vs-using-cufft
Running FFTW on GPU vs using CUFFT
I have a basic C++ FFTW implementation that looks like this:
for (int i = 0; i < N; i++){
    // declare pointers and plan
    fftw_complex *in, *out;
    fftw_plan p;

    // allocate
    in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);

    // initialize "in"
    ...

    // create plan
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    // execute plan
    fftw_execute(p);

    // clean up
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
}
I'm doing N FFTs in a for loop. I know I can execute many plans at once with FFTW, but in my implementation, in and out are different on every iteration. The point is that the entire FFTW pipeline runs INSIDE the for loop.
I want to transition to CUDA to speed this up. I understand that CUDA has its own FFT library, CUFFT. The syntax is very similar; from the online documentation:
#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);
However, each of these "kernels" (as Nvidia calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:
#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);
I understand how this can speed up my code by running each FFT step on a GPU. But what if I want to parallelize my entire for loop? What if I want each of my original N iterations to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?
1 Answer
You cannot call FFTW methods from device code. The FFTW libraries are compiled x86 code and will not run on the GPU.
If the "heavy lifting" in your code is in the FFT operations, and the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine. Once the machine is fully utilized, there is generally no additional benefit to trying to run more things in parallel.
cuFFT routines can be called by multiple host threads, so it is possible to make multiple calls into cuFFT for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are already large enough to utilize the machine. A sketch of the pattern follows.
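This is a hedged sketch with illustrative names and sizes (worker, N, numThreads are not from the original post): each host thread owns its own plan and device buffer, since a single plan should not be executed concurrently by several threads.

#include <cufft.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Illustrative worker: each thread creates its own plan and buffer,
// so no cuFFT state is shared between threads.
void worker(int N)
{
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * N);
    // ... copy this thread's input signal into d_data ...

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize(); // wait for this transform to finish

    cufftDestroy(plan);
    cudaFree(d_data);
}

int main()
{
    const int N = 1024, numThreads = 4; // illustrative sizes
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; t++)
        pool.emplace_back(worker, N);
    for (auto &th : pool)
        th.join();
    return 0;
}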
cuFFT also supports batched plans, which are another way to execute multiple transforms "at once", as sketched below.
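A minimal sketch of a batched plan using cufftPlanMany, assuming the input signals are stored back to back in a single device buffer (the sizes N and BATCH are illustrative): one plan executes every transform in a single call, which maps directly onto the question's for loop.

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 1024, BATCH = 100; // illustrative sizes
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * N * BATCH);
    // ... copy all BATCH input signals, stored back to back, into d_data ...

    // One plan that performs BATCH independent 1D C2C transforms of length N.
    int n[1] = { N };
    cufftHandle plan;
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, N,  // input layout: contiguous, stride 1, distance N
                  NULL, 1, N,  // output layout: same
                  CUFFT_C2C, BATCH);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // all BATCH FFTs in one call

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}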