CUDAメモ
cudaMemcpy(dst, src, count, cudaMemcpyDeviceToDevice)は、ブロッキングではない。そのため必要に応じて、ホスト側でcudaThreadSynchronize()を呼ぶ。
CUDAで倍精度浮動小数点数を使った場合に、どんな実行コードが生成されるのか確認する。 まずは確認のためのテストプログラム。
- #include <stdio.h>
-
- __global__ void mulArray(double* inputOnGPU, double* outputOnGPU)
- {
- int i = blockDim.x * blockIdx.x + threadIdx.x;
-
- outputOnGPU[i] = 5.0 * inputOnGPU[i];
- }
-
- #define BLOCK_DIM 8
- #define THREAD_DIM 32
- #define N (BLOCK_DIM * THREAD_DIM)
-
- int main()
- {
- double input[N], output[N];
- double* inputOnGPU;
- double* outputOnGPU;
-
- cudaMalloc((void**)&inputOnGPU, sizeof(double) * N);
- cudaMalloc((void**)&outputOnGPU, sizeof(double) * N);
-
- for (int i = 0; i < N; ++i) {
- input[i] = (double)i;
- }
-
- cudaMemcpy(inputOnGPU, input, sizeof(double) * N, cudaMemcpyHostToDevice);
-
- mulArray<<<BLOCK_DIM, THREAD_DIM>>>(inputOnGPU, outputOnGPU);
-
- cudaMemcpy(output, outputOnGPU, sizeof(double) * N, cudaMemcpyDeviceToHost);
-
- for (int i = 0; i < N; ++i) {
- }
- }
入力を5倍する単純なプログラム。まずは中間言語PTXを確認する。コマンドで以下を入力(環境は CUDA 2.3で32ビット)。
nvcc -arch sm_13 -O3 -ptx test.cu
結果を適当に抜粋。
ld.global.f64 %fd1, [%r6+0]; mov.f64 %fd2, 0d4014000000000000; // 5 mul.f64 %fd3, %fd1, %fd2; st.global.f64 [%r8+0], %fd3;
%r6には入力のアドレス、%r8には出力のアドレスが入っている。まあ妥当なコードか。次に実行コードCUBINを確認する。次のコマンドを打って、さらにdecudaを使って逆アセンブルする。
nvcc -arch sm_13 -O3 --cubin test.cu
結果を抜粋。
000028: 10008011 00000003 mov.b32 $r4, 0x00000000 000030: 10008015 04014003 mov.b32 $r5, 0x40140000 000038: d00e0209 80800780 mov.b64 $r2, g[$r1] 000040: e0040409 80000780 mul.rn.f64 $r2, $r2, $r4 000050: d00e0009 a0800781 mov.end.b64 g[$r0], $r2
これをみると次のことが推測できる。
$ ./deviceQuery CUDA Device Query (Runtime API) version (CUDART static linking) There is 1 device supporting CUDA Device 0: "GeForce 8400 GS" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 1 Total amount of global memory: 267714560 bytes Number of multiprocessors: 1 Number of cores: 8 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.40 GHz Concurrent copy and execution: No Run time limit on kernels: No Integrated: No Support host page-locked memory mapping: No Compute mode: Default (multiple host threads can use this device simultaneously)
$ ./deviceQuery CUDA Device Query (Runtime API) version (CUDART static linking) There are 2 devices supporting CUDA Device 0: "Tesla C1060" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 4294705152 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.30 GHz Concurrent copy and execution: Yes Run time limit on kernels: No Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously)
$ ./bandwidthTest Running on...... device 0:GeForce 8400 GS Quick Mode Host to Device Bandwidth for Pageable memory . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 1926.5 Quick Mode Device to Host Bandwidth for Pageable memory . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 1396.0 Quick Mode Device to Device Bandwidth . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 4050.7
$ ./bandwidthTest Running on...... device 0:Tesla C1060 Quick Mode Host to Device Bandwidth for Pageable memory . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 4687.9 Quick Mode Device to Host Bandwidth for Pageable memory . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 3680.1 Quick Mode Device to Device Bandwidth . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 73361.7