A. Size of a Grid:
- gridDim.x (1Dimensional)
- gridDim.x (2Dimensional, assuming an N x N Grid, so gridDim.x = gridDim.y = N and the grid holds gridDim.x * gridDim.y blocks)
B. Size of a Block:
- blockDim.x (1Dimensional)
- blockDim.x (2Dimensional, assuming an N x N Block, so blockDim.x = blockDim.y = N and the block holds blockDim.x * blockDim.y threads)
C. Thread Local Index within its block (assuming a 1Dimensional Block):
- threadIdx.x
D. Block Index within the grid:
- blockIdx.x (1Dimensional)
- blockIdx.x (2Dimensional) --> Current Column Index (Length) of the block in an N x N Grid
- blockIdx.y (2Dimensional) --> Current Row Index (Height) of the block in an N x N Grid
E. Thread Global Index across the entire grid (assuming a 1 Dimensional Grid):
- (blockDim.x * blockIdx.x) + threadIdx.x
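As a minimal sketch of the 1D global index in use, the kernel below has each thread write its own global index into an array. The names `writeGlobalIndex1D` and `runGlobalIndex1D` are made up for illustration, and error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Each thread stores its global index, so out[i] == i after the launch.
__global__ void writeGlobalIndex1D(int *out, int n)
{
    int i = (blockDim.x * blockIdx.x) + threadIdx.x;  // 1D global index
    if (i < n)            // guard: the last block may reach past n
        out[i] = i;
}

// Hypothetical host helper (not part of CUDA): launch and copy back.
std::vector<int> runGlobalIndex1D(int n, int threadsPerBlock)
{
    // Round up so every element gets a thread.
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int *dOut = nullptr;
    cudaMalloc(&dOut, static_cast<size_t>(n) * sizeof(int));

    writeGlobalIndex1D<<<numBlocks, threadsPerBlock>>>(dOut, n);

    std::vector<int> hOut(n);
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(hOut.data(), dOut, static_cast<size_t>(n) * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}
```

The `if (i < n)` guard matters whenever n is not a multiple of the block size, because the grid is rounded up and the extra threads must not write out of bounds.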
F. Thread Global Index across the entire grid (assuming a 2Dimensional Grid of 2Dimensional Blocks):
F-1. Obtain the current global column index:
- (blockIdx.x * blockDim.x) + threadIdx.x
F-2. Obtain the current global row index (for an N x N Block, blockDim.y equals blockDim.x):
- (blockIdx.y * blockDim.y) + threadIdx.y
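Here is a rough sketch of the column and row formulas inside a kernel: every thread records its own (column, row) pair. `writeGlobalIndex2D` and `runGlobalIndex2D` are hypothetical names chosen for this example, the block is assumed to be square, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Each thread stores its global (column, row) pair as an int2 (x = col, y = row).
__global__ void writeGlobalIndex2D(int2 *out, int width, int height)
{
    int col = (blockIdx.x * blockDim.x) + threadIdx.x;  // global column
    int row = (blockIdx.y * blockDim.y) + threadIdx.y;  // global row
    if (col < width && row < height)                    // stay inside the data
        out[row * width + col] = make_int2(col, row);
}

// Hypothetical host helper: square blocks of blockEdge x blockEdge threads.
std::vector<int2> runGlobalIndex2D(int width, int height, int blockEdge)
{
    dim3 block(blockEdge, blockEdge);
    dim3 grid((width + blockEdge - 1) / blockEdge,    // blocks along x (columns)
              (height + blockEdge - 1) / blockEdge);  // blocks along y (rows)

    size_t count = static_cast<size_t>(width) * height;
    int2 *dOut = nullptr;
    cudaMalloc(&dOut, count * sizeof(int2));

    writeGlobalIndex2D<<<grid, block>>>(dOut, width, height);

    std::vector<int2> hOut(count);
    cudaMemcpy(hOut.data(), dOut, count * sizeof(int2), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}
```

Note that x is tied to columns and y to rows, so neighbouring threads in x touch neighbouring elements of the same row.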
Example: N = 1024, so you have to process N x N elements (1024 x 1024). You could decompose the grid like so: set the block size to 16 threads per dimension. Then grid size = N / block size --> grid size = 1024 / 16 = 64 blocks per dimension. Maybe not the most efficient way, but since it's only an example it will do!
So your grid is composed of 4096 Blocks (64 x 64), and each Block is composed of 256 threads (16 x 16).
Total Blocks * Total Threads per Block = 4096 * 256 = 1,048,576 = N * N = 1024 * 1024.
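The same arithmetic can be sketched on the host with `dim3`. `LaunchConfig` and `makeSquareLaunchConfig` are hypothetical helpers, not part of CUDA, and the sketch assumes a square problem with square blocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical container for the two launch parameters.
struct LaunchConfig {
    dim3 grid;   // number of blocks per dimension
    dim3 block;  // number of threads per block, per dimension
};

// Build a 2D launch configuration for an n x n problem using
// blockEdge x blockEdge threads per block.
LaunchConfig makeSquareLaunchConfig(unsigned int n, unsigned int blockEdge)
{
    LaunchConfig cfg;
    cfg.block = dim3(blockEdge, blockEdge);
    // Round up so the grid still covers n when n is not a multiple of blockEdge.
    unsigned int gridEdge = (n + blockEdge - 1) / blockEdge;
    cfg.grid = dim3(gridEdge, gridEdge);
    return cfg;
}
```

With n = 1024 and blockEdge = 16 this gives a 64 x 64 grid of 16 x 16 blocks, matching the 4096 blocks of 256 threads in the example. A kernel would then be launched as `kernel<<<cfg.grid, cfg.block>>>(...)`.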
To process each element serially, you would probably have a nested for loop:
for (int row = 0; row < N; ++row)
    for (int col = 0; col < N; ++col)
        // process the element at index (row * N) + col
To access each element for processing in CUDA (assuming you are storing results in a 1D array):
- (Global Row * Number of Elements) + Global Column
- Global Row = (blockIdx.y * blockDim.y + threadIdx.y)
- Global Column = (blockIdx.x * blockDim.x + threadIdx.x)
- Number of Elements = N = number of elements per row, i.e. Length wise (1024 in my example)
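Putting it together, here is a minimal sketch of a kernel that replaces the nested loop: one thread per element, each using the flattened index. The operation (multiplying by a factor) and the names `scaleMatrix` and `runScaleMatrix` are just placeholders for illustration. The 16 x 16 block is an assumption carried over from the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// One thread per element of an n x n matrix stored row by row in a 1D array.
__global__ void scaleMatrix(float *data, int n, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Global Column
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Global Row
    if (row < n && col < n)
        data[row * n + col] *= factor;  // (Global Row * N) + Global Column
}

// Hypothetical host helper: copy in, launch, copy out.
std::vector<float> runScaleMatrix(const std::vector<float> &in, int n, float factor)
{
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dData = nullptr;
    cudaMalloc(&dData, bytes);
    cudaMemcpy(dData, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                      // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x,   // round up in both dimensions
              (n + block.y - 1) / block.y);
    scaleMatrix<<<grid, block>>>(dData, n, factor);

    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return out;
}
```

For n = 1024 the launch uses the 64 x 64 grid of 16 x 16 blocks from the example, and the bounds check becomes a no-op because 1024 divides evenly by 16.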
More quick tips in the future ...