r/CUDA
I'm trying to efficiently sum the output of a 16x16 interaction along each axis. I recently discovered the magic intrinsic __shfl_down_sync, which is not only much faster at summing along the x axis than my previous manual loop with synchronizations, but also deterministic.
As I understand it, this only works within a warp, and to sum along y I'd need strides larger than the warp size. So what do I do? Is there a trick for doing reductions along dimensions other than x?
Can I "transpose" the thread-local data within the thread block, so that I can do another reduction along x that effectively operates on the y data?
Here's the semi-pseudocode of what I'm trying to do:
// blockDim = (16, 16, 1)
__global__ void NbInteractionKernel(const Data* const superClusters, ForceEnergy* const results) {
    static_assert(SuperCluster::maxParticles == 16, "This kernel relies on SuperCluster::maxParticles being 16");
    //__shared__ PData pqd[SuperCluster::maxParticles * 2];
    __shared__ Data data;
    __shared__ ForceEnergy feSc0[SuperCluster::maxParticles];
    __shared__ ForceEnergy feSc1[SuperCluster::maxParticles];

    // Set up input data //

    const ForceEnergy forceEnergy = LJ::ComputeParticleParticleNB(data);

    // Warp-reduce along the x dimension (each row is 16 lanes, hence width = 16)
    ForceEnergy forceEnergyX = forceEnergy;
    for (int offset = 8; offset > 0; offset >>= 1)
    {
        forceEnergyX.force.x += __shfl_down_sync(0xffffffff, forceEnergyX.force.x, offset, 16);
        // repeat for other components..
    }
    if (threadIdx.x == 0) {
        feSc1[threadIdx.y] = forceEnergyX;
    }

    // Warp-reduce along the y dimension
    ForceEnergy forceEnergyY = forceEnergy;
    for (/*??*/)
    {
        // ???? <-- this is the part I can't figure out
        // repeat for other components..
    }
    if (threadIdx.y == 0) {
        feSc0[threadIdx.x] = forceEnergyY;
    }

    // Push results to global mem //
}
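For context, the launch looks roughly like this (numSuperClusterPairs and the pointer names are just placeholders for what my real code does):

const dim3 block(16, 16, 1);
NbInteractionKernel<<<numSuperClusterPairs, block>>>(d_superClusters, d_results);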
# Criteria
- The output must be deterministic
- I can't put the data in shared memory: 16 bytes * 16^2 = 4 KB, which is quite a lot over the budget for this kernel
u/ElectronGoBrrr — 12 days ago