u/ElectronGoBrrr

▲ 7 r/CUDA

I'm trying to efficiently do an axis-wise sum of the output of a 16x16 interaction. I recently discovered this magic intrinsic, which is not only much faster at summing along the x axis than my previous manual loop with synchronizations, but also deterministic:
__shfl_down_sync
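
To show what I mean, here is a stripped-down standalone version of the x-wise reduction (a hypothetical sketch with plain floats instead of my ForceEnergy struct, and a made-up kernel name):

// Minimal sketch of the x-wise reduction, launched with blockDim = (16,16,1).
// With width = 16 the shuffle stays inside one 16-lane segment, i.e. one row of the block,
// so no shared memory or __syncthreads() is needed.
__global__ void RowSumSketch(const float* in, float* rowSums) {
  float v = in[threadIdx.y * 16 + threadIdx.x];  // one value per thread of the 16x16 interaction
  for (int offset = 8; offset > 0; offset >>= 1)
  {
    v += __shfl_down_sync(0xffffffff, v, offset, 16);
  }
  if (threadIdx.x == 0) {
    rowSums[threadIdx.y] = v;  // the first lane of each row now holds that row's sum
  }
}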

As I understand it, this only works within a warp, and the values I need to sum along y are strided by blockDim.x = 16 in the linear thread index, so they span several warps. So what do I do? Is there a trick to do reductions along dimensions other than x?
Can I "transpose" the thread-local data in the threadblock, so I can do another reduction along x that effectively operates on the y data?

Here's the semi-pseudo-code of what I'm trying to do:

// blockDim = (16, 16, 1)
__global__ void NbInteractionKernel(const Data* const superClusters, ForceEnergy* const results) {
  static_assert(SuperCluster::maxParticles == 16, "This kernel relies on SuperCluster::maxParticles being 16");
  //__shared__ PData pqd[SuperCluster::maxParticles * 2];
  __shared__ Data data;

  __shared__ ForceEnergy feSc0[SuperCluster::maxParticles];
  __shared__ ForceEnergy feSc1[SuperCluster::maxParticles];

  // Setup input data //


  const ForceEnergy forceEnergy = LJ::ComputeParticleParticleNB(data);

  // warpreduce along x-dim
  ForceEnergy forceEnergyX = forceEnergy;
  for (int offset = 8; offset > 0; offset >>= 1)
  {
    forceEnergyX.force.x += __shfl_down_sync(0xffffffff, forceEnergyX.force.x, offset, 16);
    // repeat for other components..
  }
  if (threadIdx.x == 0) {
    feSc1[threadIdx.y] = forceEnergyX;
  }



  // warpreduce along y-dim
  ForceEnergy forceEnergyY = forceEnergy;
  for (/*??*/)
  {
    // ????
    // repeat for other components..
  }
  if (threadIdx.y == 0) {
    feSc0[threadIdx.x] = forceEnergyY;  // store the y-wise sum (assuming it ends up in the y == 0 threads)
  }


  // Push results to Global Mem //
}

# Criteria

  1. The output must be deterministic
  2. I can't put the data in shared memory: 16 bytes * 16^2 = 4 KB, which is quite a lot above the budget for this kernel