Fix reduce-by-key CUDA (#737)
Fix a bug in the CUDA `reduce_by_key` implementation on V100 or later GPUs. The bug was not triggered by the current uses of the algorithm in Arbor, but it will be a problem when more than one reduction is performed in a single kernel invocation, which is required for accumulating both current and conductance values.

* Use warp-synchronous aware operations to avoid problems on V100.
* Simplify reduction kernel.
* Rename `run_length` ancillary data structure to `key_set_pos`.
* Add unit tests that trigger the incorrect behaviour observed in #736.

Fixes #736.
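For readers unfamiliar with the operation, the following is a minimal serial sketch of what a reduce-by-key is expected to compute, of the kind a unit test could compare GPU output against. It is not the CUDA kernel from this change: the name `reduce_by_key_reference` and its signature are hypothetical, and it assumes that equal keys are stored contiguously.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial reference, not Arbor's CUDA implementation.
// Adds contrib[i] to target[keys[i]] for every i.
// Assumption: entries with equal keys are adjacent (e.g. a sorted index array).
template <typename T, typename I>
void reduce_by_key_reference(const std::vector<T>& contrib,
                             const std::vector<I>& keys,
                             std::vector<T>& target)
{
    assert(contrib.size() == keys.size());

    std::size_t i = 0;
    while (i < keys.size()) {
        // Sum the run of consecutive entries that share keys[i].
        T sum = contrib[i];
        std::size_t j = i + 1;
        while (j < keys.size() && keys[j] == keys[i]) {
            sum += contrib[j];
            ++j;
        }

        // Accumulate once per run of equal keys.
        target[keys[i]] += sum;
        i = j;
    }
}
```

Calling the reference once for current contributions and once for conductance contributions, with the same `keys`, gives the two results that a single kernel invocation performing both reductions would be expected to reproduce.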