Gpu/fuse set dt (#1025)
Fuse kernels `gather` and `vec_minus` into a single kernel `set_dt_impl` for a small performance improvement.

Here is the effect on the busyring benchmark (with `pas` swapped for `hh`) with 8192 cells on a V100 GPU. Values are the time for `model-run` in seconds; the final row matches the minimum of each column.

```
|----------+-------|
| Baseline | After |
|----------+-------|
|    2.318 | 2.314 |
|    2.335 | 2.307 |
|    2.345 | 2.315 |
|    2.333 | 2.306 |
|    2.331 | 2.320 |
|----------+-------|
|    2.318 | 2.306 |
|----------+-------|
```
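For readers unfamiliar with the pattern, the fusion can be sketched on the host in plain C++, with each loop standing in for one kernel launch. This is only an illustration, not the code from this change: the function names `set_dt_unfused` and `set_dt_fused`, the argument names, and the assumption that every cell owns at least one CV are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two passes over memory, each standing in for one kernel launch.
void set_dt_unfused(std::vector<double>& dt_cell, std::vector<double>& dt_cv,
                    const std::vector<double>& time_to,
                    const std::vector<double>& time,
                    const std::vector<int>& cv_to_cell) {
    assert(time_to.size() == time.size() && dt_cell.size() == time.size());
    assert(dt_cv.size() == cv_to_cell.size());
    // Pass 1 (the role of vec_minus): per-cell time step.
    for (std::size_t i = 0; i < dt_cell.size(); ++i) {
        dt_cell[i] = time_to[i] - time[i];
    }
    // Pass 2 (the role of gather): copy each cell's dt to its CVs.
    for (std::size_t j = 0; j < dt_cv.size(); ++j) {
        dt_cv[j] = dt_cell[cv_to_cell[j]];
    }
}

// Fused: a single pass over CVs computes the difference and writes both
// outputs, so the intermediate dt_cell is never read back from memory.
// Assumes every cell has at least one CV; otherwise dt_cell would have gaps.
void set_dt_fused(std::vector<double>& dt_cell, std::vector<double>& dt_cv,
                  const std::vector<double>& time_to,
                  const std::vector<double>& time,
                  const std::vector<int>& cv_to_cell) {
    assert(time_to.size() == time.size() && dt_cell.size() == time.size());
    assert(dt_cv.size() == cv_to_cell.size());
    for (std::size_t j = 0; j < dt_cv.size(); ++j) {
        const int c = cv_to_cell[j];
        const double dt = time_to[c] - time[c];
        dt_cell[c] = dt;  // CVs of the same cell all store the same value
        dt_cv[j] = dt;
    }
}
```

In the fused form several CVs of one cell write to the same `dt_cell` entry, but they all write the same value, so the result does not depend on ordering. The saving comes from one launch and one memory pass fewer, which is consistent with the small difference in the timings above.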