thorstenhater authored
Fuse kernels `gather` and `vec_minus` into a single kernel `set_dt_impl` for a small
performance improvement.
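
To illustrate what the fusion means, here is a minimal CPU-side sketch in plain C++. It is not the actual GPU kernel. The helper names `vec_minus_ref`, `gather_ref` and `set_dt_fused`, the argument names, and the assumed semantics (an element-wise difference followed by an indexed copy) are hypothetical and chosen only for this example.

```cpp
#include <cstddef>
#include <vector>

// Assumed semantics of the first pass: element-wise difference a[i] - b[i].
std::vector<double> vec_minus_ref(const std::vector<double>& a,
                                  const std::vector<double>& b) {
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] - b[i];
    }
    return out;
}

// Assumed semantics of the second pass: out[i] = src[index[i]].
std::vector<double> gather_ref(const std::vector<double>& src,
                               const std::vector<std::size_t>& index) {
    std::vector<double> out(index.size());
    for (std::size_t i = 0; i < index.size(); ++i) {
        out[i] = src[index[i]];
    }
    return out;
}

// Fused variant: one pass that subtracts and gathers at the same time,
// so the intermediate difference array is never materialised.
std::vector<double> set_dt_fused(const std::vector<double>& time_to,
                                 const std::vector<double>& time,
                                 const std::vector<std::size_t>& index) {
    std::vector<double> out(index.size());
    for (std::size_t i = 0; i < index.size(); ++i) {
        const std::size_t j = index[i];
        out[i] = time_to[j] - time[j];
    }
    return out;
}
```

Under these assumptions, `set_dt_fused(time_to, time, index)` returns the same values as `gather_ref(vec_minus_ref(time_to, time), index)`. On a GPU the fused form would additionally save one kernel launch and one round trip through global memory for the intermediate array, which is consistent with the small gain reported below.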

Effect on the busyring benchmark (with `pas` swapped for `hh`), 8192 cells on a
V100 GPU. Times are for `model-run` in seconds; the final row is the minimum of
each column.

```
|----------+-------|
| Baseline | After |
|----------+-------|
|    2.318 | 2.314 |
|    2.335 | 2.307 |
|    2.345 | 2.315 |
|    2.333 | 2.306 |
|    2.331 | 2.320 |
|----------+-------|
|    2.318 | 2.306 |
|----------+-------|
```
5f9f2a5a