test/unit/test_matrix.cpp · 74e911e6b61c4b9689095668fbdd030bb46ab60e · arbor-sim / arbor

Performance/copy to swap (#1027) · a316dd87

thorstenhater authored 4 years ago

Replace a redundant copy with a swap operation to improve performance,
especially on GPU, where copies are synchronous. Similarly, instead of solving the
linear system into an intermediate array and then copying the result, write the
output directly into the target.
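
The two changes described above can be sketched roughly as follows. This is only an illustration on the CPU with `std::vector`, not Arbor's actual code: the types `state` and `tridiag_system`, their members, and the functions `publish_by_copy` and `publish_by_swap` are hypothetical names, and the Thomas algorithm for a tridiagonal system stands in for whatever solver the real matrix class uses. Only the name `solve_into` is taken from the table header in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical state with a result buffer and a scratch buffer of equal size.
struct state {
    std::vector<double> voltage;
    std::vector<double> scratch;
};

// Before: element-wise copy, O(n). On a GPU this would be a device copy.
void publish_by_copy(state& s) { s.voltage = s.scratch; }

// After: exchange the underlying storage, O(1), no data is moved.
void publish_by_swap(state& s) { std::swap(s.voltage, s.scratch); }

// Hypothetical tridiagonal system; a stand-in for the real matrix type.
struct tridiag_system {
    std::vector<double> lower;  // sub-diagonal, lower[0] is unused
    std::vector<double> diag;   // main diagonal
    std::vector<double> upper;  // super-diagonal, upper[n-1] is unused
    std::vector<double> rhs;    // right-hand side

    // Solve with the Thomas algorithm and write the solution directly into
    // `out`, rather than into an internal array that is copied out later.
    void solve_into(std::vector<double>& out) const {
        const std::size_t n = diag.size();
        assert(out.size() == n && rhs.size() == n);
        if (n == 0) return;

        std::vector<double> c(n);  // modified super-diagonal

        // Forward sweep: eliminate the sub-diagonal.
        c[0] = upper[0] / diag[0];
        out[0] = rhs[0] / diag[0];
        for (std::size_t i = 1; i < n; ++i) {
            const double m = diag[i] - lower[i] * c[i - 1];
            c[i] = upper[i] / m;
            out[i] = (rhs[i] - lower[i] * out[i - 1]) / m;
        }

        // Back substitution, in place in `out`.
        for (std::size_t i = n - 1; i-- > 0;) {
            out[i] -= c[i] * out[i + 1];
        }
    }
};
```

A caller would pass the destination buffer straight to the solver, for example `sys.solve_into(s.voltage)`, which avoids both the intermediate array and the follow-up copy. Note that the swap variant leaves the old contents of the target in the scratch buffer, so it only fits when the scratch contents are not needed afterwards.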

Here is the effect on the busyring benchmark (swapped pas -> hh) with 8192 cells on a
V100 GPU (time for model-run in seconds).
```
|----------+--------------------------------+------------------------------------|
| Baseline | fvm_lowered_cell: copy -> swap | matrix: solve + copy -> solve_into |
|----------+--------------------------------+------------------------------------|
|    2.230 |                          2.199 |                              2.129 |
|    2.231 |                          2.209 |                              2.132 |
|    2.225 |                          2.209 |                              2.136 |
|    2.227 |                          2.186 |                              2.130 |
|    2.220 |   ...
```


test_matrix.cpp 4.98 KiB