thorstenhater authored
Remove a redundant copy in favor of a swap operation for a gain in performance; especially on GPU since copies are synchronous. Similarly, instead of solving the linear system into an intermediate array, write output directly into the target.

Here is the effect on the busyring benchmark (swapped pas -> hh) with 8192 cells on a V100 GPU (time for model-run in seconds).

```
|----------+--------------------------------+------------------------------------|
| Baseline | fvm_lowered_cell: copy -> swap | matrix: solve + copy -> solve_into |
|----------+--------------------------------+------------------------------------|
| 2.230    | 2.199                          | 2.129                              |
| 2.231    | 2.209                          | 2.132                              |
| 2.225    | 2.209                          | 2.136                              |
| 2.227    | 2.186                          | 2.130                              |
| 2.220    | ...
```
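The two changes described in the commit message can be illustrated with a minimal host-side sketch. Everything here is an assumption for illustration: `state`, `update_by_copy`, `update_by_swap`, and `tridiagonal` are hypothetical names, not the project's actual types, and a plain tridiagonal Thomas solver stands in for whatever solver the real `matrix` class uses. Only the name `solve_into` is taken from the commit message; its signature here is a guess.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the cell state: the voltage array and a
// scratch buffer that receives the solution of the linear system.
struct state {
    std::vector<double> voltage;
    std::vector<double> solution;
};

// Before: element-wise copy of the solution into the voltage, O(n).
void update_by_copy(state& s) {
    s.voltage = s.solution;
}

// After: exchange the two buffers, O(1) and no data movement.
void update_by_swap(state& s) {
    std::swap(s.voltage, s.solution);
}

// Hypothetical tridiagonal system (not the project's matrix class).
// lower[i] and upper[i] are the sub- and super-diagonal entries of row i;
// lower[0] and upper[n-1] are unused.
struct tridiagonal {
    std::vector<double> lower, diag, upper, rhs;

    // Solve the system and write the result directly into `to`,
    // instead of solving into an internal array and copying afterwards.
    void solve_into(std::vector<double>& to) const {
        const std::size_t n = diag.size();
        assert(to.size() == n);
        if (n == 0) return;

        // Forward sweep (Thomas algorithm); `c` holds the modified
        // super-diagonal, `to` holds the modified right-hand side.
        std::vector<double> c(n);
        c[0] = upper[0] / diag[0];
        to[0] = rhs[0] / diag[0];
        for (std::size_t i = 1; i < n; ++i) {
            const double m = diag[i] - lower[i] * c[i - 1];
            c[i] = upper[i] / m;
            to[i] = (rhs[i] - lower[i] * to[i - 1]) / m;
        }

        // Back substitution, in place in `to`.
        for (std::size_t i = n - 1; i-- > 0;) {
            to[i] -= c[i] * to[i + 1];
        }
    }
};
```

Note that after `update_by_swap` the scratch buffer holds the previous voltage rather than a copy of the new one, so the swap is only a valid replacement when the scratch buffer is fully overwritten before it is read again. With `solve_into`, a caller could pass the voltage array itself as the target and skip the scratch buffer entirely.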
Unverified a316dd87
test_matrix.cpp 4.98 KiB