    Performance/copy to swap (#1027) · a316dd87
    thorstenhater authored
    Replace a redundant copy with a swap operation to improve performance,
    especially on GPU, where copies are synchronous. Similarly, instead of solving the
    linear system into an intermediate array and then copying the result, write the
    solution directly into the target.
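
A minimal C++ sketch of the two changes, assuming plain host-side `std::vector` buffers. All names here (`cell_state`, `commit_by_copy`, `commit_by_swap`, `tridiag`, `demo`) are hypothetical illustrations, not the actual `fvm_lowered_cell` or matrix API. `solve_into` only echoes the name used in the benchmark table below, and its signature is an assumption. A plain Thomas-algorithm tridiagonal solver stands in for the real solver.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-cell state: `voltage` is the live array, `scratch` is a
// same-sized buffer that each step writes its new values into.
struct cell_state {
    std::vector<double> voltage;
    std::vector<double> scratch;

    // Before: copy the new values back, O(n) element copies
    // (on a GPU this corresponds to a synchronous device copy).
    void commit_by_copy() {
        voltage.assign(scratch.begin(), scratch.end());
    }

    // After: exchange the two buffers in O(1); only internal pointers move.
    // The old contents of `voltage` now live in `scratch` and are stale.
    void commit_by_swap() {
        std::swap(voltage, scratch);
    }
};

// Hypothetical tridiagonal system: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
struct tridiag {
    std::vector<double> a, b, c;

    // Thomas algorithm that writes the solution straight into the caller's
    // `x` instead of an internal result array the caller must copy out.
    // `d` (the right-hand side) is overwritten during forward elimination.
    void solve_into(std::vector<double>& d, std::vector<double>& x) const {
        const std::size_t n = b.size();
        assert(n > 0 && a.size() == n && c.size() == n && d.size() == n && x.size() == n);
        std::vector<double> cp(n);  // modified upper diagonal
        cp[0] = c[0] / b[0];
        d[0] = d[0] / b[0];
        for (std::size_t i = 1; i < n; ++i) {
            const double m = b[i] - a[i] * cp[i - 1];
            cp[i] = c[i] / m;
            d[i] = (d[i] - a[i] * d[i - 1]) / m;
        }
        // Back substitution directly into the target buffer.
        x[n - 1] = d[n - 1];
        for (std::size_t i = n - 1; i-- > 0;) {
            x[i] = d[i] - cp[i] * x[i + 1];
        }
    }
};

// Usage: solve a 3x3 system whose exact solution is (1, 1, 1),
// writing into `scratch` and then publishing it with a swap.
std::vector<double> demo() {
    tridiag m{{0.0, -1.0, -1.0}, {2.0, 2.0, 2.0}, {-1.0, -1.0, 0.0}};
    std::vector<double> rhs{1.0, 0.0, 1.0};
    cell_state s{std::vector<double>(3, 0.0), std::vector<double>(3, 0.0)};
    m.solve_into(rhs, s.scratch);  // solution lands directly in the scratch buffer
    s.commit_by_swap();            // publish it without an element-wise copy
    return s.voltage;
}
```

The swap is only valid because the next step fully overwrites `scratch` before reading it. If any code still expected the old values to survive in the scratch buffer, the copy would have to stay.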
    
    Here is the effect on the busyring benchmark (with pas swapped for hh), using 8192
    cells on a V100 GPU (model-run time in seconds):
    ```
    |----------+--------------------------------+------------------------------------|
    | Baseline | fvm_lowered_cell: copy -> swap | matrix: solve + copy -> solve_into |
    |----------+--------------------------------+------------------------------------|
    |    2.230 |                          2.199 |                              2.129 |
    |    2.231 |                          2.209 |                              2.132 |
    |    2.225 |                          2.209 |                              2.136 |
    |    2.227 |                          2.186 |                              2.130 |
    |    2.220 |   ...
    ```