**CMake and build refactoring**

* Use CUDA as a first-class language (leading to a CMake 3.9 minimum version requirement).
* Use 'modern CMake' interface libraries for compiler options, include file and library dependency tracking. Interface library targets:
    * `arbor-deps`: compiler options and library requirements for the `libarbor.a` static library, as governed by configure-time options and environment.
    * `arbor-private-headers`: include path for non-installed headers, as required by unit tests and arbor itself.
    * `arbor-aux`: helper classes and utilities used across tests and examples.
    * `ext-json`, `ext-tclap`, `ext-tbb`, `ext-benchmark`, `ext-sphinx_rtd_theme`: externally maintained software that we include (directly or via submodule) in the `ext/` subdirectory.
* Single static library `libarbor.a` includes all built-in modules and CUDA objects.
* Simplify configuration options:
    * `ARB_WITH_TRACE`, `ARB_AUTORUN_MODCC_ON_CHANGES` and `ARB_SYSTEM_TYPE` removed.
    * An external `modcc` is provided by the `ARB_MODCC` configuration option; if provided, `modcc` is still buildable, but is not included in the default target.
    * `ARB_PRIVATE_TBBLIB`, defaulting to `OFF`, instructs the build to make TBB from the included submodule.
* Extend `ErrorTarget` functionality to provide a dummy target or an error target based on a condition.
* Generate header version defines and library version variables based on git status and project version, via the new script `include/git-source-id`.
* All generated binaries are now placed in the `bin/` subdirectory of the build directory.
* The install target installs: public headers (incomplete); the static library; the `modcc` tool; the `lmorpho` executable; `html` documentation (examples, tests and validation data are currently not installed).
* Executable targets have had the `.exe` suffix removed; unit tests are labelled `unit` (arbor unit tests), `unit-modcc` (modcc unit tests), `unit-local` (distributed tests with local context) and `unit-mpi` (distributed tests with MPI context).
* More graceful handling of configure-time detection of `nrniv`, Julia and the Julia modules required for validation data generation.
* Add `cmake/FindJulia.cmake` and `cmake/FindTBB.cmake` package finders, and adjust `cmake/FindUnwind.cmake` to use link library-style properties.
* Adjust the Travis script to test `unit-local` and `unit-mpi` as appropriate.
* Simplify the documentation `conf.py`.

**Source relocation and reorganization**

* All external project sources and files moved to `ext/`.
* Source code refactoring to decouple library-using code from the configure-time definitions that govern arbor behaviour: removes conditional code in public headers that depends upon `ARB_WITH_X`-type definitions at compile time. Affected code is in the public interfaces for MPI, the threading implementation, and the profiler.
* Remove `util/debug.hpp`; split out pretty-printing functionality from assertion handling.
* Make the FVM cell non-physical voltage check a run-time cell-group parameter.
* Move the spike double-buffer implementation to `simulation.cpp`.
* Make the timer utility wrap POSIX `clock_gettime`, independent of the threading configuration.
* Make `mpi_error` derive from `system_error` and follow C++11 `system_error` semantics.
* Replace the `EXPECTS` macro with the `arb_assert` macro.
* JSON dependency removed from `libarbor.a` and header files: moved to the auxiliary library.
* Publicly visible macros gain an `ARB_` prefix as required.
* Move the SWC test file to the `test/unit` directory.
* Work-in-progress splitting of public from private includes: as a convention not yet entirely adhered to, private headers within the arbor source are included with `""`, public headers with `<>`.

**Modcc interface changes**

* Expose via the `--namespace` option the functionality that sets the namespace in generated code.
* Use the `--profile` option to add profiler hooks to generated code; this uses the public function interface directly rather than the `PE`/`PL` macros, in order to avoid public `PE` and `PL` defines.
# Library microbenchmarks
The benchmarks here are intended to:
- answer questions regarding choices of implementation in the library where performance is a concern;
- track the performance behaviour of isolated bits of library functionality across different platforms.
## Building and running
The micro-benchmarks are not built by default. After configuring CMake, they can be built with `make ubenches`. Each benchmark is provided by a stand-alone C++ source file in `tests/ubench`; the resulting executables are found in `test/ubench` relative to the build directory.

Google benchmark is used as a harness. It is included in the repository via a git submodule, and the provided CMake scripts will attempt to run `git submodule update --init` on the submodule if it appears not to have been instantiated.
## Adding new benchmarks
New benchmarks are added by placing the corresponding implementation as a stand-alone `.cpp` file in `tests/ubench` and adding the name of this file to the list `bench_sources` in `tests/ubench/CMakeLists.txt`.

Each new benchmark should also have a corresponding entry in this `README.md`, describing the motivation for the test and summarising at least one benchmark result. A minimal skeleton of such a benchmark source file is sketched below.
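For orientation, a skeleton of such a stand-alone benchmark file might look like the following; the benchmark name, body and size range are purely illustrative and do not correspond to an existing benchmark in the suite.

```c++
#include <cstddef>
#include <vector>

#include <benchmark/benchmark.h>

// Illustrative benchmark body: sum a vector of n doubles.
static void sum_vector(benchmark::State& state) {
    std::size_t n = state.range(0);
    std::vector<double> v(n, 1.0);

    for (auto _: state) {
        double sum = 0;
        for (double x: v) sum += x;
        benchmark::DoNotOptimize(sum);   // keep the result observable
    }
}

// Register the benchmark over a range of problem sizes.
BENCHMARK(sum_vector)->Range(256, 1<<20);

BENCHMARK_MAIN();
```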
Results in this file are destined to become out of date; we should consider some form of semi-automated registration of results in a database should the number of benchmarks become otherwise unwieldy.
## Benchmarks

### `accumulate_functor_values`

#### Motivation
The problem arises when constructing the partition of an integral range where the sizes of each sub-interval are given by a function of the index. This requires the computation of the partial sums

d_i = Σ_{j<i} f(j).

One approach using the provided range utilities is to use `std::partial_sum` with `util::transform_view` and `util::span`; the other is to simply write a loop that performs the accumulation directly. What is the extra cost, if any, of the transform-based approach?

The micro-benchmark compares the two implementations, where the function is a simple integer square operation, called either via a function pointer or a functional object. The two approaches are sketched below.
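As a rough sketch of the two strategies (not the benchmark code itself, which uses the `util::transform_view` and `util::span` utilities), the same comparison can be expressed with C++20 ranges; the helper names below are illustrative.

```c++
#include <cstddef>
#include <numeric>
#include <ranges>
#include <vector>

// f(i) gives the size of sub-interval i; the benchmark uses an integer square.
inline std::size_t f(std::size_t i) { return i*i; }

// Direct loop: accumulate the offsets by hand.
std::vector<std::size_t> offsets_direct(std::size_t n) {
    std::vector<std::size_t> d(n+1, 0);
    for (std::size_t i = 0; i<n; ++i) d[i+1] = d[i]+f(i);
    return d;
}

// Transform-based: partial_sum over a lazily transformed index range.
std::vector<std::size_t> offsets_transform(std::size_t n) {
    auto sizes = std::views::iota(std::size_t{0}, n) | std::views::transform(f);
    std::vector<std::size_t> d(n+1, 0);
    std::partial_sum(sizes.begin(), sizes.end(), d.begin()+1);
    return d;
}
```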
#### Results
Results here are presented only for vector size n equal to 1024.
Platform:
- Xeon E3-1220 v2 with base clock 3.1 GHz and max clock 3.5 GHz.
- Linux 4.4.34
- gcc version 6.2.0
- clang version 3.8.1
Compiler | direct/function | transform/function | direct/object | transform/object |
---|---|---|---|---|
g++ -O3 | 907 ns | 2090 ns | 907 ns | 614 ns |
clang++ -O3 | 1063 ns | 533 ns | 1051 ns | 532 ns |
### `cuda_compare_and_reduce`

#### Motivation
One possible mechanism for determining if device-side event delivery had exhausted all events is to see if the start and end of each event-span of each cell were equal or not. This is equivalent to the test

∃i: a[i] < b[i]

for device vectors a and b.

How expensive is it simply to copy the vectors over to the host and compare there? Benchmarking indicated that for vectors of up to approximately 10^5 elements, it was significantly faster to copy to host, despite the limited host–device bandwidth. This non-intuitive result appears to stem from the high overhead of `cudaMalloc`; pre-allocating space on the device for the reduction result restores the expected performance behaviour.
#### Implementations
Four implementations are considered:

- Copy both vectors to host, run a short-circuit compare.
- Use `thrust::zip_iterator` and `thrust::any_of` to run the reduction on the device.
- Use a short custom CUDA kernel to run the reduction, using `__syncthreads_or` and a non-atomic write to the result (a sketch of this kernel follows the list).
- Use the same CUDA kernel as above, but pre-allocate the device memory to store the result.

Note: a fairer host-based comparison would interleave comparison with data transfer.
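A minimal sketch of the kernel shared by the two custom-kernel implementations is given below; this is not the benchmark's exact code, and the two variants differ only on the host side, in whether the device word holding `result` is `cudaMalloc`'d for every query or allocated once up front and reused.

```cuda
// Each block computes a block-wide OR of the predicate a[i] < b[i] via
// __syncthreads_or; one thread per block then writes 1 to *result if the
// predicate held anywhere in its block. The write is non-atomic but benign:
// every writer stores the same value. *result must be zeroed before launch.
__global__ void any_lt_kernel(int n, const double* a, const double* b, int* result) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int pred = i<n && a[i]<b[i];
    if (__syncthreads_or(pred) && threadIdx.x==0) {
        *result = 1;
    }
}
```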
#### Results
Results here are presented for vector size n equal to 256, 512, 32768 and 262144, with the two vectors equal.
Platform:
- Xeon(R) CPU E5-2650 v4 with base clock 2.20 GHz and max clock 2.9 GHz.
- Tesla P100-PCIE-16GB
- Linux 3.10.0
- gcc version 5.3.0
- nvcc version 8.0.61
n | host copy | thrust | custom cuda | custom cuda noalloc |
---|---|---|---|---|
256 | 18265 ns | 41221 ns | 23255 ns | 16440 ns |
512 | 18466 ns | 286331 ns | 265113 ns | 16335 ns |
32768 | 102880 ns | 296836 ns | 265169 ns | 16758 ns |
262144 | 661724 ns | 305209 ns | 269095 ns | 19792 ns |
### `cuda_reduce_by_index`

#### Motivation
The reduction by key pattern with repeated keys is used when "point process" mechanism contributions to currents are collected. More than one point process, typically synapses, can be attached to a compartment, and when their contributions are computed and added to the per-compartment current in parallel, care must be taken to avoid race conditions. Early versions of Arbor used CUDA atomic operations to perform the accumulation, which works quite well up to a certain point. However, performance with atomics decreases as the number of synapses per compartment increases, i.e. as the number of threads performing simultaneous atomic updates on the same location increases.
#### Implementations
Two implementations are considered:

- Perform reductions inside each warp, which is a multi-step process (sketched after this list):
    - threads inside each warp determine which other threads they must perform a reduction with;
    - threads perform a binary-tree reduction using warp shuffle intrinsics;
    - one thread performs a CUDA atomic update for each key;
    - note that this approach takes advantage of the keys being sorted in ascending order.
- The naive (however simple) use of CUDA atomics.
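A heavily simplified sketch of the warp-level approach is given below. It is not Arbor's kernel: it assumes all 32 lanes of the warp are active with keys sorted in ascending order, and it uses the Volta-era `__match_any_sync` intrinsic to find the run of equal keys, whereas the implementation benchmarked here (CUDA 8, Pascal) derives the run boundaries from shuffles and ballots instead.

```cuda
// Accumulate `value` into out[key], with one atomicAdd per distinct key per warp.
// Equal keys are assumed to occupy a contiguous run of lanes (keys are sorted).
__device__ void reduce_by_key(double value, int key, double* out) {
    const unsigned all = 0xffffffffu;
    unsigned lane = threadIdx.x%32;
    unsigned run  = __match_any_sync(all, key);   // lanes holding the same key
    unsigned head = __ffs(run)-1;                 // first lane of that run

    // Binary-tree reduction with warp shuffles, adding only contributions
    // that come from lanes within the same run of equal keys.
    for (unsigned delta = 16; delta>0; delta /= 2) {
        double other = __shfl_down_sync(all, value, delta);
        if (lane+delta<32 && ((run>>(lane+delta)) & 1)) value += other;
    }

    // One atomic update per run instead of one per contribution.
    if (lane==head) atomicAdd(out+key, value);
}
```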
#### Results
Platform:
- Xeon(R) CPU E5-2650 v4 (Haswell 12 cores @ 2.20 GHz)
- Tesla P100-PCIE-16GB
- Linux 3.10.0
- gcc version 5.2.0
- nvcc version 8.0.61
Results are presented as speedup for warp intrinsics vs atomics, for both single and double precision. Note that the P100 GPU has hardware support for double precision atomics, and we expect a much larger speedup for double precision on Kepler GPUs, which emulate double precision atomics with CAS. The benchmark updates `n` locations, each with an average density of `d` keys per location. This is equivalent to `n` compartments with `d` synapses per compartment. Atomics are faster for the case where both `n` and `d` are small; however, the GPU backend is intended for throughput simulations, with large cell groups of at least 10k compartments in total.
**float**
n | d=1 | d=10 | d=100 | d=1000 |
---|---|---|---|---|
100 | 0.75 | 0.80 | 1.66 | 10.7 |
1k | 0.76 | 0.87 | 3.15 | 12.5 |
10k | 0.87 | 1.14 | 3.52 | 14.0 |
100k | 0.92 | 1.34 | 3.58 | 15.5 |
1000k | 1.18 | 1.43 | 3.53 | 15.2 |
**double**
n | d=1 | d=10 | d=100 | d=1000 |
---|---|---|---|---|
100 | 0.91 | 0.94 | 1.82 | 9.0 |
1k | 0.89 | 0.99 | 2.38 | 10.0 |
10k | 0.94 | 1.09 | 2.42 | 11.1 |
100k | 0.98 | 1.59 | 2.36 | 11.4 |
1000k | 1.13 | 1.63 | 2.36 | 11.4 |
### `event_setup`

#### Motivation
Post-synaptic events are generated by the communicator after it gathers the local spikes. One set of events is generated for each cell group, in an unsorted `std::vector<post_synaptic_event>`. Each cell group must take this unsorted vector, store the events, and for each integration interval generate a list of events that is sorted first by target gid, then by delivery time.

As currently implemented, this step is a significant serialization bottleneck on the GPU back end, where one thread must process many events before copying them to the GPU.

This benchmark tries to understand the behavior of the current implementation and to test some alternatives.
#### Implementations
Three implementations are considered:

- Single Queue (1Q) method (the current approach)
    - All events to be delivered to a cell group are pushed into a heap-based queue, ordered according to delivery time.
    - To build the list of events to deliver before `tfinal`, events are popped off the queue until the head of the queue is an event to be delivered at or after `tfinal`. These events are `push_back`-ed onto a `std::vector`.
    - The event vector is `std::stable_sort`-ed on target gid.
- Multi Queue (NQ) method
    - One queue is maintained for each cell in the cell group. The first phase pushes events into these smaller queues.
    - The queues are visited one by one, and events before `tfinal` are `push_back`-ed onto the single `std::vector`. With this approach the events are partitioned by target gid for free, and the lower overheads of pushing onto and popping from shorter queues should give a speedup.
- Multi Vector (NV) method (sketched after this list)
    - A very similar approach to the NQ method, with a `std::vector` of events maintained for each cell instead of a priority queue.
    - Events are `push_back`-ed onto the vectors, which are then sorted and searched for the sub-range of events to be delivered in the next integration interval.

This approach has the same complexity as the NQ approach, but is a more "low-level" approach that uses `std::sort`, as opposed to the ad-hoc heap sort of popping from a queue.
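For concreteness, a minimal sketch of the NV method follows; the event type, its field names and the store class are illustrative assumptions rather than Arbor's actual types.

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative event record; Arbor's post_synaptic_event layout may differ.
struct event {
    std::size_t target_gid;   // id of the target cell
    float time;               // delivery time
    float weight;
};

// NV method: one std::vector of pending events per cell in the group.
struct nv_event_store {
    std::vector<std::vector<event>> pending;

    explicit nv_event_store(std::size_t ncells): pending(ncells) {}

    // Phase 1: distribute the unsorted events to their per-cell vectors,
    // partitioning by target gid for free. Assumes target gids are
    // 0, ..., ncells-1, as in the benchmark setup.
    void add_events(const std::vector<event>& events) {
        for (const auto& e: events) pending[e.target_gid].push_back(e);
    }

    // Phase 2: sort each per-cell vector by time and split off the events due
    // before tfinal. Appending cell by cell yields a list ordered by target
    // gid first, then delivery time, with no global stable sort required.
    std::vector<event> events_before(float tfinal) {
        std::vector<event> due;
        for (auto& q: pending) {
            std::sort(q.begin(), q.end(),
                      [](const event& a, const event& b) { return a.time<b.time; });
            auto split = std::lower_bound(q.begin(), q.end(), tfinal,
                                          [](const event& e, float t) { return e.time<t; });
            due.insert(due.end(), q.begin(), split);
            q.erase(q.begin(), split);
        }
        return due;
    }
};
```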
#### Results
Platform:
- Xeon(R) CPU E5-2650 v4 (Haswell 12 cores @ 2.20 GHz)
- Linux 3.10.0
- gcc version 6.3.0
The benchmark varies the number of cells in the cell group and the mean number of events per cell. The events are generated randomly, with delivery times drawn uniformly from `[0, 1]` and target gids drawn uniformly from `{0, ..., ncells-1}`.

Below are benchmark results for 1024 events per cell as the number of cells varies. For one cell there is little benefit of NQ over 1Q, because in this case the only difference is avoiding the stable sort by gid. The NV method is faster by over 2x for one cell, and its speedup increases to 7.8x for 10k cells. Overall, maintaining separate queues for each cell is much faster for more than one cell per cell group, and the additional optimizations of the NV method are significant enough to justify the more complicated implementation.
**time in ms**
method | 1 cell | 10 cells | 100 cells | 1k cells | 10k cells |
---|---|---|---|---|---|
1Q | 0.0597 | 1.139 | 18.74 | 305.90 | 5978.3 |
nQ | 0.0526 | 0.641 | 6.71 | 83.50 | 1113.1 |
nV | 0.0249 | 0.446 | 4.77 | 52.71 | 769.7 |
**speedup relative to 1Q method**
method | 1 cell | 10 cells | 100 cells | 1k cells | 10k cells |
---|---|---|---|---|---|
1Q | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
nQ | 1.1 | 1.8 | 2.8 | 3.7 | 5.4 |
nV | 2.4 | 2.6 | 3.9 | 5.8 | 7.8 |
### `default_construct`

#### Motivation
The `padded_allocator` code allows us to use, for example, a `std::vector` for CPU-side aligned and padded storage (for SIMD) instead of the `memory::array` class. The latter, though, does not construct its elements, while a `std::vector` will use the allocator's `construct` method.

For scalar values that have trivial default constructors, a `std::allocator` construction with no arguments will value-initialize, which zero-initializes any non-class values. By supplying an alternate `construct` method, we can make an allocator that default-initializes instead, skipping any initialization for non-class values and providing semantics similar to those of `memory::array`.

Is it worth doing so?
#### Implementation
The microbenchmark uses an adaptor class that replaces the allocator's `construct` methods so that they default-initialize when no arguments are given (a sketch of such an adaptor follows). The benchmark creates a vector using the standard or the adapted allocator, fills it with the numbers from 1 to `n` and takes the sum.

For comparison, the benchmark also compares the two vectors when they are initialized by a pair of iterators that provide the same enumeration from 1 to `n`.
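A minimal sketch of such an adaptor is shown below, assuming only that the no-argument `construct` should default-initialize; the class name is illustrative and this is not the benchmark's exact code.

```c++
#include <memory>
#include <utility>
#include <vector>

// Allocator adaptor: a no-argument construct() default-initializes, leaving
// scalar elements uninitialized; construction with arguments is forwarded to
// the wrapped allocator as usual.
template <typename A>
struct default_init_adaptor: A {
    using A::A;
    default_init_adaptor() = default;
    default_init_adaptor(const A& a): A(a) {}

    template <typename U>
    void construct(U* p) {
        ::new (static_cast<void*>(p)) U;   // default-initialization: no zeroing
    }

    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        std::allocator_traits<A>::construct(static_cast<A&>(*this), p, std::forward<Args>(args)...);
    }

    template <typename U>
    struct rebind {
        using other = default_init_adaptor<
            typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };
};

// Usage: the elements below are default-initialized (left uninitialized)
// rather than zeroed when the vector is constructed with a size.
// std::vector<double, default_init_adaptor<std::allocator<double>>> v(1 << 20);
```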
#### Results
With this low computation-to-size ratio task, using the default-initializing adaptor gives a significant performance benefit. With the iterator-pair construction, however, where we would expect no performance difference, GCC (but not Clang) produces much slower code. Note that Clang produces considerably faster code overall.
Platform:
- Xeon E3-1220 v2 with base clock 3.1 GHz and max clock 3.5 GHz.
- Linux 4.9.75
- gcc version 7.3.1
- clang version 6.0.0
- optimization options: `-O3 -march=ivybridge`
**Create then fill and sum**

**GCC**
size | value-initialized | default-initialized |
---|---|---|
1 kiB | 403 ns | 331 ns |
4 kiB | 1 430 ns | 1 142 ns |
32 kiB | 12 377 ns | 8 982 ns |
256 kiB | 114 598 ns | 81 599 ns |
1024 kiB | 455 502 ns | 323 366 ns |
**Clang**
size | value-initialized | default-initialized |
---|---|---|
1 kiB | 228 ns | 147 ns |
4 kiB | 826 ns | 527 ns |
32 kiB | 10 425 ns | 6 823 ns |
256 kiB | 106 497 ns | 72 375 ns |
1024 kiB | 430 561 ns | 293 999 ns |
**Create directly from counting iterators and sum**

**GCC**
size | value-initialized | default-initialized |
---|---|---|
1 kiB | 335 ns | 775 ns |
4 kiB | 1 146 ns | 2 920 ns |
32 kiB | 8 954 ns | 23 197 ns |
256 kiB | 81 609 ns | 193 230 ns |
1024 kiB | 322 947 ns | 763 243 ns |