Commits · 97cef8e8fb878060df8437e6f59ff1a442149522 · arbor-sim / arbor

Nov 07, 2017
- add Juelich to the license copyright statement (#382) · 97cef8e8
  Ben Cumming authored 7 years ago
  
Nov 03, 2017

Optimise host-side event wrangling [8] (#369) · 0726685c

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Optimize host-side postsynaptic event queues:

* Have communicator store stage local postsynaptic events per cell instead of per cell group.
* Pass (by reference) the relevant subrange of vectors of per-cell events to the cell group `enqueue_events` method.
* Identify calls to cell group `enqueue_events` and `advance` with an integration epoch identification allowing the use of an efficient thread-safe data structure for the cell group event stores.
* Perform merge of new events and currently stored events not for delivery in current epoch in parallel across cells in `mc_cell_group::enqueue_events`.
* Store one event binner object per cell in `mc_cell_group` to avoid hash table lookup overheads in the `binning_kind::following` case, and to simplify the `event_binner` class.
* Add convenience time comparison functional object `event_time_less`.

Fixes #368.
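
As a rough illustration of the time comparison functor and the merge step described in this commit, here is a minimal sketch. The `event` struct and the `merge_events` helper are hypothetical stand-ins, not Arbor's actual definitions; only the name `event_time_less` comes from the commit message. Both inputs are assumed to be sorted by time.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical event type: only the delivery time matters for ordering.
struct event {
    double time;
    int target;
};

// "Less than" on delivery time, comparing events with events
// and events with bare time values.
struct event_time_less {
    bool operator()(const event& a, const event& b) const { return a.time < b.time; }
    bool operator()(const event& a, double t) const { return a.time < t; }
    bool operator()(double t, const event& b) const { return t < b.time; }
};

// Merge incoming events with the stored events that are due at or after
// t_from (i.e. not delivered in the current epoch).
std::vector<event> merge_events(const std::vector<event>& stored,
                                const std::vector<event>& incoming,
                                double t_from)
{
    // First stored event with time >= t_from.
    auto first = std::lower_bound(stored.begin(), stored.end(), t_from, event_time_less{});

    std::vector<event> out;
    out.reserve(static_cast<std::size_t>(stored.end() - first) + incoming.size());
    std::merge(first, stored.end(), incoming.begin(), incoming.end(),
               std::back_inserter(out), event_time_less{});
    return out;
}
```

Because each cell owns its own event vector in this sketch, a call such as `merge_events(stored, incoming, t_epoch_end)` could be issued independently for every cell, which is what makes a parallel loop over cells possible.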


Nov 02, 2017

Add mechanism parameter setting/new implementation. (#377) · b7623d13

Sam Yates authored 7 years ago

Fixes #350 

* Replace parameter_list with mechanism_spec.
* Add prototype for mechanism parameter schema checking.
* Allow mechanism weights to be set after construction.
* Combine range parameters on density mechanisms by linear contribution in CVs.
* Cable segment electrical parameters are now member variables.
* Publish mechanism parameter information through new method `mechanism::field_info`; note this will be replaced/improved in upcoming dynamic mechanism catalog work.
* Access mechanism parameter scalars and range data via `mechanism::field_view_ptr` and `mechanism::field_value_ptr` methods.
* Allow mechanism 'global' parameters to be set via a method of specializing mechanisms (and giving them corresponding aliases).
* Extend recipe interface to allow querying of per-cell-kind global information for use by cell group implementations.
* Add unit tests for above - note that linear density mechanism parameter test is tightly coupled with the FVM discretization scheme.


Nov 01, 2017
- Add support for NMODL files without INITIAL blocks [1] (#372) · 8fbe804a
  Ben Cumming authored 7 years ago and Sam Yates committed 7 years ago
```
Fixes #367.

* Add empty INITIAL block if NMODL file does not supply one.
```
Oct 02, 2017

Add micro-benchmark for event delivery setup (#359) · 1c86fdb2

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Add a micro-benchmark that illustrates scaling issues with event-setup on cell groups with many cells.

The benchmark also illustrates some alternative approaches that are significant optimizations over the current approach.

Fixes #357.


Sep 28, 2017

Rename NestMC references, names etc. to Arbor. (#363) · d9f38b2a

Sam Yates authored 7 years ago

* Use ARB_ and arb_ as variable prefixes in place of NMC_ and nmc_.
* Replace references to 'NestMC' and 'NEST MC' with references to Arbor.
* Use 'arbor' as the sim name in generated validation data.
* Reflow long-line paragraphs in `tests/ubench/README.md`.
* Change names of CUDA mechanism and CUDA kernel libraries to include arbor name.


Change nest::mc namespaces to arb (#362) · d9f99489
Sam Yates authored 7 years ago
```
Change nest::mc namespaces to arb
```

Sep 27, 2017
- Update readme with arbor web links (#361) · 8be7757c
  Ben Cumming authored 7 years ago and Sam Yates committed 7 years ago
```
Update README.md to use badge from updated Travis CI page, and updated references to the old nestmc-proto repository.
```
- Fix attribution for the Cephes library. (#360) · b2126475
  Vasileios Karakasis authored 7 years ago
```
* Fix attribution for the Cephes library.

* Small language fix
```
Sep 25, 2017

Finish Separable CUDA compilation (#356) · ddbece13

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Separate compilation for all CUDA code.

* Move all CUDA kernels to their own .cu files, together with C++ function wrappers.
* Compile all CUDA .cu files to a single static library.
* Merge gpu and multicore backend validation tests.
* Simplify and clean up cruft from CMakeLists.txt files.


Sep 21, 2017

Separable compilation of mechanism kernels on GPU (#353) · 3c283219

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Separable compilation of the CUDA kernels generated by modcc from NMODL files.

CMake scripts:
* Update the `build_modules()` helper function to cleanly handle calls to modcc that generate multiple output files.
* Add a new library target `gpu_mechanisms` for the separately compiled CUDA kernels and the implementation of their C wrappers.
* Reduce verbosity of compilation messages.

* Simplify mechanism C++ namespace use: move everything in nest::mc::mechanisms::gpu::_mechanism-name_ into `nest::mc::gpu`, and similarly for multicore mechanism implementations, ions.
* Remove template parameters for `value_type` and `size_type` from all of the mechanism implementations, and use `fvm_value_type` and `fvm_size_type` everywhere instead.

modcc changes:
* Modify `CUDAPrinter` to keep track of 3 text buffers, one each for "implementation", "interface" and "implementation interface":
* Write the CUDA implementation interface to `X_impl.hpp`, comprising the definition of the mechanism-specific `X_ParamPack` struct used to pass function arguments to the CUDA kernels.
* Write the CUDA kernels and C wrappers to `X_impl.cu`.
* Write the public C++ mechanism interface (with calls to implementation wrappers) to `X.hpp`.
* Modify modcc driver to support multiple generated output files.


Conditionally compile AVX2 transcendentals unit tests (#355) · 5dc77b6c
Vasileios Karakasis authored 7 years ago
```
Only compile AVX2 intrinsics unit tests if NMC_VECTORIZE_TARGET=AVX2 is defined.

Fixes #351.
```

Sep 20, 2017

Stand alone CUDA compilation for threshold_watcher in gpu backend (#345) · 180a7ace

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Refactor the threshold_watcher and stack data structures in the gpu backend so that they are amenable to separable compilation.

* Make `gpu::stack<T>` have a host-only interface that wraps a POD type `gpu::stack_base<T>`.
* Implement a `push_back(stack_base, value)` method in `backends/gpu/kernels/stack.hpp` that is visible only to device code.
* Move `test_thresholds` kernel to a .cu file, replacing template parameters with types provided by `backends/fvm_types.hpp`.
* Add a simple C function interface, callable from host side code, defined in `backends/gpu/threshold_common.hpp`.
* Simplify the `gpu::impl::padded_size` function (both to read and in terms of efficiency).
* Use `typeid` as the default for pretty-printing types in the memory back end.
* Update the `test_gpu_stack` unit test to support new gpu stack interface.
* Fix bug in the `test_spikes` unit test, which was not running the GPU back end in the cuda unit tests.


Remove intrinsics unit test that does not compile (#352) · f937973f
Ben Cumming authored 7 years ago and Sam Yates committed 7 years ago
```
The unit test for AVX2 intrinsics does not compile with gcc.
This is a quick fix to get master to compile, while the test is fixed.
```

AVX2 transcendentals intrinsics (#329) · a2393eea

Vasileios Karakasis authored 7 years ago

Provides the following transcendentals intrinsics:

* `nmc_mm256_exp_pd`
* `nmc_mm256_log_pd`
* `nmc_mm256_pow_pd`
* `nmc_mm256_frexp_pd`

The first three are the equivalent of the corresponding SVML intrinsics without the `nmc` prefix.

The last one is used by the `log` function, but I decided it's nice to have it public.

All results are tested against full precision standard library implementation and provide equal results (except for the `pow` case). `NaN`s and infinities are treated according to the standard.

Limitations:

* Subnormals are treated as zeros by `frexp` and as a result by `log`.
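
To illustrate why `frexp` is useful to `log`, the scalar sketch below uses the identity log(x) = log(m) + e·ln 2, where x = m·2^e and m lies in [0.5, 1). It uses the standard library rather than AVX2 intrinsics, and `log_via_frexp` is a hypothetical name; the call to `std::log(m)` merely stands in for whatever approximation a vectorized implementation would evaluate on the reduced argument.

```cpp
#include <cmath>

// Scalar illustration only: decompose x = m * 2^e with m in [0.5, 1),
// then apply log(x) = log(m) + e * ln(2).
double log_via_frexp(double x) {
    int e = 0;
    double m = std::frexp(x, &e);
    const double ln2 = 0.6931471805599453;
    // std::log(m) is a placeholder for an approximation on the reduced range.
    return std::log(m) + e * ln2;
}
```

The decomposition also shows where the subnormal limitation comes from: if the `frexp` step treats a subnormal input as zero, the `log` built on top of it inherits that behaviour.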


Batched sampling. (#347) · a5ce1d3e

Sam Yates authored 7 years ago

Collect samples in FVM lowered cells across an integration period, prior to delivery to sampler callbacks.

Fixes #283.

* Make `multi_event_stream` a generic data structure, based on the event type. These are now used for both `deliverable_event` objects and `sample_event` objects.
* Add generic accessors for event data, separating access to event metadata (cell index, time) from payload (weight, sampling info etc.).
* Make a flat, CUDA-friendly data structure describing the `multi_event_queue` marked event state, for passing to mechanisms and kernels. This also brings the underlying representations of the `multicore` and `gpu` versions of `multi_event_queue` closer.
* Implement batched sampling kernel for GPU.
* Additional utility function: `util::is_sorted_by`.
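
A utility like `util::is_sorted_by` can be pictured as `std::is_sorted` applied to a projected key. The sketch below is an assumption about its shape, not the actual Arbor implementation; the `sample_event` struct is hypothetical and the code relies on C++14 features (`std::less<>`, generic lambdas).

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical sketch: true if the sequence is ordered with respect to
// the key that `proj` extracts from each element.
template <typename Seq, typename Proj, typename Compare = std::less<>>
bool is_sorted_by(const Seq& seq, Proj proj, Compare cmp = Compare{}) {
    return std::is_sorted(std::begin(seq), std::end(seq),
        [&](const auto& a, const auto& b) { return cmp(proj(a), proj(b)); });
}

// Hypothetical event type used only for the example.
struct sample_event {
    double time;
    int probe;
};
```

For instance, `is_sorted_by(events, [](const sample_event& e) { return e.time; })` checks that a batch of events is ordered by time before it is bulk-loaded into an event stream.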

Overhead of implementation and sampling process is negligible for CPU, and with no samples less than 1% on GPU. Running the miniapp with 10'000 cells and sampling 1% of the somata at 0.01ms adds approximately 10% to simulation time.


Sep 11, 2017

add Travis-CI build status badge to readme (#346) · 02c4ad37

Ben Cumming authored 7 years ago

Status of the master branch is displayed at the top of the README.
The status is based on the result of the automated daily build.


Basic CI support with TravisCI (#340) · 137c5b5f

Ben Cumming authored 7 years ago

Add support for continuous integration with Travis CI.
This implements bare bones support that can be extended over time.

Travis CI test environments:

* All use gcc 5.
* Test the serial distributed back end with serial and cthread threading backends.
* Test mpi with cthread.
* The tbb test failed sporadically because of CMake problems, so it is disabled for now.

The test script:

* Builds the unit tests, global_communication tests and miniapp.
* Asserts that all unit and global_communication tests pass.
* Asserts that the miniapp runs successfully.
   * Does not test miniapp output for now.

There is plenty of scope for improving the tests.
A key improvement will be to check the output of the validation tests and the miniapp
against validated reference data.

There were some small fixes required to make the tests pass on Travis:

* `communication/mpi.hpp` now sets default size and rank values of 1 and 0 respectively to allow all unit tests to pass when built with MPI.
* The wrappers around MPI API calls use `const_cast` to support MPI implementations that are not "const aware".
* A missing header was added to `tests/unit/test_range` to make `std::unordered_multimap` available.


Sep 05, 2017

Move common primitive types across fvm implementations to common header (#344) · 0de63143

Sam Yates authored 7 years ago

* Add `backends/fvm_types.hpp` as a single location for shared types across multicore and gpu fvm implementations.
* Use the `fvm_value_type` and `fvm_size_type` defined in `fvm_types.hpp` for the corresponding class-local `value_type` and `size_type` definitions.


Sep 04, 2017

Move nvcc-only code from memory to backends::gpu (#342) · 40e2f523

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

* Move gpu-kernel code from memory to backends/gpu

A small step towards separate back end compilation for CUDA.
Move the following code to the gpu backend:
- the memory::fill* wrappers and fill kernel
- the managed_ptr type
  - only used in backends::gpu
  - has `__device__ __host__` members for dual host-device use.

* Update unit and validation tests


Sep 01, 2017

Update global comms tests for new sampling API. (#338) · 3bcb8f97

Sam Yates authored 7 years ago

* Avoid format security warning/error in `mpi_listener.hpp`.
* Update recipe classes in `test_communicator.cpp` and `test_domain_decomposition.cpp`.
* Align test names in `test_domain_decomposition.cpp` with those in unit tests.
* Fix `spike_gids` bug in `test_all2all` routine.
* Replace test assertions with `AssertionResult` returns in `test_ring` and `test_all2all`.
* Simplify no-extra-events check in `test_ring` and `test_all2all`.


New sampling API implementation. (#335) · 8739fd55

Sam Yates authored 7 years ago

Towards resolution of issue #283.

* Replace probe/sample infrastructure with new API as outlined in the `sampling_api.rst` documentation.
* Separate `cell` probe information from `cell` description object.
* Add `--list` option to `tsplot`, to summarize available time series for plotting together with their metadata.
* Add `--sample-dt` option to validation tests.
* Change validation time series comparison behaviour: linf distance metric now estimated by comparing reference data points against linearly interpolated simulation samples, rather than the other way around.
* Add utility class `any_ptr` which allows lightweight and type-checked access to a type-erased pointer.
* Replace `singleton_recipe` with two simple recipe base classes used for unit and validation tests, in `tests/simple_recipes.hpp`.
* Simplify RSS cell implementation.
* Add statistical functions for testing of probabilistic methods in `tests/unit/stats.hpp` (and .cpp):
   * simple summary stats class;
   * Kolmogorov–Smirnov one-sided statistic and CDF;
   * approximate Poisson CDF.
* Simplify and refactor miniapp `sample_trace` code.
* Add new utility class `handle_set` for managing a collection of integer handles. (Near-simplest implementation.)
* Relax const-ness constraints on `transform_iterator`.
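
The idea behind a type-checked, type-erased pointer such as `any_ptr` can be sketched as a `void*` paired with the `std::type_info` of the pointer type it was built from. The class below is a hypothetical, minimal illustration of that idea and is not the interface added in this commit.

```cpp
#include <type_traits>
#include <typeinfo>

// Hypothetical sketch: a non-owning, type-erased pointer that remembers
// the pointer type it was constructed from.
class any_ptr {
public:
    any_ptr() = default;

    template <typename T>
    explicit any_ptr(T* p): ptr_((void*)p), type_(&typeid(T*)) {}

    // Returns the stored pointer if T matches the original pointer type
    // exactly, and nullptr otherwise.
    template <typename T>
    T as() const {
        static_assert(std::is_pointer<T>::value, "as<T>() requires a pointer type");
        return (type_ && *type_ == typeid(T)) ? static_cast<T>(ptr_) : nullptr;
    }

private:
    void* ptr_ = nullptr;
    const std::type_info* type_ = nullptr;
};
```

With this sketch, `any_ptr p(&x)` for a `double x` yields `&x` from `p.as<double*>()` and `nullptr` from `p.as<int*>()`, so a sampler callback could check the type of the data it receives before using it.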


Aug 24, 2017

Basic Sphinx Documentation (#328) · 610fd857

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Adds support for building documentation with Sphinx from reStructuredText-formatted files in the `doc` subdirectory. Automatic building has been verified with ReadTheDocs.

* Add basic documentation to the `doc` path.
* Use a git submodule and associated CMake to pull in ReadTheDocs theme at configuration time.


Aug 23, 2017

Separate load balancer from domain decomposition description (#334) · 12852eb4

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

* Make `domain_decomposition` object into a flat description object.
* Put original decomposition algorithm into a new function `partitioned_load_balance` that returns a `domain_decomposition` object.


Aug 18, 2017

Better TBB CMake integration (#331) · 6dce9fa4

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

* Add support for CMake scripts provided by TBB.
* Update required cmake version to 3.0.

* hack to get linking to work on Cray PE

* improve comments and remove redundant include in CMakeLists

* firewall the tbb cmake files

* tbb threading back end to_string includes version number


Refactor domain decomposition for arbitrary gid distribution. (#326) · cba9d458

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Changes to `domain_decomposition`:
* `domain_decomposition` performs two-pass load balancing in its constructor:
   1. the first pass performs a global load balance;
   2. the second pass distributes cells locally between cpu and gpu cell_groups.

  The current logic for this is very simple and naive, and will be replaced with a load balancer which returns a lighter domain decomposition description in a follow up pull request.
* Provides a simple `group_description` type that contains gid, `cell_kind` and target backend information for `cell_group_factory`.

Changes to `communicator`:
* Constructor takes a `domain_decomposition` and recipe.
* The interface for adding connections and constructing connection table has been removed, as this is now performed within the constructor.
* Construction is more complicated, as connections are partitioned by source gid which requires multiple passes over the connection information in the recipe.
* `make_event_queues` updated: spikes and connections are now partitioned by source domain, and an optimization dynamically chooses to iterate over either connection or spike list, whichever is shorter.
* The `exchange` method now sorts `local_spikes` before global gather to facilitate the optimized spike/connection searching.
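
The "iterate over whichever list is shorter" optimization in `make_event_queues` can be sketched as follows. All types and the `make_events` function are hypothetical simplifications; the sketch assumes both spikes and connections are already sorted by source gid, which is what sorting `local_spikes` before the gather makes possible.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified types for illustration.
struct spike { int source; double time; };
struct connection { int source; int target; double delay; };
struct queued_event { int target; double time; };

// Compares spikes and connections against a bare source gid.
struct by_source {
    bool operator()(const spike& s, int gid) const { return s.source < gid; }
    bool operator()(int gid, const spike& s) const { return gid < s.source; }
    bool operator()(const connection& c, int gid) const { return c.source < gid; }
    bool operator()(int gid, const connection& c) const { return gid < c.source; }
};

// Both inputs must be sorted by source gid.
std::vector<queued_event> make_events(const std::vector<spike>& spikes,
                                      const std::vector<connection>& conns)
{
    std::vector<queued_event> out;
    if (conns.size() < spikes.size()) {
        // Fewer connections: walk connections, binary search the spikes.
        for (const auto& c: conns) {
            auto r = std::equal_range(spikes.begin(), spikes.end(), c.source, by_source{});
            for (auto it = r.first; it != r.second; ++it)
                out.push_back({c.target, it->time + c.delay});
        }
    }
    else {
        // Fewer (or equally many) spikes: walk spikes, binary search the connections.
        for (const auto& s: spikes) {
            auto r = std::equal_range(conns.begin(), conns.end(), s.source, by_source{});
            for (auto it = r.first; it != r.second; ++it)
                out.push_back({it->target, s.time + it->delay});
        }
    }
    return out;
}
```

Either branch produces the same set of events; the cost is roughly the length of the shorter list times a logarithmic search in the longer one.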

Changes to `miniapp`:
* Automatically use gpu if available and compiled with gpu support.
* Banner prints out useful information about number of cores, gpus and ranks.
* Remove -g cell group size flag.

Changes to `cell_group`:
* `cell_group` interface takes a list of gid values instead of a range.
* Updated internal `cell_group` logic to convert between gid and local indices: use a vector for local index to gid map, and a hash table for gid to local index in `cell_group` implementations that need this lookup.
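
The two-way mapping between gids and local indices described above might look like the following sketch, with a vector for local index to gid and a hash table for the reverse lookup. The class name `gid_index_map` and its interface are hypothetical.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using gid_type = unsigned;

// Hypothetical helper: the vector gives O(1) index -> gid lookup,
// the hash table gives expected O(1) gid -> index lookup.
class gid_index_map {
public:
    explicit gid_index_map(std::vector<gid_type> gids): gids_(std::move(gids)) {
        for (std::size_t i = 0; i < gids_.size(); ++i) {
            gid_to_index_[gids_[i]] = i;
        }
    }

    gid_type gid(std::size_t index) const { return gids_.at(index); }
    std::size_t index(gid_type gid) const { return gid_to_index_.at(gid); }

private:
    std::vector<gid_type> gids_;
    std::unordered_map<gid_type, std::size_t> gid_to_index_;
};
```

Cell group implementations that never need the gid to index direction can omit the hash table and keep only the vector.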

Changes to unit tests:
* Add tests for the domain decomposition.
* Add tests for the communicator that test ring and all2all networks.


Jul 29, 2017

Add curly braces to one test to silence gcc 7.1 warnings (#327) · 98537ff2

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

There was a spurious warning with gcc 7.1.x in `test_algorithms` as a result of using a Google Test macro (`EXPECT_EQ`) inside an `if` block without surrounding `{}`. This fix adds the missing braces to silence the warnings.
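
The pattern behind this kind of warning can be reproduced with any macro whose expansion contains an `if`/`else`. The `CHECK_EQ` macro below is a hypothetical stand-in and is not Google Test's actual expansion; it only shows why wrapping the macro call in braces removes the ambiguity.

```cpp
// Number of failed checks recorded by the hypothetical macro below.
int failures = 0;

// Hypothetical stand-in for an assertion macro that expands to if/else.
#define CHECK_EQ(a, b) if ((a) == (b)) ; else ++failures

void check_values(bool enabled, int x, int y) {
    // Without braces, `if (enabled) CHECK_EQ(x, y);` expands to a nested
    // if/else, and a compiler may warn that the `else` is ambiguous.
    if (enabled) {
        CHECK_EQ(x, y);
    }
}
```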


Jul 18, 2017

Feature/node description (#325) · ecb1b049

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Refactor hardware-querying code.

Comprises part of the work on issue #318.

* Create `src/hardware` path for hardware-querying code.
* Create `nest::mc::hw` namespace for hardware-querying code.
* Move memory, affinity and power code from util and threading code to the new directory and namespace.
* Add `nest::mc::threading::num_threads` function that attempts to determine the number of threads to use according to an environment variable, or by using the low level hardware querying if no environment variable is set.
* Add hardware query for counting the number of available GPUs.
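
The thread-count logic described above (environment variable first, hardware query as fallback) could be sketched as below. The environment variable name `EXAMPLE_NUM_THREADS` and the helper `parse_thread_count` are assumptions made for illustration, and `std::thread::hardware_concurrency` stands in for the low level hardware query.

```cpp
#include <cstdlib>
#include <thread>

// Parse a positive thread count; returns 0 if the text is missing or invalid.
unsigned parse_thread_count(const char* text) {
    if (!text) return 0;
    char* end = nullptr;
    long n = std::strtol(text, &end, 10);
    if (end == text || *end != '\0' || n <= 0) return 0;
    return static_cast<unsigned>(n);
}

// Hypothetical sketch: prefer the environment variable, otherwise ask the
// hardware. The variable name is an assumption, not the one Arbor uses.
unsigned num_threads(const char* env_name = "EXAMPLE_NUM_THREADS") {
    if (unsigned n = parse_thread_count(std::getenv(env_name))) return n;
    unsigned hw = std::thread::hardware_concurrency();
    return hw ? hw : 1; // hardware_concurrency() may return 0 if unknown
}
```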


Jul 07, 2017

306 spike from file (#314) · 90b7966a

Wouter Klijn authored 7 years ago

Created a new cell kind that injects a user-supplied stream of spikes, along with a corresponding cell_group implementation called dss_cell_group.

The miniapp is updated to test this functionality
  * An ostream spike time parser
  * A new command line argument -I with a Path.


Jul 05, 2017
- Remove deprecated model::add_artificial_spike interface (#319) · dc4f9283
  Ben Cumming authored 7 years ago and Sam Yates committed 7 years ago
  
Jul 04, 2017

Improve reduce by key GPU performance. (#301) · 1f188dcd

Ben Cumming authored 7 years ago and

Sam Yates committed 7 years ago

Optimized reduce by key used by the GPU back end when accumulating synapse current contributions to compartment currents. This leads to significant speedup in the miniapp for cells with few compartments and many synapses.

* Implement `gpu::reduce_by_key` device function that uses warp intrinsics to perform reduction between threads in a warp before using a global atomic update to store the result.
* Add unit tests for `reduce_by_key` functionality.
* Add micro benchmarks that compare against using CUDA atomics.
* Modify `CudaPrinter` modcc class to emit `reduce_by_key` in place of `cudaAtomicAdd` functions.
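
The semantics of a reduce by key can be shown with a serial, host-side analogue: contributions that share a key (for example a compartment index) are summed first, and the output is updated once per run of equal keys instead of once per contribution. This sketch is plain C++ and does not use warp intrinsics or atomics; the signature is hypothetical and the keys are assumed to be sorted so that equal keys are contiguous.

```cpp
#include <cstddef>
#include <vector>

// Serial illustration of reduce-by-key: `key` must be sorted so that equal
// keys are adjacent, and every key must be a valid index into `out`.
void reduce_by_key(const std::vector<double>& contribution,
                   const std::vector<int>& key,
                   std::vector<double>& out)
{
    std::size_t i = 0;
    while (i < key.size()) {
        const int k = key[i];
        double sum = 0;
        // Accumulate the whole run of entries that share key k.
        for (; i < key.size() && key[i] == k; ++i) {
            sum += contribution[i];
        }
        // One update per run; on a GPU this is where a single atomic
        // update would replace one atomic per contribution.
        out[static_cast<std::size_t>(k)] += sum;
    }
}
```

This is why the gain is largest for cells with few compartments and many synapses: long runs of equal keys collapse into a single update.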

Some improvements to meter reporting:
* Shorten names of metering regions in miniapp to make them easier to grep.
* JSON is no longer used as an intermediate data type when gathering distributed meters into a single report, instead conversion to JSON is performed just before writing to file.
* Add a print function for summarizing meter results to a stream.


Jul 03, 2017

Fix bug with comparison operator for spikes (#316) · 1a58e003

Ben Cumming authored 7 years ago

The less than operator for spikes was not in the nest::mc namespace,
so it was not being picked up by STL algorithms and containers.
This patch makes it a friend operator of `nest::mc::basic_spike`,
and adds some unit tests to verify that STL algorithms can use
containers of spikes.

fixes #315.
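
The lookup problem can be illustrated with a small sketch. An `operator<` declared as a friend inside the class is found by argument-dependent lookup when `std::sort` compares elements, whereas an operator in an unrelated namespace is not. The `example` namespace, the fields, and the ordering by time are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <vector>

namespace example {

// Hypothetical spike type; ordering by time is an assumption.
template <typename Source>
struct basic_spike {
    Source source;
    double time;

    // Defined as a friend, so it lives in namespace `example` and is
    // found by argument-dependent lookup from inside STL algorithms.
    friend bool operator<(const basic_spike& a, const basic_spike& b) {
        return a.time < b.time;
    }
};

template <typename Source>
std::vector<basic_spike<Source>> sorted_spikes(std::vector<basic_spike<Source>> v) {
    std::sort(v.begin(), v.end()); // uses the friend operator<
    return v;
}

} // namespace example
```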


Jun 28, 2017

Fix improper integration finished test. (#308) · dc83ff16

Sam Yates authored 7 years ago

Fixes #307.

* Simplify integration complete test logic in `fvm_multicell.hpp`.
* Re-include `multi_event_stream` pretty-print output for debugging.


Jun 23, 2017

Issue/241 regular spiking source cell (#287) · 70d1a1b4

Wouter Klijn authored 7 years ago

Replace the 20 artificial spikes we inserted until now with spikes generated with a frequency spiking cell.
This is a cell that spikes regularly for a set period.
The connections outgoing from this cell mirror all the outgoing connections of cells with gid % 20 == 0.
This means that 5% extra connections have been added to the model to implement the equivalent behaviour with an 'implemented' neuron.

To implement this behaviour a new cell_group type has been introduced.
The fs_cell is extremely small and is not implemented on the backends.

Includes:
Unit test for the rss_cell functionality


Jun 20, 2017
- Remove improper _POSIX_C_SOURCE check. (#300) · 1590585a
  Sam Yates authored 7 years ago
```
Remove improper _POSIX_C_SOURCE check.
Fixes #299.
```
Jun 19, 2017

Implement device-side event delivery. · 9319b302

Sam Yates authored 7 years ago

Finalizes #184 and includes fix for #285. This PR constitutes a brutal rebase of the `devel/async-integration` branch onto current master: git patch, baby.

Implements asynchronous integration of cells within an `fvm_multicell` instance together with the code to implement efficient device-side event delivery.

Summarized patch notes from the development branch:

* Add four new views in mechanism state:
   * `vec_ci`: cv index to cell index (w.r.t. one lowered multicell instance) map.
   * `vec_t`: current or integration starting time, indexed by cell.
   * `vec_t_to`: integration stopping time, indexed by cell.
   * `vec_dt`: pre-computed delta between `vec_t` and `vec_t_to`, to address performance regression on multicore backend.
* Extend notion of indexed view in `modcc` to encompass `CellIndexedVariable` variables, which have a per-cell value.
* Add/change 'built-in' variables `t`, `t_to`, and `dt` for NMODL.
* Additional utility functions:
   * `util::minmax_value`: return one-pass reduction for minimum and maximum.
   * `util::fill`: range-compatible wrapper for `std::fill`.
   * `util::empty` now works across arbitrary sequences (defaults to calling `empty()` method if it exists).
   * Extends C++17-like functionality for path testing and manipulation (c.f. PR #233).
* Use a cached copy of the per-cell time vector for queries in `fvm_multicell`.
* Modifies back-end matrix assembly, threshold detection and stimulus implementations to be asynchronous-friendly.
* Rename some back-end mechanism variables for clarity, now that there are multiple sorts of cell index vectors present. `cell_index` for example has been renamed to `cell_cv_divs` where it acts as a partition division of cv indices (length 1+#cells), and `cell_to_cv` where it acts as a map from cell to index of first cv for that cell (length #cells).
* Move common event-delivery types (`target_handle`, `deliverable_event`) to `src/backends/event.hpp`.
* Add `multi_event_stream` data structure for `gpu` and `multicore` backends, which handles multiple streams of events, bulk loaded but individually popped.
* Add unit tests for `multi_event_stream`.
* Document event delivery process and abstraction in `src/backends/event_delivery.md`.
* Virtualize `mechanism::deliver_events`.
* Perform event delivery in `fvm_multicell::step_integration()`, using the backend-provided interfaces.
* Add zero dt check/support to matrix state implementations.
* Add zero dt unit tests for multicore and gpu back-end matrices.
* Add debugging helper `util::sepval` for printing/tracing container values.
* Add trace csv output option to miniapp.
* Add GPU kernel and unit test for end-of-integration time step test. By default, keep using the copy-to-host-and-test method, as it is faster for cell counts up to circa 10k; investigate adaptive/threshold solutions in the future.
* Explicitly compute and store per-cell and per-compartment `dt` from integration time bounds.
* Determine lower bound on number of integration steps per interval in order to avoid explicit checking of minimum cell times each step.
* Avoid any time value checking for samplers in the integration loop if none of them could be triggered in the interval.
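
The relationship between the four mechanism state views listed in these notes (`vec_ci`, `vec_t`, `vec_t_to`, `vec_dt`) can be sketched with plain vectors. The struct and function names are hypothetical, and the sketch assumes `vec_dt` is stored per cell; a per-compartment `dt` is then obtained through the `vec_ci` map.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the four views, using plain vectors.
struct time_views {
    std::vector<int> vec_ci;      // CV index -> cell index
    std::vector<double> vec_t;    // integration start time, per cell
    std::vector<double> vec_t_to; // integration stop time, per cell
    std::vector<double> vec_dt;   // vec_t_to - vec_t, per cell (assumed)
};

// Pre-compute dt once per step so kernels do not repeat the subtraction.
void update_dt(time_views& s) {
    s.vec_dt.resize(s.vec_t.size());
    for (std::size_t i = 0; i < s.vec_t.size(); ++i) {
        s.vec_dt[i] = s.vec_t_to[i] - s.vec_t[i];
    }
}

// Time step for a given CV: look up its cell, then that cell's dt.
double dt_for_cv(const time_views& s, std::size_t cv) {
    return s.vec_dt[static_cast<std::size_t>(s.vec_ci[cv])];
}
```

Because each cell carries its own start and stop time, cells within one lowered multicell instance can be at different points in time, which is the basis of the asynchronous integration described in this commit.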


Jun 15, 2017

AVX512 CMake target (#288) · 153aeaee

Vasileios Karakasis authored 7 years ago

Adds a new AVX512 target for processors supporting only the core AVX512 functionality, which currently means SkyLake Xeon processors.


Jun 14, 2017

modcc: AVX2 backend for mechanisms (#282) · 02d41881

Vasileios Karakasis authored 7 years ago

Add AVX2 intrinsics back end for Haswell and Broadwell architectures.

We're still 3.5% and 5% slower than the icc `#pragma` version on Haswell and Broadwell,
respectively.


May 24, 2017

Add required thread synchronization to matrix kernel. (#280) · bd1e56a5

Sam Yates authored 7 years ago

There is a potential data race in the `assemble_matrix_interleaved` kernel, where threads in a different warp can overwrite the `buffer_v` and `buffer_i` values before they are used to update the `d` and `rhs` vectors.

This race has been exercised in the asynchronous event delivery branch.

* Add `__syncthreads()` to assemble matrix interleaved kernel after `d` and `rhs` update.


May 19, 2017

Fix consistency issue of the SIMD i/f of modcc (#278) · 56fd0532

Vasileios Karakasis authored 7 years ago

The `emit_gather()` function emitted the "wrong" instruction in terms of its
arguments but the instruction actually generated was correct, because
the `simd_printer` was passing the arguments to `emit_gather()` in a
different order, which was though the correct order for the finally emitted
instruction. Complicated? This commit cleans this up.
