Commit 1c86fdb2 authored by Ben Cumming, committed by Sam Yates

Add micro-benchmark for event delivery setup (#359)

Add a micro-benchmark that illustrates scaling issues with event-setup on cell groups with many cells.

The benchmark also illustrates some alternative approaches that offer significant speedups over the current implementation.

Fixes #357.
parent d9f38b2a
@@ -53,6 +53,8 @@ flags = [
     '-I',
     'modcc',
     '-I',
+    'tests/ubench/google-benchmark/include',
+    '-I',
     '/cm/shared/apps/cuda/8.0.44/include',
     '-DARB_HAVE_GPU'
 ]
@@ -4,6 +4,7 @@ include(ExternalProject)
 set(bench_sources
     accumulate_functor_values.cpp
+    event_setup.cpp
 )

 set(bench_sources_cuda
@@ -63,7 +64,7 @@ foreach(bench_src ${bench_sources})
     add_executable("${bench_exe}" EXCLUDE_FROM_ALL "${bench_src}")
     add_dependencies("${bench_exe}" gbench)
     target_include_directories("${bench_exe}" PRIVATE "${gbench_install_dir}/include")
-    target_link_libraries("${bench_exe}" "${gbench_install_dir}/lib/libbenchmark.a")
+    target_link_libraries("${bench_exe}" LINK_PUBLIC "${gbench_install_dir}/lib/libbenchmark.a")
     list(APPEND bench_exe_list ${bench_exe})
 endforeach()
@@ -186,3 +186,80 @@ groups with at least 10k compartments in total.
 | 10k   | 0.94 | 1.09 | 2.42 | 11.1 |
 | 100k  | 0.98 | 1.59 | 2.36 | 11.4 |
 | 1000k | 1.13 | 1.63 | 2.36 | 11.4 |
---
### `event_setup`
#### Motivation
Post-synaptic events are generated by the communicator after it gathers the local spikes.
One set of events is generated for each cell group, in an unsorted `std::vector<postsynaptic_spike_event>`.
Each cell group must take this unsorted vector, store the events, and for each integration interval generate a list of events sorted first by target gid, then by delivery time.
As currently implemented, this step is a significant serialization bottleneck on the GPU back end, where a single thread must process many events before they are copied to the GPU.
This benchmark characterizes the behavior of the current implementation and tests some alternatives.
#### Implementations

Three implementations are considered:

1. Single queue (1Q) method (the current approach)
    1. All events to be delivered to a cell group are pushed into a single
       heap-based queue, ordered by delivery time.
    2. To build the list of events to deliver before `tfinal`, events are
       popped off the queue until the head of the queue is an event to be
       delivered at or after `tfinal`. These events are `push_back`ed onto
       a `std::vector`.
    3. The event vector is `std::stable_sort`ed on target gid.
2. Multi queue (NQ) method
    1. One queue is maintained for each cell in the cell group. The first
       phase pushes events into these smaller per-cell queues.
    2. The queues are visited one by one, and events before `tfinal` are
       `push_back`ed onto a single `std::vector`.

   With this approach the events are partitioned by target gid for free, and
   pushing onto and popping from much shorter queues should give a speedup.
3. Multi vector (NV) method
    1. A very similar approach to the NQ method, with a `std::vector` of
       events maintained for each cell instead of a priority queue.
    2. Events are `push_back`ed onto the vectors, which are then sorted and
       searched for the sub-range of events to be delivered in the next
       integration interval.

   This approach has the same asymptotic complexity as the NQ approach, but
   it is more "low-level": each lane is sorted once with `std::sort`, instead
   of performing an ad-hoc heap sort by popping events off a queue one at a
   time. A sketch of the NV bookkeeping is shown below.
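To make the NV bookkeeping concrete, here is a minimal, self-contained sketch. The simplified `event` type and the `lanes`/`staged` names are illustrative stand-ins; the benchmark source in `event_setup.cpp` at the end of this commit is the authoritative version, using arbor's `postsynaptic_spike_event`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <iterator>
#include <vector>

// Simplified stand-in for a post-synaptic event: only the fields needed to
// illustrate the bookkeeping (a real event also carries a weight and a
// cell-local target index).
struct event {
    unsigned gid;   // target cell
    float time;     // delivery time
};

// Comparator that lets std::lower_bound compare events against a plain time.
struct ev_lt_pred {
    bool operator()(float t, const event& ev) const { return t < ev.time; }
    bool operator()(const event& ev, float t) const { return ev.time < t; }
};

int main() {
    const unsigned ncells = 4;
    const float tfinal = 1.f;
    std::vector<event> input_events = {
        {2, 0.3f}, {0, 1.7f}, {3, 0.1f}, {2, 0.9f}, {0, 0.5f}, {1, 2.0f},
    };

    // 1. Scatter events into one unsorted lane per cell.
    std::vector<std::vector<event>> lanes(ncells);
    for (const auto& e: input_events) lanes[e.gid].push_back(e);

    // 2. Sort each lane by time, and count how many of its events fall
    //    before tfinal; accumulate the counts into a partition.
    std::vector<std::size_t> part(ncells+1, 0);
    for (unsigned i = 0; i < ncells; ++i) {
        auto& lane = lanes[i];
        std::sort(lane.begin(), lane.end(),
                  [](const event& l, const event& r) { return l.time < r.time; });
        auto end = std::lower_bound(lane.begin(), lane.end(), tfinal, ev_lt_pred());
        part[i+1] = part[i] + std::distance(lane.begin(), end);
    }

    // 3. Copy the deliverable prefix of each lane into one flat buffer that
    //    is partitioned by gid and time-sorted within each partition.
    std::vector<event> staged(part.back());
    for (unsigned i = 0; i < ncells; ++i) {
        std::copy(lanes[i].begin(), lanes[i].begin() + (part[i+1]-part[i]),
                  staged.begin() + part[i]);
    }

    for (const auto& e: staged) std::printf("gid %u t %.1f\n", e.gid, e.time);
}
```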
#### Results
Platform:
* Xeon(R) CPU E5-2650 v4 (Broadwell, 12 cores @ 2.20 GHz)
* Linux 3.10.0
* gcc version 6.3.0
The benchmark varies the number of cells in the cell group and the mean number of events per cell. Event delivery times are drawn uniformly from `t in [0, 1]` and targets uniformly from `gid in {0, ..., ncells-1}` (see `generate_inputs` in the source below).
Below are benchmark results for 1024 events per cell as the number of cells varies.

For one cell the NQ method offers little benefit over 1Q, because in that case the only difference is avoiding the stable sort by gid.
The NV method is 2.4× faster for one cell, and its speedup increases to 7.8× for 10k cells.
Overall, maintaining separate queues for each cell is much faster for more than one cell per cell group, and the additional optimizations of the NV method are significant enough to justify the more complicated implementation.
*time in ms*

| method | 1 cell | 10 cells | 100 cells | 1k cells | 10k cells |
|--------|--------|----------|-----------|----------|-----------|
| 1Q     | 0.0597 | 1.139    | 18.74     | 305.90   | 5978.3    |
| NQ     | 0.0526 | 0.641    | 6.71      | 83.50    | 1113.1    |
| NV     | 0.0249 | 0.446    | 4.77      | 52.71    | 769.7     |

*speedup relative to 1Q method*

| method | 1 cell | 10 cells | 100 cells | 1k cells | 10k cells |
|--------|--------|----------|-----------|----------|-----------|
| 1Q     | 1.0    | 1.0      | 1.0       | 1.0      | 1.0       |
| NQ     | 1.1    | 1.8      | 2.8       | 3.7      | 5.4       |
| NV     | 2.4    | 2.6      | 3.9       | 5.8      | 7.8       |
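A rough complexity argument is consistent with these numbers: with `N = ncells * ev_per_cell` events in total, the 1Q method pays `O(N log N)` for its heap operations plus another `O(N log N)` for the stable sort, whereas NQ and NV operate on per-cell lanes of `ev_per_cell` events, costing `O(N log ev_per_cell)` in total, with the partition by gid obtained for free. NV additionally replaces per-event heap rebalancing with a single contiguous `std::sort` per lane, which accounts for its better constant factors.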
// Compare methods for performing the "event-setup" step for mc cell groups.
// The key concern is how to take an unsorted set of events and, for each
// integration interval, generate a list of events sorted first by target
// gid and then by delivery time.
//
// TODO: We assume that the cells in a cell group are numbered contiguously,
// i.e. 0:ncells-1. The cells in an mc_cell_group are not typically numbered
// thus; instead a hash table is used to look up the cell-group-local index
// from the gid. A similar lookup should be added to these tests, to more
// accurately reflect the mc_cell_group implementation.
//
// TODO: The staged_events output is a vector of postsynaptic_spike_event, not
// a deliverable event.

#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

#include <event_queue.hpp>
#include <backends/event.hpp>

#include <benchmark/benchmark.h>
using namespace arb;

std::vector<postsynaptic_spike_event> generate_inputs(size_t ncells, size_t ev_per_cell) {
    std::vector<postsynaptic_spike_event> input_events;

    std::mt19937 gen;
    std::uniform_int_distribution<cell_gid_type>
        gid_dist(0u, ncells-1);
    std::uniform_real_distribution<float>
        time_dist(0.f, 1.f);

    input_events.reserve(ncells*ev_per_cell);
    for (std::size_t i=0; i<ncells*ev_per_cell; ++i) {
        postsynaptic_spike_event ev;
        auto gid = gid_dist(gen);
        auto t = time_dist(gen);
        ev.target = {cell_gid_type(gid), cell_lid_type(0)};
        ev.time = t;
        ev.weight = 0;
        input_events.push_back(ev);
    }

    return input_events;
}
void single_queue(benchmark::State& state) {
    using pev = postsynaptic_spike_event;

    const std::size_t ncells = state.range(0);
    const std::size_t ev_per_cell = state.range(1);

    // state
    std::vector<pev> input_events = generate_inputs(ncells, ev_per_cell);
    event_queue<pev> events;

    while (state.KeepRunning()) {
        // push events into a single queue
        for (const auto& e: input_events) {
            events.push(e);
        }

        // pop from the queue to form a single time-sorted vector
        std::vector<pev> staged_events;
        staged_events.reserve(events.size());
        while (auto e = events.pop_if_before(1.f)) {
            staged_events.push_back(*e);
        }

        // sort the staged events in order of target id
        std::stable_sort(
            staged_events.begin(), staged_events.end(),
            [](const pev& l, const pev& r) {return l.target.gid<r.target.gid;});

        // TODO: calculate the partition ranges. This overhead is not included
        // in the benchmark; the method is already so much slower that adding
        // it would not change the conclusions.

        // clobber contents of the queue for the next round of the benchmark
        events.clear();
        benchmark::ClobberMemory();
    }
}
void n_queue(benchmark::State& state) {
    using pev = postsynaptic_spike_event;

    const std::size_t ncells = state.range(0);
    const std::size_t ev_per_cell = state.range(1);
    auto input_events = generate_inputs(ncells, ev_per_cell);

    // state
    std::vector<event_queue<pev>> event_lanes(ncells);
    std::vector<size_t> part(ncells+1);

    while (state.KeepRunning()) {
        part[0] = 0;

        // push events into the queue corresponding to the target cell
        for (const auto& e: input_events) {
            event_lanes[e.target.gid].push(e);
        }

        // pop from each queue in turn to form a single sorted vector
        std::vector<pev> staged_events;
        staged_events.reserve(input_events.size());
        size_t i=0;
        for (auto& lane: event_lanes) {
            while (auto e = lane.pop_if_before(1.f)) {
                staged_events.push_back(*e);
            }
            part[++i] = staged_events.size();
        }

        // clobber lanes for the next round of the benchmark
        for (auto& lane: event_lanes) {
            lane.clear();
        }
        benchmark::ClobberMemory();
    }
}
void n_vector(benchmark::State& state) {
    using pev = postsynaptic_spike_event;

    const std::size_t ncells = state.range(0);
    const std::size_t ev_per_cell = state.range(1);
    auto input_events = generate_inputs(ncells, ev_per_cell);

    // state
    std::vector<std::vector<pev>> event_lanes(ncells);
    std::vector<size_t> part(ncells+1);
    std::vector<size_t> ext(ncells);

    // comparator that allows lower_bound to compare an event against a time
    struct ev_lt_pred {
        bool operator()(float t, const pev& ev) { return t<ev.time; }
        bool operator()(const pev& ev, float t) { return ev.time<t; }
    };

    // NOTE: this is a "full" implementation, that can handle the case where
    // input_events contains events that are to be delivered after the current
    // delivery interval. The event_lanes vectors keep undelivered events.
    while (state.KeepRunning()) {
        ext.clear();

        // push events into per-cell vectors (unsorted)
        for (const auto& e: input_events) {
            event_lanes[e.target.gid].push_back(e);
        }

        // sort each per-cell lane and record the number of sorted events
        // that are to be delivered in this interval
        for (auto& lane: event_lanes) {
            std::sort(lane.begin(), lane.end(),
                [](const pev& l, const pev& r) {return l.time<r.time;});
            ext.push_back(
                std::distance(
                    lane.begin(),
                    std::lower_bound(lane.begin(), lane.end(), 1.f, ev_lt_pred())));
        }

        // calculate the partition of the output buffer by target cell gid
        part[0] = 0;
        for (size_t i=0; i<ncells; ++i) {
            part[i+1] = part[i] + ext[i];
        }

        // copy the deliverable events into the flat output buffer
        std::vector<postsynaptic_spike_event> staged_events(part.back());
        auto b = staged_events.begin();
        for (size_t i=0; i<ncells; ++i) {
            auto bi = event_lanes[i].begin();
            std::copy(bi, bi+ext[i], b+part[i]);
        }

        // remove the delivered events from the event lanes
        auto i=0u;
        for (auto& lane: event_lanes) {
            auto b = lane.begin();
            lane.erase(b, b+ext[i++]);
        }

        // clobber the contents of the lanes for the next round of the benchmark
        for (auto& lane: event_lanes) {
            lane.clear();
        }
        benchmark::ClobberMemory();
    }
}
void run_custom_arguments(benchmark::internal::Benchmark* b) {
    for (auto ncells: {1, 10, 100, 1000, 10000}) {
        for (auto ev_per_cell: {128, 256, 512, 1024, 2048, 4096}) {
            b->Args({ncells, ev_per_cell});
        }
    }
}

//BENCHMARK(run_original)->Apply(run_custom_arguments);
BENCHMARK(single_queue)->Apply(run_custom_arguments);
BENCHMARK(n_queue)->Apply(run_custom_arguments);
BENCHMARK(n_vector)->Apply(run_custom_arguments);

BENCHMARK_MAIN();
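Since the benchmarks are registered through google benchmark's `BENCHMARK` macros, the resulting binary accepts the standard google benchmark command-line flags; for example, `--benchmark_filter=n_vector` runs only the NV variant. Per the CMake change above, the executable is built `EXCLUDE_FROM_ALL`, so it must be requested explicitly; its target name is derived from the source file, presumably `event_setup`.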