Commit 1c86fdb2 authored by Ben Cumming, committed by Sam Yates

Add micro-benchmark for event delivery setup (#359)

Add a micro-benchmark that illustrates scaling issues with event-setup on cell groups with many cells.

The benchmark also illustrates some alternative approaches that offer significant speedups over the current implementation.

Fixes #357.
parent d9f38b2a
@@ -53,6 +53,8 @@ flags = [
     '-I',
     'modcc',
     '-I',
+    'tests/ubench/google-benchmark/include',
+    '-I',
     '/cm/shared/apps/cuda/8.0.44/include',
     '-DARB_HAVE_GPU'
 ]
@@ -4,6 +4,7 @@ include(ExternalProject)
 set(bench_sources
     accumulate_functor_values.cpp
+    event_setup.cpp
 )

 set(bench_sources_cuda
@@ -63,7 +64,7 @@ foreach(bench_src ${bench_sources})
     add_executable("${bench_exe}" EXCLUDE_FROM_ALL "${bench_src}")
     add_dependencies("${bench_exe}" gbench)
     target_include_directories("${bench_exe}" PRIVATE "${gbench_install_dir}/include")
-    target_link_libraries("${bench_exe}" "${gbench_install_dir}/lib/libbenchmark.a")
+    target_link_libraries("${bench_exe}" LINK_PUBLIC "${gbench_install_dir}/lib/libbenchmark.a")
     list(APPEND bench_exe_list ${bench_exe})
 endforeach()
@@ -186,3 +186,80 @@ groups with at least 10k compartments in total.
 | 10k   | 0.94 | 1.09 | 2.42 | 11.1 |
 | 100k  | 0.98 | 1.59 | 2.36 | 11.4 |
 | 1000k | 1.13 | 1.63 | 2.36 | 11.4 |
---
### `event_setup`
#### Motivation
Post-synaptic events are generated by the communicator after it gathers the local spikes.
One set of events is generated for each cell group, in an unsorted `std::vector<postsynaptic_spike_event>`.
Each cell group must take this unsorted vector, store the events, and for each integration interval generate a list of events sorted first by target gid, then by delivery time.
As currently implemented, this step is a significant serialization bottleneck on the GPU back end, where a single thread must process many events before they are copied to the GPU.
This benchmark characterizes the behavior of the current implementation and tests some alternatives.
#### Implementations

Three implementations are considered:

1. Single queue (1Q) method (the current approach)
    1. All events to be delivered to a cell group are pushed into a single
       heap-based queue, ordered by delivery time.
    2. To build the list of events to deliver before `tfinal`, events are
       popped off the queue until the head of the queue is an event to be
       delivered at or after `tfinal`. These events are `push_back`ed onto
       a `std::vector`.
    3. The event vector is `std::stable_sort`ed on target gid.
2. Multi queue (NQ) method
    1. One queue is maintained for each cell in the cell group. The first
       phase pushes events into these smaller per-cell queues.
    2. The queues are visited one by one, and events before `tfinal` are
       `push_back`ed onto a single `std::vector`.

   With this approach the events are partitioned by target gid for free, and
   pushing onto and popping from much shorter queues should give a speedup.
3. Multi vector (NV) method
    1. A very similar approach to the NQ method, with a `std::vector` of
       events maintained for each cell instead of a priority queue.
    2. Events are `push_back`ed onto the vectors, which are then sorted and
       searched for the sub-range of events to be delivered in the next
       integration interval.

   This approach has the same asymptotic complexity as the NQ approach, but
   it is more "low-level": each lane is sorted once with `std::sort`, instead
   of performing an ad-hoc heap sort by popping events off a queue one at a
   time. A sketch of the NV bookkeeping is shown below.
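To make the NV bookkeeping concrete, here is a minimal, self-contained sketch. The simplified `event` type and the `lanes`/`staged` names are illustrative stand-ins; the benchmark source in `event_setup.cpp` at the end of this commit is the authoritative version, using arbor's `postsynaptic_spike_event`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <iterator>
#include <vector>

// Simplified stand-in for a post-synaptic event: only the fields needed to
// illustrate the bookkeeping (a real event also carries a weight and a
// cell-local target index).
struct event {
    unsigned gid;   // target cell
    float time;     // delivery time
};

// Comparator that lets std::lower_bound compare events against a plain time.
struct ev_lt_pred {
    bool operator()(float t, const event& ev) const { return t < ev.time; }
    bool operator()(const event& ev, float t) const { return ev.time < t; }
};

int main() {
    const unsigned ncells = 4;
    const float tfinal = 1.f;
    std::vector<event> input_events = {
        {2, 0.3f}, {0, 1.7f}, {3, 0.1f}, {2, 0.9f}, {0, 0.5f}, {1, 2.0f},
    };

    // 1. Scatter events into one unsorted lane per cell.
    std::vector<std::vector<event>> lanes(ncells);
    for (const auto& e: input_events) lanes[e.gid].push_back(e);

    // 2. Sort each lane by time, and count how many of its events fall
    //    before tfinal; accumulate the counts into a partition.
    std::vector<std::size_t> part(ncells+1, 0);
    for (unsigned i = 0; i < ncells; ++i) {
        auto& lane = lanes[i];
        std::sort(lane.begin(), lane.end(),
                  [](const event& l, const event& r) { return l.time < r.time; });
        auto end = std::lower_bound(lane.begin(), lane.end(), tfinal, ev_lt_pred());
        part[i+1] = part[i] + std::distance(lane.begin(), end);
    }

    // 3. Copy the deliverable prefix of each lane into one flat buffer that
    //    is partitioned by gid and time-sorted within each partition.
    std::vector<event> staged(part.back());
    for (unsigned i = 0; i < ncells; ++i) {
        std::copy(lanes[i].begin(), lanes[i].begin() + (part[i+1]-part[i]),
                  staged.begin() + part[i]);
    }

    for (const auto& e: staged) std::printf("gid %u t %.1f\n", e.gid, e.time);
}
```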
#### Results
Platform:
* Xeon(R) CPU E5-2650 v4 (Broadwell, 12 cores @ 2.20 GHz)
* Linux 3.10.0
* gcc version 6.3.0
The benchmark varies the number of cells in the cell group and the mean number of events per cell. Event delivery times are drawn uniformly from `t in [0, 1]` and targets uniformly from `gid in {0, ..., ncells-1}` (see `generate_inputs` in the source below).
Below are benchmark results for 1024 events per cell as the number of cells varies.

For one cell the NQ method offers little benefit over 1Q, because in that case the only difference is avoiding the stable sort by gid.
The NV method is 2.4× faster for one cell, and its speedup increases to 7.8× for 10k cells.
Overall, maintaining separate queues for each cell is much faster for more than one cell per cell group, and the additional optimizations of the NV method are significant enough to justify the more complicated implementation.
*time in ms*

| method | 1 cell | 10 cells | 100 cells | 1k cells | 10k cells |
|--------|--------|----------|-----------|----------|-----------|
| 1Q     | 0.0597 | 1.139    | 18.74     | 305.90   | 5978.3    |
| NQ     | 0.0526 | 0.641    | 6.71      | 83.50    | 1113.1    |
| NV     | 0.0249 | 0.446    | 4.77      | 52.71    | 769.7     |

*speedup relative to 1Q method*

| method | 1 cell | 10 cells | 100 cells | 1k cells | 10k cells |
|--------|--------|----------|-----------|----------|-----------|
| 1Q     | 1.0    | 1.0      | 1.0       | 1.0      | 1.0       |
| NQ     | 1.1    | 1.8      | 2.8       | 3.7      | 5.4       |
| NV     | 2.4    | 2.6      | 3.9       | 5.8      | 7.8       |
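A rough complexity argument is consistent with these numbers: with `N = ncells * ev_per_cell` events in total, the 1Q method pays `O(N log N)` for its heap operations plus another `O(N log N)` for the stable sort, whereas NQ and NV operate on per-cell lanes of `ev_per_cell` events, costing `O(N log ev_per_cell)` in total, with the partition by gid obtained for free. NV additionally replaces per-event heap rebalancing with a single contiguous `std::sort` per lane, which accounts for its better constant factors.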
// Compare methods for performing the "event-setup" step for mc cell groups.
// The key concern is how to take an unsorted set of events and, for each
// integration interval, generate a list of events sorted first by target
// gid and then by delivery time.
//
// TODO: We assume that the cells in a cell group are numbered contiguously,
// i.e. 0:ncells-1. The cells in an mc_cell_group are not typically numbered
// thus; instead a hash table is used to look up the cell-group-local index
// from the gid. A similar lookup should be added to these tests, to more
// accurately reflect the mc_cell_group implementation.
//
// TODO: The staged_events output is a vector of postsynaptic_spike_event, not
// a deliverable event.

#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

#include <event_queue.hpp>
#include <backends/event.hpp>

#include <benchmark/benchmark.h>
using namespace arb;

std::vector<postsynaptic_spike_event> generate_inputs(size_t ncells, size_t ev_per_cell) {
    std::vector<postsynaptic_spike_event> input_events;

    std::mt19937 gen;
    std::uniform_int_distribution<cell_gid_type>
        gid_dist(0u, ncells-1);
    std::uniform_real_distribution<float>
        time_dist(0.f, 1.f);

    input_events.reserve(ncells*ev_per_cell);
    for (std::size_t i=0; i<ncells*ev_per_cell; ++i) {
        postsynaptic_spike_event ev;
        auto gid = gid_dist(gen);
        auto t = time_dist(gen);
        ev.target = {cell_gid_type(gid), cell_lid_type(0)};
        ev.time = t;
        ev.weight = 0;
        input_events.push_back(ev);
    }

    return input_events;
}
void single_queue(benchmark::State& state) {
    using pev = postsynaptic_spike_event;

    const std::size_t ncells = state.range(0);
    const std::size_t ev_per_cell = state.range(1);

    // state
    std::vector<pev> input_events = generate_inputs(ncells, ev_per_cell);
    event_queue<pev> events;

    while (state.KeepRunning()) {
        // push events into a single queue
        for (const auto& e: input_events) {
            events.push(e);
        }

        // pop from the queue to form a single time-sorted vector
        std::vector<pev> staged_events;
        staged_events.reserve(events.size());
        while (auto e = events.pop_if_before(1.f)) {
            staged_events.push_back(*e);
        }

        // sort the staged events in order of target id
        std::stable_sort(
            staged_events.begin(), staged_events.end(),
            [](const pev& l, const pev& r) {return l.target.gid<r.target.gid;});

        // TODO: calculate the partition ranges. This overhead is not included
        // in the benchmark; the method is already so much slower that adding
        // it would not change the conclusions.

        // clobber contents of the queue for the next round of the benchmark
        events.clear();
        benchmark::ClobberMemory();
    }
}
void n_queue(benchmark::State& state) {
    using pev = postsynaptic_spike_event;

    const std::size_t ncells = state.range(0);
    const std::size_t ev_per_cell = state.range(1);
    auto input_events = generate_inputs(ncells, ev_per_cell);

    // state
    std::vector<event_queue<pev>> event_lanes(ncells);
    std::vector<size_t> part(ncells+1);

    while (state.KeepRunning()) {
        part[0] = 0;

        // push events into the queue corresponding to the target cell
        for (const auto& e: input_events) {
            event_lanes[e.target.gid].push(e);
        }

        // pop from each queue in turn to form a single sorted vector
        std::vector<pev> staged_events;
        staged_events.reserve(input_events.size());
        size_t i=0;
        for (auto& lane: event_lanes) {
            while (auto e = lane.pop_if_before(1.f)) {
                staged_events.push_back(*e);
            }
            part[++i] = staged_events.size();
        }

        // clobber lanes for the next round of the benchmark
        for (auto& lane: event_lanes) {
            lane.clear();
        }
        benchmark::ClobberMemory();
    }
}
void n_vector(benchmark::State& state) {
    using pev = postsynaptic_spike_event;

    const std::size_t ncells = state.range(0);
    const std::size_t ev_per_cell = state.range(1);
    auto input_events = generate_inputs(ncells, ev_per_cell);

    // state
    std::vector<std::vector<pev>> event_lanes(ncells);
    std::vector<size_t> part(ncells+1);
    std::vector<size_t> ext(ncells);

    // comparator that allows lower_bound to compare an event against a time
    struct ev_lt_pred {
        bool operator()(float t, const pev& ev) { return t<ev.time; }
        bool operator()(const pev& ev, float t) { return ev.time<t; }
    };

    // NOTE: this is a "full" implementation, that can handle the case where
    // input_events contains events that are to be delivered after the current
    // delivery interval. The event_lanes vectors keep undelivered events.
    while (state.KeepRunning()) {
        ext.clear();

        // push events into per-cell vectors (unsorted)
        for (const auto& e: input_events) {
            event_lanes[e.target.gid].push_back(e);
        }

        // sort each per-cell lane and record the number of sorted events
        // that are to be delivered in this interval
        for (auto& lane: event_lanes) {
            std::sort(lane.begin(), lane.end(),
                [](const pev& l, const pev& r) {return l.time<r.time;});
            ext.push_back(
                std::distance(
                    lane.begin(),
                    std::lower_bound(lane.begin(), lane.end(), 1.f, ev_lt_pred())));
        }

        // calculate the partition of the output buffer by target cell gid
        part[0] = 0;
        for (size_t i=0; i<ncells; ++i) {
            part[i+1] = part[i] + ext[i];
        }

        // copy the deliverable events into the flat output buffer
        std::vector<postsynaptic_spike_event> staged_events(part.back());
        auto b = staged_events.begin();
        for (size_t i=0; i<ncells; ++i) {
            auto bi = event_lanes[i].begin();
            std::copy(bi, bi+ext[i], b+part[i]);
        }

        // remove the delivered events from the event lanes
        auto i=0u;
        for (auto& lane: event_lanes) {
            auto b = lane.begin();
            lane.erase(b, b+ext[i++]);
        }

        // clobber the contents of the lanes for the next round of the benchmark
        for (auto& lane: event_lanes) {
            lane.clear();
        }
        benchmark::ClobberMemory();
    }
}
void run_custom_arguments(benchmark::internal::Benchmark* b) {
    for (auto ncells: {1, 10, 100, 1000, 10000}) {
        for (auto ev_per_cell: {128, 256, 512, 1024, 2048, 4096}) {
            b->Args({ncells, ev_per_cell});
        }
    }
}

//BENCHMARK(run_original)->Apply(run_custom_arguments);
BENCHMARK(single_queue)->Apply(run_custom_arguments);
BENCHMARK(n_queue)->Apply(run_custom_arguments);
BENCHMARK(n_vector)->Apply(run_custom_arguments);

BENCHMARK_MAIN();
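Since the benchmarks are registered through google benchmark's `BENCHMARK` macros, the resulting binary accepts the standard google benchmark command-line flags; for example, `--benchmark_filter=n_vector` runs only the NV variant. Per the CMake change above, the executable is built `EXCLUDE_FROM_ALL`, so it must be requested explicitly; its target name is derived from the source file, presumably `event_setup`.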