Commit 1c89fbbd authored by Benjamin Cumming, committed by GitHub

Hardware API documentation (#707)

Update Hardware API documentation

* split the domain decomposition and hardware API docs into separate pages
* update hardware API to reflect new *libarbor* and *libarborenv*
* add basic documentation for `optional`, `any` and `unique_any` types.
parent 49d87aba
...@@ -115,3 +115,40 @@ Probes
.. cpp:member:: util::any address

Cell-type specific location info, specific to cell kind of ``id.gid``.
Utility Wrappers and Containers
--------------------------------
.. cpp:namespace:: arb::util
.. cpp:class:: template <typename T> optional
A wrapper around a contained value of type :cpp:type:`T` that may or may not be set.
A faithful copy of the C++17 ``std::optional`` type.
See the online C++ standard documentation
`<https://en.cppreference.com/w/cpp/utility/optional>`_
for more information.
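For example, a function that may fail to produce a result can return an
:cpp:class:`optional` value. The following is a minimal sketch: the header path
``arbor/util/optional.hpp`` is an assumption, and only ``std::optional``-like
behaviour is relied on.

.. container:: example-code

.. code-block:: cpp

#include <iostream>
#include <vector>

#include <arbor/util/optional.hpp> // assumed public header for util::optional

// Return the index of the first negative entry in v, if there is one.
arb::util::optional<unsigned> first_negative(const std::vector<double>& v) {
    for (unsigned i = 0; i < v.size(); ++i) {
        if (v[i] < 0.) return i;
    }
    return {}; // empty optional: no value
}

// ...

if (auto n = first_negative({1., 2., -3.})) {
    std::cout << "first negative entry at index " << n.value() << "\n";
}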
.. cpp:class:: any
A container for a single value of any type that is copy constructible.
Used in the Arbor API where the type of a value passed to or from the API
is decided at run time.
A faithful copy of the C++17 ``std::any`` type.
See the online C++ standard documentation
`<https://en.cppreference.com/w/cpp/utility/any>`_
for more information.
The :cpp:any:`arb::util` namespace also provides implementations of the
:cpp:any:`any_cast`, :cpp:any:`make_any` and :cpp:any:`bad_any_cast`
helper functions and types from C++17.
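For example, a minimal sketch of storing and retrieving a value; the header path
``arbor/util/any.hpp`` is an assumption, and the semantics shown are those of
``std::any``, of which :cpp:class:`any` is a faithful copy.

.. container:: example-code

.. code-block:: cpp

#include <string>

#include <arbor/util/any.hpp> // assumed public header for util::any

using arb::util::any;
using arb::util::any_cast;

any a(std::string("hello"));       // store a std::string
auto s = any_cast<std::string>(a); // retrieve a copy of the stored value

// The pointer overload of any_cast returns nullptr instead of throwing
// bad_any_cast when the requested type does not match the stored type.
if (auto* p = any_cast<int>(&a)) {
    // not reached: a holds a std::string, so p is nullptr
}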
.. cpp:class:: unique_any
Equivalent to :cpp:class:`util::any`, except that:
* it can store any type that is move constructible;
* it is move only, that is, it can't be copied.
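For example, a move-only value such as a ``std::unique_ptr`` can not be stored in a
:cpp:class:`util::any`, but can be moved into a :cpp:class:`unique_any`. A minimal
sketch, in which the header path ``arbor/util/unique_any.hpp`` is an assumption:

.. container:: example-code

.. code-block:: cpp

#include <memory>

#include <arbor/util/unique_any.hpp> // assumed public header for util::unique_any

using arb::util::unique_any;

// std::unique_ptr<int> is move-only, so it can't be copied into util::any,
// but it can be moved into a unique_any.
unique_any u(std::make_unique<int>(42));

// unique_any itself is move-only: moving is allowed, copying is not.
unique_any v = std::move(u);
// unique_any w = v; // error: unique_any can't be copied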
...@@ -3,207 +3,7 @@
Domain Decomposition
====================

The C++ API for partitioning a model over distributed and local hardware is described here.
Arbor provides two library APIs for working with hardware resources:
* The core *libarbor* is used to *describe* the hardware resources
and their contexts for use in Arbor simulations.
* The *libarborenv* provides an API for querying available hardware
resources (e.g. the number of available GPUs), and initializing MPI.
Managing Hardware
-----------------
The *libarborenv* API for querying and managing hardware resources is in the
:cpp:any:`arbenv` namespace. This functionality is in a separate
library because the main Arbor library should only
present an interface for running simulations on hardware resources provided
by the calling application. As such, it should not provide access to how
it manages hardware resources internally, or place restrictions on how
the calling application selects or manages resources such as GPUs and MPI communicators.
However, for the purpose of writing tests, examples, benchmarks and validation
tests, functionality for detecting GPUs, managing MPI lifetimes and the like
is necessary. This functionality is kept in a separate library to ensure
separation of concerns, and to provide examples of quality implementations
of such functionality for users of the library to reuse.
.. cpp:namespace:: arbenv
.. cpp:function:: arb::optional<int> get_env_num_threads()
Tests whether the number of threads to use has been set in an environment variable.
First checks ``ARB_NUM_THREADS``, and if that is not set checks ``OMP_NUM_THREADS``.
Return value:
* no value: the :cpp:any:`optional` return value contains no value if no thread count was specified by an environment variable.
* has value: the number of threads set by the environment variable.
Exceptions:
* throws :cpp:any:`std::runtime_error` if the environment variable is set with an invalid
number of threads.
.. container:: example-code
.. code-block:: cpp
if (auto nt = arbenv::get_env_num_threads()) {
std::cout << "requested " << nt.value() << "threads \n";
}
else {
std::cout << "no enviroment variable set\n";
}
.. cpp:function:: int thread_concurrency()
Attempts to detect the number of available CPU cores. Returns 1 if unable to detect
the number of cores.
.. container:: example-code
.. code-block:: cpp
// Set num_threads to value from environment variable if set,
// otherwise set it to the available number of cores.
int num_threads = 0;
if (auto nt = arbenv::get_env_num_threads()) {
num_threads = nt.value();
}
else {
num_threads = arbenv::thread_concurrency();
}
.. cpp:function:: int default_gpu()
Detects if a GPU is available, and returns the index of the first available GPU.
Return value:
* non-negative value: if a GPU is available, the index of the selected GPU is returned. The index will be in the range ``[0, num_gpus)`` where ``num_gpus`` is the number of GPUs detected using the ``cudaGetDeviceCount`` `CUDA API call <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html>`_.
* -1: if no GPU available, or if Arbor was built without GPU support.
.. container:: example-code
.. code-block:: cpp
if (arbenv::default_gpu()>-1) {
std::cout << "a GPU is available\n";
}
.. cpp:function:: int find_private_gpu(MPI_Comm comm)
A helper function that assigns a unique GPU to every MPI rank.
.. cpp:class:: with_mpi
Purpose and functionality
Constructor
Usage notes.
The core Arbor library *libarbor* provides an API for describing the hardware resources to be used by a simulation.
.. cpp:namespace:: arb
.. cpp:class:: proc_allocation
Enumerates the computational resources to be used for a simulation, typically a
subset of the resources available on a physical hardware node.
.. container:: example-code
.. code-block:: cpp
// Default construction uses all detected cores/threads, and the first GPU, if available.
arb::proc_allocation resources;
// Remove any GPU from the resource description.
resources.gpu_id = -1;
.. cpp:function:: proc_allocation() = default
Sets the number of threads to the number detected by :cpp:func:`get_local_resources`, and
chooses either the first available GPU, or no GPU if none are available.
.. cpp:function:: proc_allocation(unsigned threads, int gpu_id)
Constructor that sets the number of :cpp:var:`threads` and the :cpp:var:`gpu_id` of the GPU to use.
.. cpp:member:: unsigned num_threads
The number of CPU threads available.
.. cpp:member:: int gpu_id
The identifier of the GPU to use.
The gpu id corresponds to the ``int device`` parameter used by CUDA API calls
to identify gpu devices.
Set to -1 to indicate that no GPU device is to be used.
See ``cudaSetDevice`` and ``cudaDeviceGetAttribute`` provided by the
`CUDA API <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html>`_.
.. cpp:function:: bool has_gpu() const
Indicates whether a GPU is selected (i.e. whether :cpp:member:`gpu_id` is not ``-1``).
Execution Context
-----------------
The :cpp:class:`proc_allocation` class enumerates the hardware resources on the local hardware
to use for a simulation.
.. cpp:namespace:: arb
.. cpp:class:: context
A handle for the interfaces to the hardware resources used in a simulation.
A :cpp:class:`context` contains the local thread pool, and optionally the GPU state
and MPI communicator, if available. Users of the library do not directly use the functionality
provided by :cpp:class:`context`, instead they configure contexts, which are passed to
Arbor methods and types.
.. cpp:function:: context make_context()
Local context that uses all detected threads and a GPU if any are available.
.. cpp:function:: context make_context(proc_allocation alloc)
Local context that uses the local resources described by :cpp:var:`alloc`.
.. cpp:function:: context make_context(proc_allocation alloc, MPI_Comm comm)
A context that uses the local resources described by :cpp:var:`alloc`, and
uses the MPI communicator :cpp:var:`comm` for distributed calculation.
Here are some examples of how to create a :cpp:class:`arb::context`:
.. container:: example-code
.. code-block:: cpp
#include <arbor/context.hpp>
// Construct a non-distributed context that uses all detected available resources.
auto context = arb::make_context();
// Construct a context that:
// * does not use a GPU, regardless of whether one is available;
// * uses 8 threads in its thread pool.
arb::proc_allocation resources(8, -1);
auto context = arb::make_context(resources);
// Construct a context that:
// * uses all available local hardware resources;
// * uses the standard MPI communicator MPI_COMM_WORLD for distributed computation.
arb::proc_allocation resources; // defaults to all detected local resources
auto context = arb::make_context(resources, MPI_COMM_WORLD);
Load Balancers
--------------

...@@ -217,11 +17,11 @@ distributed with MPI communication. The returned :cpp:class:`domain_decomposition`
describes the cell groups on the local MPI rank.

.. Note::
The :cpp:class:`domain_decomposition` type is
independent of any load balancing algorithm, so users can define a
domain decomposition directly, instead of generating it with a load balancer.
This is useful for cases where the provided load balancers are inadequate,
or when the user has specific insight into running their model on the
target computer.

.. cpp:namespace:: arb
......
...@@ -74,11 +74,11 @@ To support dry-run mode we use the following classes:
.. Note::
While this class inherits from :cpp:class:`arb::recipe`, it breaks one of its implicit
rules: it allows connection from gids greater than the total number of cells in a recipe,
:cpp:any:`ncells`.

:cpp:class:`arb::tile` describes the model on a single domain containing :cpp:expr:`num_cells =
num_cells_per_tile` cells, which is to be duplicated over :cpp:any:`num_ranks`
domains in dry-run mode. It contains information about :cpp:any:`num_ranks` which is provided
by the following function:

.. cpp:function:: cell_size_type num_tiles() const
......
.. _cpphardware:
Hardware Management
===================
Arbor provides two library APIs for working with hardware resources:
* The core *libarbor* is used to *describe* the hardware resources
and their contexts for use in Arbor simulations.
* The *libarborenv* provides an API for querying available hardware
resources (e.g. the number of available GPUs), and initializing MPI.
libarborenv
-------------------
The *libarborenv* API for querying and managing hardware resources is in the
:cpp:any:`arbenv` namespace.
This functionality is kept in a separate library to enforce
separation of concerns, so that users have full control over how hardware resources
are selected, either using the functions and types in *libarborenv*, or writing their
own code for managing MPI, GPUs, and thread counts.
.. cpp:namespace:: arbenv
.. cpp:function:: arb::util::optional<int> get_env_num_threads()
Tests whether the number of threads to use has been set in an environment variable.
First checks ``ARB_NUM_THREADS``, and if that is not set checks ``OMP_NUM_THREADS``.
Return value:
* **no value**: the :cpp:any:`optional` return value contains no value if
no thread count was specified by an environment variable.
* **has value**: the number of threads set by the environment variable.
Throws:
* :cpp:any:`std::runtime_error`: if the environment variable is set with an invalid
number of threads.
.. container:: example-code
.. code-block:: cpp
#include <arborenv/concurrency.hpp>
if (auto nt = arbenv::get_env_num_threads()) {
std::cout << "requested " << nt.value() << "threads \n";
}
else {
std::cout << "no environment variable set\n";
}
.. cpp:function:: int thread_concurrency()
Attempts to detect the number of available CPU cores. Returns 1 if unable to detect
the number of cores.
.. container:: example-code
.. code-block:: cpp
#include <arborenv/concurrency.hpp>
// Set num_threads to value from environment variable if set,
// otherwise set it to the available number of cores.
int num_threads = 0;
if (auto nt = arbenv::get_env_num_threads()) {
num_threads = nt.value();
}
else {
num_threads = arbenv::thread_concurrency();
}
.. cpp:function:: int default_gpu()
Returns the integer identifier of the first available GPU, if a GPU is available.
Return value:
* **non-negative value**: if a GPU is available, the index of the selected GPU is returned. The index will be in the range ``[0, num_gpus)`` where ``num_gpus`` is the number of GPUs detected using the ``cudaGetDeviceCount`` `CUDA API call <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html>`_.
* **-1**: if no GPU available, or if Arbor was built without GPU support.
.. container:: example-code
.. code-block:: cpp
#include <arborenv/gpu_env.hpp>
if (arbenv::default_gpu()>-1) {
std::cout << "a GPU is available\n";
}
.. cpp:function:: int find_private_gpu(MPI_Comm comm)
A helper function that assigns a unique GPU to every MPI rank.
.. Note::
Arbor allows at most one GPU per MPI rank, and furthermore requires that
an MPI rank has exclusive access to a GPU, i.e. two MPI ranks can not
share a GPU.
This function performs the task of assigning a unique GPU to each rank when more
than one rank has access to the same GPU(s).
An example use case is on systems with "fat" nodes with multiple GPUs
per node, in which case Arbor should be run with multiple MPI ranks
per node.
Uniquely assigning GPUs is quite difficult, and this function provides
what we feel is a robust implementation.
All MPI ranks in the MPI communicator :cpp:any:`comm` must call this function, to
avoid a deadlock.
Return value:
* **non-negative integer**: the identifier of the GPU assigned to this rank.
* **-1**: no GPU was available for this MPI rank.
Throws:
* :cpp:any:`std::runtime_error`: if there was an error in the CUDA runtime
on the local or remote MPI ranks, i.e. if one rank throws, all ranks
will throw.
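For example, the following sketch selects a private GPU on each rank; it assumes
that MPI has already been initialized, for example with the :cpp:class:`with_mpi`
guard described below.

.. container:: example-code

.. code-block:: cpp

#include <mpi.h>

#include <arborenv/gpu_env.hpp>

// Every rank in MPI_COMM_WORLD must make this call to avoid a deadlock.
int gpu_id = arbenv::find_private_gpu(MPI_COMM_WORLD);

if (gpu_id < 0) {
    // No GPU is available for this rank: fall back to CPU-only execution.
}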
.. cpp:class:: with_mpi
The :cpp:class:`with_mpi` type is a simple RAII scoped guard for MPI initialization
and finalization. On creation :cpp:class:`with_mpi` will call :cpp:any:`MPI_Init_thread`
to initialize MPI with the minimum level of thread support required by Arbor, that is
``MPI_THREAD_SERIALIZED``. When it goes out of scope it will automatically call
:cpp:any:`MPI_Finalize`.
.. cpp:function:: with_mpi(int& argcp, char**& argvp, bool fatal_errors = true)
The constructor takes the :cpp:any:`argc` and :cpp:any:`argv` arguments
passed to the main function of the calling application, and an additional flag
:cpp:any:`fatal_errors` that toggles whether errors in MPI API calls
should return error codes or terminate.
.. Warning::
Handling exceptions is difficult in MPI applications, and it is the user's
responsibility to do so.
The :cpp:class:`with_mpi` scope guard attempts to facilitate error reporting of
uncaught exceptions, particularly in the case where one rank throws an exception,
while the other ranks continue executing. In this case there would be a deadlock
if the rank with the exception attempts to call :cpp:any:`MPI_Finalize` and
other ranks are waiting in other MPI calls. If this happens inside a try-catch
block, the deadlock stops the exception from being handled.
For this reason the destructor of :cpp:class:`with_mpi` only calls
:cpp:any:`MPI_Finalize` if there are no uncaught exceptions.
This isn't perfect because the other MPI ranks still deadlock,
however it gives the exception handling code an opportunity to report the error for debugging.
An example workflow that uses the MPI scope guard. Note that this code will
print the exception error message in the case where only one MPI rank threw
an exception, though it would then either deadlock or exit with an error code
indicating that one or more MPI ranks exited without calling :cpp:any:`MPI_Finalize`.
.. container:: example-code
.. code-block:: cpp
#include <exception>
#include <iostream>
#include <arborenv/with_mpi.hpp>
int main(int argc, char** argv) {
try {
// Constructing guard will initialize MPI with a
// call to MPI_Init_thread()
arbenv::with_mpi guard(argc, argv, false);
// Do some work with MPI here
// When leaving this scope, the destructor of guard will
// call MPI_Finalize()
}
catch (std::exception& e) {
std::cerr << "error: " << e.what() << "\n";
return 1;
}
return 0;
}
libarbor
-------------------
The core Arbor library *libarbor* provides an API for:
* prescribing which hardware resources are to be used by a
simulation using :cpp:class:`arb::proc_allocation`.
* creating opaque handles, of type :cpp:class:`arb::context`, to the hardware
resources used by simulations.
.. cpp:namespace:: arb
.. cpp:class:: proc_allocation
Enumerates the computational resources on a node to be used for simulation,
specifically the number of threads and identifier of a GPU if available.
.. Note::
Each MPI rank in a distributed simulation uses a :cpp:class:`proc_allocation`
to describe the subset of resources on its node that it will use.
.. container:: example-code
.. code-block:: cpp
#include <arbor/context.hpp>
// default: 1 thread and no GPU selected
arb::proc_allocation resources;
// 8 threads and no GPU
arb::proc_allocation resources(8, -1);
// 4 threads and the first available GPU
arb::proc_allocation resources(4, 0);
// Construct with thread count and GPU detected using libarborenv helpers
auto num_threads = arbenv::thread_concurrency();
auto gpu_id = arbenv::default_gpu();
arb::proc_allocation resources(num_threads, gpu_id);
.. cpp:function:: proc_allocation() = default
By default selects one thread and no GPU.
.. cpp:function:: proc_allocation(unsigned threads, int gpu_id)
Constructor that sets the number of :cpp:var:`threads` and the id :cpp:var:`gpu_id` of
the GPU to use.
.. cpp:member:: unsigned num_threads
The number of CPU threads available.
.. cpp:member:: int gpu_id
The identifier of the GPU to use.
The gpu id corresponds to the ``int device`` parameter used by CUDA API calls
to identify gpu devices.
Set to -1 to indicate that no GPU device is to be used.
See ``cudaSetDevice`` and ``cudaDeviceGetAttribute`` provided by the
`CUDA API <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html>`_.
.. cpp:function:: bool has_gpu() const
Indicates whether a GPU is selected (i.e. whether :cpp:member:`gpu_id` is not ``-1``).
.. cpp:namespace:: arb
.. cpp:class:: context
An opaque handle for the hardware resources used in a simulation.
A :cpp:class:`context` contains a thread pool, and optionally the GPU state
and MPI communicator. Users of the library do not directly use the functionality
provided by :cpp:class:`context`; instead they create contexts, which are passed to
Arbor interfaces for domain decomposition and simulation.
Arbor contexts are created by calling :cpp:func:`make_context`, which returns an initialized
context. There are two versions of :cpp:func:`make_context`, for creating contexts
with and without distributed computation with MPI respectively.
.. cpp:function:: context make_context(proc_allocation alloc=proc_allocation())
Create a local :cpp:class:`context`, without distributed computation (no MPI),
that uses the local resources described by :cpp:any:`alloc`.
By default it will create a context with one thread and no GPU.
.. cpp:function:: context make_context(proc_allocation alloc, MPI_Comm comm)
Create a distributed :cpp:class:`context`.
A context that uses the local resources described by :cpp:any:`alloc`, and
uses the MPI communicator :cpp:var:`comm` for distributed calculation.
Helper functions can be used to query a context for information about which features
it has enabled: whether it has a GPU, how many threads are in its thread pool, and so on.
.. cpp:function:: bool has_gpu(const context&)
Query if the context has a GPU.
.. cpp:function:: unsigned num_threads(const context&)
Query the number of threads in a context's thread pool.
.. cpp:function:: bool has_mpi(const context&)
Query if the context has an MPI communicator.
.. cpp:function:: unsigned num_ranks(const context&)
Query the number of distributed ranks. If the context has an MPI
communicator, the result is equivalent to :cpp:any:`MPI_Comm_size`.
If the context has no MPI communicator, returns 1.
.. cpp:function:: unsigned rank(const context&)
Query the rank of the calling process. If the context has an MPI
communicator, the result is equivalent to :cpp:any:`MPI_Comm_rank`.
If the context has no MPI communicator, returns 0.
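For example, these helpers can be used to print a short summary of a context:

.. container:: example-code

.. code-block:: cpp

#include <iostream>

#include <arbor/context.hpp>

auto context = arb::make_context();

std::cout << "threads: " << arb::num_threads(context) << "\n"
          << "gpu:     " << (arb::has_gpu(context)? "yes": "no") << "\n"
          << "mpi:     " << (arb::has_mpi(context)? "yes": "no") << "\n"
          << "ranks:   " << arb::num_ranks(context) << "\n";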
Here are some simple examples of how to create a :cpp:class:`arb::context` using
:cpp:func:`make_context`.
.. container:: example-code
.. code-block:: cpp
#include <arbor/context.hpp>
// Construct a context that uses 1 thread and no GPU or MPI
auto context = arb::make_context();
// Construct a context that:
// * uses 8 threads in its thread pool.
// * does not use a GPU, regardless of whether one is available;
// * does not use MPI
arb::proc_allocation resources(8, -1);
auto context = arb::make_context(resources);
// Construct one that uses:
// * 4 threads and the first GPU.
// * MPI_COMM_WORLD for distributed computation.
arb::proc_allocation resources(4, 0);
auto mpi_context = arb::make_context(resources, MPI_COMM_WORLD);
Here is a more complicated example of creating a :cpp:class:`context` on a
system where GPU and MPI support are conditional.
.. container:: example-code
.. code-block:: cpp
#include <iostream>

#include <arbor/context.hpp>
#include <arbor/version.hpp> // for ARB_MPI_ENABLED
#include <arborenv/concurrency.hpp>
#include <arborenv/gpu_env.hpp>
#ifdef ARB_MPI_ENABLED
#include <arborenv/with_mpi.hpp>
#endif
int main(int argc, char** argv) {
try {
arb::proc_allocation resources;
// try to detect how many threads can be run on this system
resources.num_threads = arbenv::thread_concurrency();
// override thread count if the user set ARB_NUM_THREADS
if (auto nt = arbenv::get_env_num_threads()) {
resources.num_threads = nt.value();
}
#ifdef ARB_MPI_ENABLED
// initialize MPI
arbenv::with_mpi guard(argc, argv, false);
// assign a unique gpu to this rank if available
resources.gpu_id = arbenv::find_private_gpu(MPI_COMM_WORLD);
// create a distributed context
auto context = arb::make_context(resources, MPI_COMM_WORLD);
bool root = arb::rank(context) == 0; // true only on MPI rank 0
#else
resources.gpu_id = arbenv::default_gpu();
// create a local context
auto context = arb::make_context(resources);
#endif
// Print a banner with information about hardware configuration
std::cout << "gpu: " << (has_gpu(context)? "yes": "no") << "\n";
std::cout << "threads: " << num_threads(context) << "\n";
std::cout << "mpi: " << (has_mpi(context)? "yes": "no") << "\n";
std::cout << "ranks: " << num_ranks(context) << "\n" << std::endl;
// run some simulations!
}
catch (std::exception& e) {
std::cerr << "exception caught in ring miniapp: " << e.what() << "\n";
return 1;
}
return 0;
}
Arbor
=====

.. image:: https://travis-ci.org/arbor-sim/arbor.svg?branch=master
    :target: https://travis-ci.org/arbor-sim/arbor

What is Arbor?
--------------
...@@ -26,10 +26,11 @@ Arbor is designed from the ground up for **many core** architectures:
Features
--------

We are actively developing `Arbor <https://github.com/arbor-sim/arbor>`_, improving performance and adding features.
Some key features include:

* Optimized back end for CUDA
* Optimized vector back ends for Intel (KNL, AVX, AVX2) and Arm (ARMv8-A NEON) intrinsics.
* Asynchronous spike exchange that overlaps compute and communication.
* Efficient sampling of voltage and current on all back ends.
* Efficient implementation of all features on GPU.
...@@ -47,6 +48,7 @@ Some key features include:
model_intro
model_common
model_hardware
model_recipe
model_domdec
model_simulation
...@@ -59,6 +61,7 @@ Some key features include:
cpp_intro
cpp_common
cpp_hardware
cpp_recipe
cpp_domdec
cpp_simulation
......
...@@ -299,12 +299,12 @@ and `ARM options <https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html>`_.
cmake -DARB_ARCH=skylake-avx512 # skylake with avx512 (Xeon server)
cmake -DARB_ARCH=knl # Xeon Phi KNL
# ARM Arm8a
cmake -DARB_ARCH=armv8-a
# IBM Power8
cmake -DARB_ARCH=power8
# IBM Arm8a
cmake -DARB_ARCH=armv8-a
.. _vectorize:

Vectorization
...@@ -321,7 +321,7 @@ for the architecture, enabling ``ARB_VECTORIZE`` will lead to a compilation error.
With this flag set, the library will use architecture-specific vectorization intrinsics
to implement these kernels. Arbor currently has vectorization support for x86 architectures
with AVX, AVX2 or AVX512 ISA extensions, and for ARM architectures with support for AArch64 NEON intrinsics (first available on ARMv8-A).
.. _gpu:
......
...@@ -18,17 +18,3 @@ A *load balancer* generates the domain decomposition using the
model recipe and a description of the available computational resources on which the model will run, described by an execution context.
Currently Arbor provides one load balancer, and more will be added over time.
Hardware
--------
*Local resources* are locally available computational resources, specifically the number of hardware threads and the number of GPUs.
An *allocation* enumerates the computational resources to be used for a simulation, typically a subset of the resources available on a physical hardware node.
Execution Context
-----------------
An *execution context* contains the local thread pool, and optionally the GPU state and MPI communicator, if available. Users of the library configure contexts, which are passed to Arbor methods and types.
See :ref:`cppdomdec` for documentation of the C++ interface for domain decomposition.
.. _modelhardware:
Hardware
========
*Local resources* are locally available computational resources, specifically the number of hardware threads and the number of GPUs.
An *allocation* enumerates the computational resources to be used for a simulation, typically a subset of the resources available on a physical hardware node.
.. Note::
New users may find working with contexts a little verbose.
The design is very deliberate, to allow fine-grained control over which
computational resources an Arbor simulation should use.
As a result Arbor is much easier to integrate into workflows that
run multiple applications or libraries on the same node, because
Arbor has a direct API for using on-node resources (threads and GPU)
and distributed resources (MPI) that have been partitioned between
applications/libraries.
Execution Context
-----------------
An *execution context* contains the local thread pool, and optionally the GPU state and MPI communicator, if available. Users of the library configure contexts, which are passed to Arbor methods and types.
See :ref:`cppdomdec` for documentation of the C++ interface for domain decomposition.