diff --git a/doc/simd_api.rst b/doc/simd_api.rst
index 589f53cd826f4b198f600a88405c24249947e862..f8c0ce37bbdc1bf575727d2006f01b6089240f92 100644
--- a/doc/simd_api.rst
+++ b/doc/simd_api.rst
@@ -74,6 +74,10 @@ Three user-facing template classes are provided:
 
    The result of a ``where`` expression, used for masked assignment.
 
+There is, in addition, a templated class ``simd_detail::indirect_expression``
+that holds the result of an `indirect(...)` expression. These arise in
+gather and scatter operations, and are detailed below.
+
 Implementation typemaps live in the ``simd_abi`` namespace, while concrete
 implementation classes live in ``simd_detail``. A particular specialization
 for an architecture, for example 4-wide double on AVX, then requires:
@@ -89,6 +93,34 @@ for an architecture, for example 4-wide double on AVX, then requires:
 The maximum natively supported width for a scalar type *V* is recorded in
 ``simd_abi::native_width<V>::value``.
 
+Indirect expressions
+^^^^^^^^^^^^^^^^^^^^
+
+An expression of the form ``indirect(p, k)`` or ``indirect(p, k, constraint)`` describes
+a sequence of memory locations based at the pointer *p* with offsets given by the
+``simd`` variable *k*. A constraint of type ``index_constraint`` can be provided, which
+promises certain guarantees on the index values in *k*:
+
+.. list-table::
+    :widths: 20 80
+    :header-rows: 1
+
+    * - Constraint
+      - Guarantee
+
+    * - ``index_constraint::none``
+      - No restrictions.
+
+    * - ``index_constraint::independent``
+      - No indices are repeated, i.e. *k*\ `i`:sub: = *k*\ `j`:sub: implies *i* = *j*.
+
+    * - ``index_constraint::contiguous``
+      - Indices are sequential, i.e. *k*\ `i`:sub: = *k*\ `0`:sub: + *i*.
+
+    * - ``index_constraint::constant``
+      - Indices are all equal, i.e. *k*\ `i`:sub: = *k*\ `j`:sub: for all *i* and *j*.
+
+
 Class ``simd``
 ^^^^^^^^^^^^^^
 
@@ -99,9 +131,10 @@ arithmetic operators for a concrete implementation class `C`.
 In the following:
 
 * *S* stands for the class ``simd<V, N, I>``.
-* *s* is an SIMD value of type *S*.
+* *s* is a SIMD value of type *S*.
 * *m* is a mask value of type ``S::simd_mask``.
 * *t*, *u* and *v* are const objects of type *S*.
+* *w* is a SIMD value of type ``simd<W, N, J>``.
 * *i* is an index of type ``int``.
 * *j* is a const object of type ``simd<U, N, J>`` where *U* is an integral type.
 * *x* is a value of type *V*.
@@ -152,6 +185,12 @@ Here and below, the value in lane *i* of a SIMD vector or mask *v* is denoted by
     * - ``S(c)``
       - A SIMD value *v* with *v*\ `i`:sub: equal to ``c[i]`` for *i* = 0â€¦*N*-1.
 
+    * - ``S(w)``
+      - A copy or value-cast of the SIMD value *w* of a different type but same width.
+
+    * - ``S(indirect(p, j))``
+      - A SIMD value *v* with *v*\ `i`:sub: equal to ``p[j[i]]`` for *i* = 0â€¦*N*-1.
+
     * - ``S(c, m)``
       - A SIMD value *v* with *v*\ `i`:sub: equal to ``c[i]`` for *i* where *m*\ `i`:sub: is true.
 
@@ -169,7 +208,7 @@ Here and below, the value in lane *i* of a SIMD vector or mask *v* is denoted by
       - ``void``
       - Set ``p[i]`` to *t*\ `i`:sub: for *i* = 0â€¦*N*-1.
 
-    * - ``t.scatter(p, j)``
+    * - ``t.copy_to(indirect(p, j))``
       - ``void``
       - Set ``p[j[i]]`` to *t*\ `i`:sub: for *i* = 0â€¦*N*-1.
 
@@ -177,10 +216,14 @@ Here and below, the value in lane *i* of a SIMD vector or mask *v* is denoted by
       - ``void``
       - Set *s*\ `i`:sub: to ``c[i]`` for *i* = 0â€¦*N*-1.
 
-    * - ``s.gather(c, j)``
+    * - ``s.copy_from(indirect(c, j))``
       - ``void``
       - Set *s*\ `i`:sub: to ``c[j[i]]`` for *i* = 0â€¦*N*-1.
 
+    * - ``s.sum()``
+      - ``V``
+      - Sum of *s*\ `i`:sub: for *i* = 0â€¦*N*-1.
+
 .. rubric:: Expressions
 
 .. list-table::
@@ -259,6 +302,18 @@ Here and below, the value in lane *i* of a SIMD vector or mask *v* is denoted by
       - ``S&``
       - Equivalent to ``s=S(x)``.
 
+    * - ``indirect(p, j)=t``
+      - ``decltype(indirect(p, j))&``
+      - Equivalent to ``t.copy_to(indirect(p, j))``.
+
+    * - ``indirect(p, j)+=t``
+      - ``decltype(indirect(p, j))&``
+      - Compound indirect assignment: ``p[j[i]]+=t[i]`` for *i* = 0â€¦*N*-1.
+
+    * - ``indirect(p, j)-=t``
+      - ``decltype(indirect(p, j))&``
+      - Compound indirect assignment: ``p[j[i]]-=t[i]`` for *i* = 0â€¦*N*-1.
+
     * - ``t[i]``
       - ``V``
       - Value *t*\ `i`:sub:
@@ -291,7 +346,7 @@ In the following:
 * *m* and *q* are const objects of type ``simd_mask<V, N, I>``.
 * *u* is an object of type ``simd_mask<V, N, I>``.
 * *b* is a boolean value.
-* *w* is a pointer to ``bool``.
+* *q* is a pointer to ``bool``.
 * *y* is a const pointer to ``bool`` or a length *N* array of ``bool``.
 * *i* is of type ``int``.
 * *k* is of type ``unsigned long long``.
@@ -326,9 +381,9 @@ Note that ``simd_mask`` does not (currently) offer a masked pointer/array constr
       - Type
       - Description
 
-    * - ``m.copy_to(w)``
+    * - ``m.copy_to(q)``
       - ``void``
-      - Write the boolean value *m*\ `i`:sub: to ``w[i]`` for *i* = 0â€¦*N*-1.
+      - Write the boolean value *m*\ `i`:sub: to ``q[i]`` for *i* = 0â€¦*N*-1.
 
     * - ``u.copy_from(y)``
       - ``void``
@@ -435,7 +490,7 @@ In the following:
       - ``void``
       - Set ``p[i]`` to *s*\ `i`:sub: for *i* where *m*\ `i`:sub: is true.
 
-    * - ``where(m, s).scatter(p, j)``
+    * - ``where(m, s).copy_to(indirect(p, j))``
       - ``void``
       - Set ``p[j[i]]`` to *s*\ `i`:sub: for *i* where *m*\ `i`:sub: is true.
 
@@ -443,7 +498,7 @@ In the following:
       - ``void``
       - Set *s*\ `i`:sub: to ``c[i]`` for *i* where *m*\ `i`:sub: is true.
 
-    * - ``where(m, s).gather(c, j)``
+    * - ``where(m, s).copy_from(indirect(c, j))``
       - ``void``
       - Set *s*\ `i`:sub: to ``c[j[i]]`` for *i* where *m*\ `i`:sub: is true.
 
@@ -461,10 +516,13 @@ refer to the `vector transcendental functions documentation <simd_maths_>`_ for
 
 In the following:
 
+* *I* and *J* are SIMD implementations.
 * *A* is a SIMD class ``simd<K, N, I>`` for some scalar type *K*.
 * *S* is a SIMD class ``simd<V, N, I>`` for a floating point type *V*.
+* *L* is a scalar type implicitly convertible from *K*.
 * *a* and *b* are values of type *A*.
 * *s* and *t* are values of type *S*.
+* *r* is a value of type ``std::array<K, N>``.
 
 .. list-table::
     :widths: 20 20 60
@@ -514,6 +572,18 @@ In the following:
       - *S*
       - Lane-wise raise *s* to the power of *t*.
 
+    * - ``simd_cast<std::array<L, N>>(a)``
+      - ``std::array<L, N>``
+      - Lane-wise cast of values in *a* to scalar type *L* in ``std::array<L, N>``.
+
+    * - ``simd_cast<simd<L, N, J>>(a)``
+      - ``simd<L, N, J>``
+      - Lane-wise cast of values in *a* to scalar type *L* in ``simd<L, N, J>``.
+
+    * - ``simd_cast<simd<L, N, J>>(r)``
+      - ``simd<L, N, J>``
+      - Lane-wise cast of values in the ``std::array<K, N>`` value *r* to scalar type *L* in ``simd<L, N, J>``.
+
 
 Implementation requirements
 ---------------------------
@@ -529,6 +599,9 @@ functionality. The base class ``implbase<C>`` in turn relies upon
 
 All the required SIMD operations are given by static member functions of *C*.
 
+Some arguments to static member functions use a tag class (``simd_detail::tag``)
+parameterized on a concrete implementation class for dispatch purposes.
+
 Minimal implementation
 ^^^^^^^^^^^^^^^^^^^^^^
 
@@ -605,12 +678,15 @@ a SIMD class of width *N* and value type *V*.
 * *p* is a pointer of type ``C::scalar_type*``.
 * *j* is a SIMD index representation of type ``J::vector_type`` for
   an integral concrete implementation class *J*.
+* *d* is a SIMD representation of type ``D::vector_type`` for
+  a (different) concrete implementation class *D*.
 * *b* is a ``bool`` value.
-* *w* is a pointer to ``bool``.
+* *q* is a pointer to ``bool``.
 * *y* is a const pointer to ``bool``.
 * *i* is an unsigned (index) value.
 * *k* is an unsigned long long value.
 * *m* is a mask representation of type ``C::mask_type``.
+* *z* is an ``index_constraint`` value.
 
 .. rubric:: Types and constants
 
@@ -652,6 +728,11 @@ a SIMD class of width *N* and value type *V*.
       - Type
       - Description
 
+    * - ``C::cast_from(tag<W>{}, d)``
+      - ``C::vector_type``
+      - Return a vector with values *v*\ `i`:sub: = *d*\ `i`:sub:, where ``D::scalar_type``
+        is implicitly convertible to ``C::scalar_type``.
+
     * - ``C::broadcast(x)``
       - ``C::vector_type``
       - Fill representation with scalar *x*.
@@ -672,9 +753,9 @@ a SIMD class of width *N* and value type *V*.
       - ``C::vector_type``
       - Return a vector with values *v*\ `i`:sub: loaded from *c+i* wherever *m*\ `i`:sub: is true. *c* may be unaligned.
 
-    * - ``C::copy_from_masked(w, c, m)``
+    * - ``C::copy_from_masked(u, c, m)``
       - ``void``
-      - Return a vector with values *v*\ `i`:sub: loaded from *c+i* wherever *m*\ `i`:sub: is true, or equal to *w*\ `i`:sub
+      - Return a vector with values *v*\ `i`:sub: loaded from *c+i* wherever *m*\ `i`:sub: is true, or equal to *u*\ `i`:sub:
         otherwise. *c* may be unaligned.
 
 .. rubric:: Lane access
@@ -719,22 +800,46 @@ of type *J*, used only to disambiguate overloads.
       - Type
       - Description
 
-    * - ``C::gather(J{}, p, j)``
+    * - ``C::gather(tag<J>{}, p, j)``
       - ``C::vector_type``
       - Vector *v* with values *v*\ `i`:sub: = ``p[j[i]]``.
 
-    * - ``C::gather(J{}, u, p, j, m)``
+    * - ``C::gather(tag<J>{}, u, p, j, m)``
       - ``C::vector_type``
       - Vector *v* with values *v*\ `i`:sub: = *m*\ `i`:sub: ? ``p[j[i]]`` : *u*\ `i`:sub:.
 
-    * - ``C::scatter(J{}, u, p, j)``
+    * - ``C::scatter(tag<J>{}, u, p, j)``
       - ``void``
       - Write values *u*\ `i`:sub: to ``p[j[i]]``.
 
-    * - ``C::scatter(J{}, u, p, j, m)``
+    * - ``C::scatter(tag<J>{}, u, p, j, m)``
       - ``void``
       - Write values *u*\ `i`:sub: to ``p[j[i]]`` for lanes *i* where *m*\ `i`:sub: is true.
 
+    * - ``C::compound_indexed_add(tag<J>{}, u, p, j, z)``
+      - ``void``
+      - Update values ``p[j[i]] += u[i]`` for lanes *i*, subject to constraint *z*.
+
+.. rubric:: Casting
+
+Implementations can provide optimized versions of lane-wise
+value casting from other specific implementation classes.
+
+The first argument is a dummy argument
+of type *J*, used only to disambiguate overloads.
+
+.. list-table::
+    :header-rows: 1
+    :widths: 20 20 60
+
+    * - Expression
+      - Type
+      - Description
+
+    * - ``C::cast_from(tag<J>{}, d)``
+      - ``C::vector_type``
+      - Returns vector *v* with values *v*\ `i`:sub: = *d*\ `i`:sub:, cast from ``D::scalar_type`` to ``C::scalar_type``.
+
 .. rubric:: Arithmetic operations
 
 .. list-table::
@@ -769,6 +874,10 @@ of type *J*, used only to disambiguate overloads.
       - ``C::vector_type``
       - Lane-wise fused multiply-add (u*v+w).
 
+    * - ``C::reduce_add(u)``
+      - ``C::scalar_type``
+      - (Horizontal) sum of values *u*\ `i`:sub: in each lane.
+
 .. rubric:: Comparison and blends
 
 .. list-table::
@@ -885,7 +994,7 @@ SIMD mask class.
       - ``void``
       - Set mask value *u*\ `i`:sub: to *b*.
 
-    * - ``C::mask_copy_to(v, w)``
+    * - ``C::mask_copy_to(v, q)``
       - ``void``
       - Write bool values to memory (unaligned).
 
@@ -935,19 +1044,9 @@ There is no support yet for the following features, although some of these
 will need to be provided in order to improve the efficiency of SIMD versions
 of our generated mechanisms.
 
-* A SIMD cast function, e.g. ``simd_cast<S>(const T&)`` that converts between
-  different SIMD wrappers of the same width. The infrastructure that supports
-  this in concrete implementation classes would also simplify the implementation
-  of more generic ``gather`` and ``scatter`` methods.
-
-* Horizontal reductions across the lanes of a SIMD value or where-expression.
-
+* Contraint-based dispatch for indirect operations other than ``+=`` and ``-=``.
 * Vectorizable implementations of trigonometric functions.
 
-* Compound assignment operations for where-expressions. Extending the concrete
-  implementation API to support this would allow, for example, efficient use
-  of AVX512 masked arithmetic instructions.
-
 .. _simd_maths:
 
 Implementation of vector transcendental functions