diff --git a/README.md b/README.md index dc2e7d720fa5452c15f2db6b2e7ad6001928cb85..f7e1dda77ddc7979d2b323ad2444bfddac4db599 100644 --- a/README.md +++ b/README.md @@ -77,3 +77,143 @@ cmake <path to CMakeLists.txt> -DWITH_TBB=ON -DSYSTEM_CRAY=ON cmake <path to CMakeLists.txt> -DWITH_TBB=ON -DWITH_MPI=ON -DSYSTEM_CRAY=ON ``` + +# targetting KNL + +## build modparser + +The source to source compiler "modparser" that generates the C++/CUDA kernels for the ion channels and synapses is in a separate repository. +It is included in our project as a git submodule, and by default it will be built with the same compiler and flags that are used to build the miniapp and tests. + +This can cause problems if we are cross compiling, e.g. for KNL, because the modparser compiler might not be runnable on the compilation node. +CMake will look for the source to source compiler executable, `modcc`, in the `PATH` environment variable, and will use the version if finds instead of building its own. + +Modparser requires a C++11 compiler, and has been tested on GCC, Intel, and Clang compilers + - if the default compiler on your is some ancient version of gcc you might need to load a module/set the CC and CXX environment variables. + +```bash +git clone git@github.com:eth-cscs/modparser.git +cd modparser + +# example of setting a C++11 compiler +export CXX=`which gcc-4.8` + +cmake . +make -j + +# set path and test that you can see modcc +export PATH=`pwd`/bin:$PATH +which modcc +``` + +## set up environment + +- source the intel compilers +- source the TBB vars +- I have only tested with the latest stable version from online, not the version that comes installed sometimes with the Intel compilers. + +## build miniapp + +```bash +# clone the repo and set up the submodules +git clone TODO +cd cell_algorithms +git submodule init +git submodule update + +# make a path for out of source build +mkdir build_knl +cd build_knl + +## build miniapp + +```bash +# clone the repo and set up the submodules +git clone TODO +cd cell_algorithms +git submodule init +git submodule update + +# make a path for out of source build +mkdir build_knl +cd build_knl + +# run cmake with all the magic flags +export CC=`which icc` +export CXX=`which icpc` +cmake .. -DCMAKE_BUILD_TYPE=release -DWITH_TBB=ON -DWITH_PROFILING=ON -DVECTORIZE_TARGET=KNL -DUSE_OPTIMIZED_KERNELS=ON +make -j +``` + +The flags passed into cmake are described: + - `-DCMAKE_BUILD_TYPE=release` : build in release mode with `-O3`. + - `-WITH_TBB=ON` : use TBB for threading on multicore + - `-DWITH_PROFILING=ON` : use internal profilers that print profiling report at end + - `-DVECTORIZE_TARGET=KNL` : generate AVX512 instructions, alternatively you can use: + - `AVX2` for Haswell & Broadwell + - `AVX` for Sandy Bridge and Ivy Bridge + - `-DUSE_OPTIMIZED_KERNELS=ON` : tell the source to source compiler to generate optimized kernels that use Intel extensions + - without these vectorized code will not be generated. + +## run tests + +Run some unit tests +```bash +cd tests +./test.exe +cd .. +``` + +## run miniapp + +The miniapp is the target for benchmarking. +First, we can run a small problem to check the build. +For the small test run, the parameters have the following meaning + - `-n 1000` : 1000 cells + - `-s 200` : 200 synapses per cell + - `-t 20` : simulated for 20ms + - `-p 0` : no file output of voltage traces + +The number of cells is the number of discrete tasks that are distributed to the threads in each large time integration period. +The number of synapses per cell is the amount of computational work per cell/task. +Realistic cells have anywhere in the range of 1,000-10,000 synapses per cell. + +```bash +cd miniapp + +# a small run to check that everything works +./miniapp.exe -n 1000 -s 200 -t 20 -p 0 + +# a larger run for generating meaninful benchmarks +./miniapp.exe -n 2000 -s 2000 -t 100 -p 0 +``` + +This generates the following profiler output (some reformatting to make the table work): + +``` + --------------------------------------- + | small | large | +-------------|-------------------|-------------------| +total | 0.791 100.0 | 38.593 100.0 | + stepping | 0.738 93.3 | 36.978 95.8 | + matrix | 0.406 51.3 | 6.034 15.6 | + solve | 0.308 38.9 | 4.534 11.7 | + setup | 0.082 10.4 | 1.260 3.3 | + other | 0.016 2.0 | 0.240 0.6 | + state | 0.194 24.5 | 23.235 60.2 | + expsyn | 0.158 20.0 | 22.679 58.8 | + hh | 0.014 1.7 | 0.215 0.6 | + pas | 0.003 0.4 | 0.053 0.1 | + other | 0.019 2.4 | 0.287 0.7 | + current | 0.107 13.5 | 7.106 18.4 | + expsyn | 0.047 5.9 | 6.118 15.9 | + pas | 0.028 3.5 | 0.476 1.2 | + hh | 0.006 0.7 | 0.096 0.2 | + other | 0.026 3.3 | 0.415 1.1 | + events | 0.005 0.6 | 0.125 0.3 | + sampling | 0.003 0.4 | 0.051 0.1 | + other | 0.024 3.0 | 0.428 1.1 | + other | 0.053 6.7 | 1.614 4.2 | +----------------------------------------------------- +``` +