Brian2CUDA documentation
Introduction
Brian2CUDA is a Python package for simulating spiking neural networks on graphics processing units (GPUs). It is an extension of the spiking neural network simulator Brian2, which allows flexible model definitions in Python. Brian2CUDA uses the code generation system from Brian2 to generate simulation code in C++/CUDA, which is then executed on NVIDIA GPUs.
To use Brian2CUDA, add the following two lines of code to your Brian2 imports; your simulations will then run on a GPU:
from brian2 import *
import brian2cuda
set_device("cuda_standalone")
For more details on the code generation process and settings, read the Brian2 standalone device documentation.
Getting help and reporting bugs
If you need help with Brian2CUDA, please use the Brian2 discourse forum. If you think you found a bug, please report it in our issue tracker on GitHub.
Citing Brian2CUDA
If you use Brian2CUDA in your work, please cite:
Alevi, D., Stimberg, M., Sprekeler, H., Obermayer, K., & Augustin, M. (2022). Brian2CUDA: flexible and efficient simulation of spiking neural network models on GPUs. Frontiers in Neuroinformatics. https://doi.org/10.3389/fninf.2022.883700
Installation
Requirements
Linux operating system (support for Windows is planned, see https://github.com/brian-team/brian2cuda/issues/225)
NVIDIA CUDA GPU with compute capability 3.5 or higher
CUDA Toolkit with the nvcc compiler
Python version 3.6 or higher
Brian2: Each Brian2CUDA version is compatible with a specific Brian2 version. The correct Brian2 version is installed during the Brian2CUDA installation.
We recommend installing Brian2CUDA in a separate Python environment, either using a “virtual environment” or a “conda environment”. If you are unfamiliar with that, check out the Brian2 installation instructions.
Standard install
To install Brian2CUDA with a compatible Brian2 version, use pip:
python -m pip install brian2cuda
Updating an existing installation
Use the install command together with the --upgrade option:
python -m pip install --upgrade brian2cuda
This will also update the installed Brian2 version if required.
Development install
When you encounter a problem in Brian2CUDA, we will sometimes ask you to install Brian2CUDA's latest development version, which includes changes made after its last release.
We regularly upload the latest development version of Brian2CUDA to PyPI’s test server. You can install it via:
python -m pip install --upgrade --pre -i https://test.pypi.org/simple/ brian2cuda
Note that this requires that you already have a compatible Brian2 version and all of its dependencies installed.
If you have git installed, you can also install directly from GitHub:
python -m pip install git+https://github.com/brian-team/brian2cuda.git
If you want to either contribute to Brian2CUDA's development or regularly test its latest development version, you can directly clone the git repository at GitHub (https://github.com/brian-team/brian2cuda) and then run pip install -e /path/to/brian2cuda to install Brian2CUDA in "development mode". As long as the compatible Brian2 version doesn't change, updating the git repository is in general enough to keep up with changes in the code, i.e. it is not necessary to install it again. If the compatible Brian2 version changes, though, you need to manually update Brian2.
Testing your installation
Brian2CUDA tries to automatically detect your CUDA toolkit installation and choose the newest GPU on your system to run simulations. To test if this detection and your installation were successful, you can run this test simulation:
import brian2cuda
brian2cuda.example_run()
If the automatic CUDA and GPU detection fails or you want to manually change it, read Configuring the CUDA backend.
Running the Brian2CUDA test suite
If you have the pytest testing utility installed, you can run Brian2CUDA’s test suite:
import brian2cuda
brian2cuda.test()
This runs all standalone-compatible tests from the Brian2 test suite and additional Brian2CUDA tests (see the Brian2 developer documentation on testing for more details) and can take 1-2 hours, depending on your hardware. The test suite should end with "OK", showing a number of skipped tests but no errors or failures. If you want to run individual tests instead of the entire test suite (e.g. during development), check out the Brian2CUDA tools directory.
Configuring the CUDA backend
Brian2CUDA tries to detect your CUDA installation and by default uses the GPU with the highest compute capability. To query information about available GPUs, nvidia-smi (installed alongside NVIDIA display drivers) is used. For older driver versions (< 510.39.01), nvidia-smi doesn't support querying the GPU compute capabilities, and some additional setup might be required.
This section explains how you can manually set which CUDA installation or GPU to use, how to cross-compile Brian2CUDA projects on systems without GPU access (e.g. during remote development) and what to do when the compute capability detection fails.
Manually specifying the CUDA installation
If you installed the CUDA toolkit in a non-standard location or if you have a system with multiple CUDA installations, you may need to manually specify the installation directory.
Brian2CUDA tries to detect your CUDA installation in the following order:

1. Use the Brian2CUDA preference devices.cuda_standalone.cuda_backend.cuda_path
2. Use the CUDA_PATH environment variable
3. Use the location of nvcc to detect the CUDA installation folder (needs the nvcc binary in PATH)
4. Use the standard location /usr/local/cuda
5. Use the standard location /opt/cuda
If you set the path manually via the 1. or 2. option, specify the parent path to the nvcc binary (e.g. /usr/local/cuda if nvcc is in /usr/local/cuda/bin/nvcc). Depending on your system configuration, you may also need to set the LD_LIBRARY_PATH environment variable to $CUDA_PATH/lib64.
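As a minimal sketch (the installation path /opt/cuda-11.8 below is just an example, adjust it to your system), a non-standard CUDA installation can be selected like this:

import os

# Option 2: environment variables, set before the simulation is built
os.environ["CUDA_PATH"] = "/opt/cuda-11.8"
os.environ["LD_LIBRARY_PATH"] = "/opt/cuda-11.8/lib64"

from brian2 import *
import brian2cuda
set_device("cuda_standalone")

# Option 1: the Brian2CUDA preference, which takes precedence over CUDA_PATH
prefs.devices.cuda_standalone.cuda_backend.cuda_path = "/opt/cuda-11.8"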
Manually selecting a GPU to use
On systems with multiple GPUs, Brian2CUDA uses the first GPU with the highest compute capability as reported by nvidia-smi. If you want to choose a GPU manually, you can do so via the Brian2CUDA preference devices.cuda_standalone.cuda_backend.gpu_id.
Note
You can limit the visibility of NVIDIA GPUs by setting the environment variable CUDA_VISIBLE_DEVICES. This also limits the GPUs visible to Brian2CUDA. That means Brian2CUDA's devices.cuda_standalone.cuda_backend.gpu_id preference will index only those GPUs that are visible. E.g. if you run a Brian2CUDA script with prefs.devices.cuda_standalone.cuda_backend.gpu_id = 0 on a system with two GPUs via CUDA_VISIBLE_DEVICES=1 python your-brian2cuda-script.py, the simulation would run on the second GPU (with ID 1, visible to Brian2CUDA as ID 0).
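A short sketch of manual GPU selection (assuming a machine with at least two GPUs):

from brian2 import *
import brian2cuda
set_device("cuda_standalone")

# Use the GPU with ID 1 instead of the automatically selected one
prefs.devices.cuda_standalone.cuda_backend.gpu_id = 1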
Cross-compiling on systems without GPU access
On systems without GPU, Brian2CUDA will fail before code generation by default (since it tries to detect the compute capability of the available GPUs and the CUDA runtime version). If you want to compile your code on a system without GPUs, you can disable automatic GPU detection and manually set the compute capability and runtime version. To do so, set the following preferences:
prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
prefs.devices.cuda_standalone.cuda_backend.compute_capability = <compute_capability>
prefs.devices.cuda_standalone.cuda_backend.cuda_runtime_version = <runtime_version>
See devices.cuda_standalone.cuda_backend.detect_gpus, devices.cuda_standalone.cuda_backend.compute_capability and devices.cuda_standalone.cuda_backend.cuda_runtime_version.
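For example, a sketch for a target machine with a compute capability 7.5 GPU (e.g. an RTX 2080) and CUDA runtime 11.2 (both values are assumptions, substitute your own):

from brian2 import *
import brian2cuda
set_device("cuda_standalone")

prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
prefs.devices.cuda_standalone.cuda_backend.compute_capability = 7.5
prefs.devices.cuda_standalone.cuda_backend.cuda_runtime_version = 11.2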
Detecting GPU compute capability on systems with outdated NVIDIA drivers
We use nvidia-smi to query the compute capability of GPUs during automatic GPU selection. On older driver versions (< 510.39.01, the driver versions shipped with CUDA toolkit < 11.6), this is not supported. For those versions, we use the deviceQuery tool from the CUDA samples, which is by default installed with the CUDA Toolkit under extras/demo_suite/deviceQuery in the CUDA installation directory. For some custom CUDA installations, the CUDA samples are not included, in which case Brian2CUDA's GPU detection fails. In that case, you have three options. Do one of the following:
1. Update your NVIDIA driver.

2. Download the CUDA samples to a folder of your choice and compile deviceQuery manually:

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
# Run deviceQuery to test it
./deviceQuery

Now set the Brian2CUDA preference devices.cuda_standalone.cuda_backend.device_query_path to point to your deviceQuery binary, as shown in the sketch after this list.

3. Disable automatic GPU detection and manually provide the GPU ID and compute capability (you can find the compute capability of your GPU on https://developer.nvidia.com/cuda-gpus):

prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
prefs.devices.cuda_standalone.cuda_backend.compute_capability = <compute_capability>

See devices.cuda_standalone.cuda_backend.detect_gpus and devices.cuda_standalone.cuda_backend.compute_capability.
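For option 2, a sketch of setting the preference (the checkout path below is hypothetical; use wherever you cloned and compiled the samples):

prefs.devices.cuda_standalone.cuda_backend.device_query_path = (
    "/home/user/cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery"
)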
Known issues
In addition to the issues noted below, you can refer to our bug tracker on GitHub.
List of known issues:
Known issues when using multiple run calls

Changing the integration time step of Synapses with delays between run calls

Changing the integration time step of Synapses objects with transmission delays between successive run calls currently leads to the loss of spikes. This is the case for spikes that are queued for effect application but haven't been applied yet when the first run call terminates. See Brian2CUDA issue #136 for progress on this issue.
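A minimal sketch of the affected pattern (the model, sizes and dt values are illustrative, not from the issue report):

from brian2 import *
import brian2cuda
set_device("cuda_standalone", build_on_run=False)

source = NeuronGroup(1, 'dv/dt = 1/(5*ms) : 1', threshold='v > 1', reset='v = 0')
target = NeuronGroup(1, 'v : 1')
syn = Synapses(source, target, on_pre='v += 1', delay=2*ms)
syn.connect()

run(10*ms)
defaultclock.dt = 0.2*ms  # changing dt while spikes are still queued ...
run(10*ms)                # ... can lose those spikes in the second run
device.build()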
Changing delays between run calls

Changing the delay of Synapses objects between run calls currently leads to the loss of spikes. This is the case when changing homogeneous delays or when switching between homogeneous and heterogeneous delays (e.g. Synapses.delay = 'j*ms' before the first run call and Synapses.delay = '1*ms' after the first run call). Changing heterogeneous delays between run calls is not affected by this bug and should work as expected (e.g. from synapses.delay = 'j*ms' to synapses.delay = '2*j*ms'). See Brian2CUDA issue #302 for progress on this issue.
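A short sketch of the affected pattern (illustrative model, not from the issue report):

from brian2 import *
import brian2cuda
set_device("cuda_standalone", build_on_run=False)

group = NeuronGroup(10, 'dv/dt = 1/(5*ms) : 1', threshold='v > 1', reset='v = 0')
syn = Synapses(group, group, on_pre='v += 0.1')
syn.connect()

syn.delay = 'j*ms'  # heterogeneous delays
run(10*ms)
syn.delay = 1*ms    # switching to homogeneous delays here can lose spikes
run(10*ms)
device.build()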
Using a different integration time for Synapses and its source NeuronGroup

There is currently a bug when using Synapses with homogeneous delays and choosing a different integration time step (dt) for any of its SynapticPathway objects and its associated source NeuronGroup. This bug does not occur when the delays are heterogeneous or when only the target NeuronGroup has a different clock. See Brian2CUDA issue #222 for progress on the issue. The following examples show which configurations trigger the bug:
from brian2 import *
group_different_dt = NeuronGroup(1, 'v:1', threshold='True', dt=2*defaultclock.dt)
group_same_dt = NeuronGroup(1, 'v:1', threshold='True', dt=defaultclock.dt)
# Bug: Source of pre->post synaptic pathway uses different dt than synapses
# and synapses have homogeneous delays
synapses = Synapses(
group_different_dt,
group_same_dt,
on_pre='v+=1',
delay=1*ms,
dt=defaultclock.dt
)
# No bug: Synapses have no delays
synapses = Synapses(
group_different_dt,
group_same_dt,
on_pre='v+=1',
dt=defaultclock.dt
)
# No bug: Synapses have heterogeneous delays
synapses = Synapses(
group_different_dt,
group_same_dt,
on_pre='v+=1',
dt=defaultclock.dt
)
synapses.delay = 'j*ms'
# No bug: Source of pre->post synaptic pathway uses the same dt as synapses
synapses = Synapses(
group_same_dt,
group_different_dt,
on_post='v+=1',
delay=1*ms,
dt=defaultclock.dt
)
SpikeMonitor and EventMonitor data is not sorted by indices

In all Brian2 devices, SpikeMonitor and EventMonitor data is first sorted by time and then by neuron index. In Brian2CUDA, the data is only sorted by time but not always by index within a fixed time point. See Brian2CUDA issue #46 for progress on this issue.
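If you need Brian2's ordering, a possible workaround (a sketch using NumPy, not part of the Brian2CUDA API; the model is illustrative) is to sort the recorded data after the run:

import numpy as np
from brian2 import *
import brian2cuda
set_device("cuda_standalone")

group = NeuronGroup(10, 'dv/dt = rand()/(5*ms) : 1 (constant over dt)',
                    threshold='v > 1', reset='v = 0')
mon = SpikeMonitor(group)
run(100*ms)

# np.lexsort sorts by the last key first: primary key time, secondary key index
order = np.lexsort((mon.i[:], mon.t[:]))
times_sorted = mon.t[:][order]
indices_sorted = mon.i[:][order]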
Single precision mode fails for variable names with two digits followed by a dot, or with scientific notation in the name

In single precision mode (set via prefs.core.default_float_dtype), Brian2CUDA replaces floating-point literals like 1.2, 1. or .4 in generated code with their single precision versions 1.2f, 1.f and .4f. Under some circumstances, the search/replace algorithm fails and performs a wrong string replacement. This is the case e.g. for variable names with two digits and a dot in their name, such as variable12.attribute, or when variable names have a substring that can be interpreted as a number in scientific notation, e.g. variable28e2, which has 28e2 as a substring. If such a wrong replacement occurs, compilation typically fails due to undeclared variables. See Brian2CUDA issue #254 for progress on the issue.
Brian2CUDA specific preferences
For information on the Brian2 preference system, read the Brian2 preference documentation. The following Brian2CUDA preferences are used in the same way.
List of preferences
Brian2CUDA preferences
devices.cuda_standalone.SM_multiplier
=1
The number of blocks per SM. By default, this value is set to 1.
devices.cuda_standalone.bundle_threads_warp_multiple
=False
Whether to round the number of threads used per synapse bundle during effect application (see devices.cuda_standalone.threads_per_synapse_bundle) to a multiple of the warp size. Round up to the next multiple if the preference is 'up', round down to the previous multiple if 'low', and don't round at all if False (default). If rounding down results in 0 threads, 1 thread is used instead.
devices.cuda_standalone.calc_occupancy
=True
Whether or not to use the CUDA occupancy API to choose num_threads and num_blocks.
devices.cuda_standalone.default_functions_integral_convertion
=float64
The floating-point precision to which integral types will be converted when passed as arguments to default functions that have no integral type overload in device code (sin, cos, tan, sinh, cosh, tanh, exp, log, log10, sqrt, ceil, floor, arcsin, arccos, arctan). NOTE: Conversion from 32bit and 64bit integral types to single precision (32bit) floating-point types is not type safe, and neither is conversion from 64bit integral types to double precision (64bit) floating-point types. In those cases the closest higher or lower (implementation defined) representable value will be selected.
devices.cuda_standalone.extra_threshold_kernel
=True
Whether or not to use an extra threshold kernel for resetting.
devices.cuda_standalone.launch_bounds
=False
Whether or not to use __launch_bounds__ to optimise register usage in kernels.
devices.cuda_standalone.no_post_references
=False
Set this preference if you don't need access to j in any synaptic code string and no Synapses object applies effects to postsynaptic variables. This preference is for memory optimization until unnecessary device memory allocations in synapse creation are fixed; it is only relevant if your network uses close to all memory.
devices.cuda_standalone.no_pre_references
=False
Set this preference if you don't need access to i in any synaptic code string and no Synapses object applies effects to presynaptic variables. This preference is for memory optimization until unnecessary device memory allocations in synapse creation are fixed; it is only relevant if your network uses close to all memory.
devices.cuda_standalone.parallel_blocks
=1
The total number of parallel blocks to use. If None, the number of parallel blocks equals the number of streaming multiprocessors on the GPU.
devices.cuda_standalone.profile_statemonitor_copy_to_host
=None
Profile the final device-to-host copy of StateMonitor data. This preference is used for benchmarking and assumes that there is only one active StateMonitor in the network. The value of this preference is the name of the recorded variable for which the device-to-host copy is profiled (e.g. 'v').
devices.cuda_standalone.push_synapse_bundles
=True
If True, synaptic events are propagated by pushing bundles of synapse IDs with the same delays into the corresponding delay queue. If False, each synapse of a spiking neuron is pushed into the corresponding queue individually. For very small bundle sizes (number of synapses with the same delay, connected to a single neuron), pushing single synapses can be faster. This option only has an effect for Synapses objects with heterogeneous delays.
devices.cuda_standalone.random_number_generator_ordering
=False
The ordering parameter (str) used to choose how the results of cuRAND random number generation are ordered in global memory. See cuRAND documentation for more details on generator types and orderings.
devices.cuda_standalone.random_number_generator_type
='CURAND_RNG_PSEUDO_DEFAULT'
Generator type (str) that cuRAND uses for random number generation. Setting the generator type automatically resets the generator ordering (prefs.devices.cuda_standalone.random_number_generator_ordering) to its default value. See cuRAND documentation for more details on generator types and orderings.
devices.cuda_standalone.syn_launch_bounds
=False
Whether or not to use __launch_bounds__ in synapses and synapses_push to optimise register usage in kernels.
devices.cuda_standalone.threads_per_synapse_bundle
='{max}'
The number of threads used per synapse bundle during effect application. This has to be a string that can be passed to Python's eval function. The string can use {mean}, {std}, {max} and {min} expressions, which refer to the statistics across all bundles, and the function 'ceil'. The result of the expression will be converted to the next lower int (e.g. 1.9 will be cast to 1). Example: '{mean} + 2 * {std}' will use the mean bundle size plus 2 times the standard deviation over bundle sizes, rounded to the next lower integer. If you want to round up instead, use 'ceil({mean} + 2 * {std})'.
devices.cuda_standalone.use_atomics
=True
Whether to try to use atomic operations for synaptic effect application. Since atomics avoid race conditions, effect application can be parallelised.
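A short sketch showing how some of these preferences are set in a script (the chosen values are illustrative, not recommendations):

from brian2 import *
import brian2cuda
set_device("cuda_standalone")

prefs.devices.cuda_standalone.use_atomics = True
prefs.devices.cuda_standalone.threads_per_synapse_bundle = 'ceil({mean} + 2 * {std})'
prefs.devices.cuda_standalone.random_number_generator_type = 'CURAND_RNG_PSEUDO_XORWOW'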
Preferences for the CUDA backend in Brian2CUDA
devices.cuda_standalone.cuda_backend.compute_capability
=None
Manually set the compute capability for which CUDA code will be compiled. Has to be a float (e.g. 6.1) or None. If None, the compute capability is chosen depending on the GPU in use.
devices.cuda_standalone.cuda_backend.cuda_path
=None
The path to the CUDA installation. If set, this preference takes precedence over the CUDA_PATH environment variable.
devices.cuda_standalone.cuda_backend.cuda_runtime_version
=None
The CUDA runtime version.
devices.cuda_standalone.cuda_backend.detect_cuda
=True
Whether to try to detect CUDA installation paths and version. Disable this if you want to generate CUDA standalone code on a system without CUDA installed.
devices.cuda_standalone.cuda_backend.detect_gpus
=True
Whether to detect the names and compute capabilities of all available GPUs. This needs access to the nvidia-smi and deviceQuery binaries.
devices.cuda_standalone.cuda_backend.device_query_path
=None
Path to CUDA's deviceQuery binary. Used to detect a GPU's compute capability.
devices.cuda_standalone.cuda_backend.extra_compile_args_nvcc
=['-w', '-use_fast_math']
Extra compile arguments (a list of strings) to pass to the nvcc compiler.
devices.cuda_standalone.cuda_backend.gpu_heap_size
=128
Size of the heap (in MB) used by malloc() and free() device system calls, which are used in the cudaVector implementation. cudaVectors are used to dynamically allocate device memory for SpikeMonitors and the synapse queues in the CudaSpikeQueue implementation for networks with heterogeneously distributed delays.
devices.cuda_standalone.cuda_backend.gpu_id
=None
The ID of the GPU that should be used for code execution. The default value is None, in which case the GPU with the highest compute capability and lowest ID is used. If the environment variable CUDA_VISIBLE_DEVICES is set, this preference will be interpreted as an ID among the visible devices (e.g. with CUDA_VISIBLE_DEVICES=2 and gpu_id=0 preference, GPU 2 will be used).
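For example, additional nvcc flags can be passed by overriding the default list (a sketch; -O3 is just an example flag, the first two entries are the defaults):

prefs.devices.cuda_standalone.cuda_backend.extra_compile_args_nvcc = ['-w', '-use_fast_math', '-O3']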
Performance considerations
Check out our Brian2CUDA paper for performance benchmarks and discussions.
If you have performance questions or want to share your experience with Brian2CUDA performance, feel free to post on the Brian2 discourse forum.
brian2cuda package
Package implementing the CUDA “standalone” Device and CodeObject.
Functions

example_run: Run a simple example simulation to test whether Brian2CUDA is correctly set up.
_version module
binomial module
CUDA implementation of BinomialFunction
codeobject module
Module implementing the CUDA “standalone” CodeObject. Brian2CUDA implements two different code objects. CUDAStandaloneCodeObject is the standard implementation, which does not use atomic operations but serializes synaptic effect application if race conditions are possible. CUDAStandaloneAtomicsCodeObject uses atomic operations, which allows parallel effect application even when race conditions are possible.
Exported members: CUDAStandaloneCodeObject, CUDAStandaloneAtomicsCodeObject
Classes

CUDAStandaloneAtomicsCodeObject: CUDA standalone code object which uses atomic operations for parallel execution
CUDAStandaloneCodeObject: CUDA standalone code object
cuda_generator module
Exported members: CUDACodeGenerator, CUDAAtomicsCodeGenerator, c_data_type
Classes

CUDACodeGenerator: C++ language with CUDA library
CUDAAtomicsCodeGenerator
cuda_prefs module
Preferences that relate to the brian2cuda interface.
device module
Module implementing the CUDA “standalone” device.
Classes

CUDAStandaloneDevice
Functions

Prepare a CodeObject for random number generation (RNG).
timedarray module
CUDA implementation of TimedArray
Subpackages
utils package
Utility functions for Brian2CUDA
gputools module
Tools to get information about available GPUs.
Functions

Return list of names of available GPUs, sorted by GPU ID as reported in …
Get the "best" GPU available.
Get compute capability of GPU with ID …
Return new dictionary of CUDA installation variables.
Detect the path to the CUDA installation (e.g. …).
Return CUDA runtime version (as float, e.g. …).
Return dictionary of selected GPU variables …
Return the path to the …
Reset detected CUDA installation.
Reset selected GPU.
Set global CUDA installation dictionary to …
Set global GPU selection dictionary to …
Select GPU for simulation, based on user preference …
logger module
Brian2CUDA’s logging system extensions
Exported members: suppress_brian2_logs

Functions

suppress_brian2_logs: Suppress all logs coming from brian2.
stringtools module
Brian2CUDA regex functions.
Functions

Append …
Replace double-precision floating-point literals in …