Memory-Efficient Reversible Spiking Neural Networks

Hong Zhang¹, Yu Zhang¹,²*

¹State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, China
²Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province, Hangzhou, China

{hongzhang99, zhangyu80}@zju.edu.cn

Abstract

Spiking neural networks (SNNs) are potential competitors to artificial neural networks (ANNs) due to their high energy efficiency on neuromorphic hardware. However, SNNs are unfolded over simulation time steps during training. Thus, SNNs require much more memory than ANNs, which impedes the training of deeper SNN models. In this paper, we propose the reversible spiking neural network to reduce the memory cost of intermediate activations and membrane potentials during training. First, we extend the reversible architecture along the temporal dimension and propose the reversible spiking block, which can reconstruct the computational graph and recompute all intermediate variables of the forward pass with a reverse process. On this basis, we adapt state-of-the-art SNN models into reversible variants, namely the reversible spiking ResNet (RevSResNet) and the reversible spiking transformer (RevSFormer). Through experiments on static and neuromorphic datasets, we demonstrate that the memory cost per image of our reversible SNNs does not increase with the network depth. On the CIFAR10 and CIFAR100 datasets, our RevSResNet37 and RevSFormer-4-384 achieve comparable accuracies and consume 3.79× and 3.00× less GPU memory per image than their counterparts with roughly identical model complexity and parameters. We believe that this work can relax the memory constraints in SNN training and pave the way for training extremely large and deep SNNs.

Introduction

Spiking neural networks (SNNs), brain-inspired models based on binary spiking signals, are regarded as the third generation of neural networks (Maass 1997). Due to their sparsity and event-driven characteristics, SNNs can be deployed on neuromorphic hardware with low energy consumption. With the help of the backpropagation through time (BPTT) framework and surrogate gradients, directly trained SNNs are developing towards deeper and larger models. Advanced spiking architectures such as ResNet-like SNNs (Hu et al. 2021; Fang et al. 2021a; Zhang et al. 2023) and spiking vision transformers (Zhou et al. 2022, 2023) have been proposed in succession, indicating that SNNs are potential competitors to artificial neural networks (ANNs).

*Corresponding Author.

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
neurons need also to be stored for gradient computation. It is evident that a significant amount of memory consumption comes from storing intermediate activations and membrane potentials (Gomez et al. 2017). By reducing this part of the consumption, we can decouple the memory growth from the depth to a large extent.

In this work, we propose reversible spiking neural networks to reduce the memory cost of SNN training. The idea of reversibility is that each layer’s input variables and membrane potentials can be recomputed from its output variables. Therefore, even if no intermediate variables are stored, we can quickly reconstruct them through such a reversible transformation. We first extend the reversible architecture (Gomez et al. 2017) along the temporal dimension to adapt it to the BPTT training framework. On this basis, we propose the spiking reversible block, which is reversible along the spatial dimension and consistent along the temporal dimension. Then, we present the reversible spiking ResNet (RevSResNet) and reversible spiking transformer (RevSFormer), which are the reversible counterparts of MS ResNet (Hu et al. 2021) and Spikingformer (Zhou et al. 2022) (the latest ResNet-like and transformer-like SNNs). As shown in Figure 1, our networks consume much less memory per image than their counterparts. We verify the effect of RevSResNet and RevSFormer on static datasets (CIFAR10 and CIFAR100 (Krizhevsky, Hinton et al. 2009)) and neuromorphic datasets (CIFAR10-DVS (Li et al. 2017) and DVS128 Gesture (Amir et al. 2017)). The experiments show that RevSResNet and RevSFormer perform competitively with their non-reversible counterparts. At the same time, our reversible models significantly reduce memory cost during training, by $3.79 \times$ for RevSResNet37 and $3.00 \times$ for RevSFormer-4-384.

In summary, our contributions are three-fold:

- We analyze the reversibility of SNNs in the spatial and temporal dimensions and propose the spiking reversible block for the BPTT framework. On this basis, each block’s input and intermediate variables can be calculated from its outputs.
- We propose the reversible spiking ResNet (RevSResNet) and reversible spiking transformer (RevSFormer). We redesign a series of structures (such as downsample layers, reversible spiking residual block, and reversible spiking transformer block) to match the performance of the non-reversible state-of-the-art spiking counterparts.
- The experiments show that RevSResNet and RevSFormer have competitive performance to their non-reversible counterparts. At the same time, our reversible models significantly reduce memory cost during the training process.

**Related Works**

**Spiking Neural Networks**

SNNs utilize binary spikes to transmit and compute information, while the spiking neurons (Gerstner and Kistler 2002; Yao et al. 2022) play a crucial role in converting analog membrane potentials into binary spikes. There are two methods to obtain deep SNNs: ANN-to-SNN conversion and direct training. The ANN-to-SNN conversion methods (Diehl et al. 2015; Bu et al. 2022; Deng and Gu 2021; Wang et al. 2022) convert ANNs into SNNs with the same structure, which usually achieves high accuracy. However, this approach is limited because the resulting SNN requires a large number of time steps and is unable to handle neuromorphic data. The direct training method utilizes error backpropagation to train SNNs directly, where the BPTT framework (Shrestha and Orchard 2018) and surrogate gradient (Neftci, Mostafa, and Zenke 2019) techniques play a vital role. In recent years, directly trained spiking structures have been proposed successively, including ResNet-like models (Lee et al. 2020; Fang et al. 2021a; Hu et al. 2021; Zhang et al. 2023), spiking transformers (Zhou et al. 2022, 2023), NAS SNNs (Na et al. 2022; Kim et al. 2022), etc. These networks have lower latency, but their training requires more computing resources and memory than ANNs. In particular, the high memory cost limits the depth and the number of time steps of the network. Thus, this article aims to reduce the memory cost of SNN training based on reversible architectures.

**Reversible Architectures**

Reversible architectures are neural networks based on the NICE reversible transformation (Dinh, Krueger, and Bengio 2014). Reversible ResNet (Gomez et al. 2017) is the first work that utilizes it for CNN-based image classification tasks. It employs reversible blocks for memory-efficient network training. The core of its memory saving is that the intermediate activations can be reconstructed through the reverse process. After that, other works (Hascoet et al. 2019; Sander et al. 2021; Li and Gao 2021) have further iterated on the CNN-based reversible architectures. Recently, Mangalam et al. (2022) applied the reversible transformation to vision transformers and proposed Rev-ViT and Rev-MViT, two memory-efficient transformer structures. They found that reversible architectures have stronger inherent regularization than their non-reversible counterparts. In addition, the reversible transformation has also been adopted in other networks, such as U-Net (Brügger, Baumgartner, and Konukoglu 2019), masked convolutional networks (Song, Meng, and Ermon 2019), and graph neural networks (Li et al. 2021a).

It is worth noting that the above reversible architectures are reversible in the spatial dimension, in which the forward process propagates from shallow to deep layers, and the reverse process propagates from deep to shallow layers. Unlike them, reversible RNN (MacKay et al. 2018) is reversible in the temporal dimension. It calculates hidden states in the past by reversing them from the future. SNN is a network with both spatial and temporal dimensions, while our spiking reversible block is reversible along the spatial dimension and consistent along the temporal dimension.

**Approach**

In this section, we first explain the spiking neuron model, which is the preliminary of SNNs. Then, we present our proposed spiking reversible block. Furthermore, we apply it to spiking ResNet-like and transformer-like structures and propose the reversible spiking ResNet and reversible spiking transformer. Both support memory-efficient end-to-end training.

**Spiking Neuron Model**

The spiking neuron, which plays the role of activation function, is the fundamental unit used in SNNs. It converts analog membrane potentials to binary spiking signals. The leaky-integrate-and-fire (LIF) neuron is a widely used spiking neuron whose discrete-time dynamics can be formulated as follows:

$$H[t] = V[t-1] + \frac{1}{\tau_m} \left( I[t] - (V[t-1] - V_{reset}) \right) \tag{1}$$

$$S[t] = \Theta (H[t] - V_{th}) \tag{2}$$

$$V[t] = H[t](1 - S[t]) + V_{reset}S[t] \tag{3}$$

where $V[t]$ represents the membrane potential at time $t$, $H[t]$ is the hidden membrane potential before firing at time $t$, and $\tau_m$ is the membrane time constant. $I[t]$ is the synaptic current, i.e., the input from other neurons, and $\Theta(\cdot)$ is the Heaviside step function. Once $H[t]$ exceeds the firing threshold $V_{th}$, the neuron fires a spike, expressed by $S[t] = 1$, and the membrane potential $V[t]$ is reset to the reset potential $V_{reset}$.

In addition to LIF, we also use the integrate-and-fire (IF) neuron in this work, which is a simplified version of LIF. Its integration dynamics (Eq. 4) differ from those of LIF, while the fire and reset processes remain unchanged.

$$H[t] = V[t-1] + I[t] \tag{4}$$
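The neuron dynamics above can be sketched in a few lines of plain Python. This is only an illustrative scalar model, not the authors' implementation: the class name `SpikingNeuron`, the helper `run`, and the parameter values (`tau_m = 2.0` for the time constant in Eq. 1, `v_th = 1.0`, `v_reset = 0.0`) are assumptions chosen for the example, and the Heaviside function is assumed to fire when the potential reaches the threshold exactly.

```python
from dataclasses import dataclass


@dataclass
class SpikingNeuron:
    """Scalar LIF/IF neuron following the charge, fire, and reset steps."""
    tau_m: float = 2.0     # time constant in Eq. 1 (assumed value)
    v_th: float = 1.0      # firing threshold V_th (assumed value)
    v_reset: float = 0.0   # reset potential V_reset (assumed value)
    leaky: bool = True     # True: LIF charge (Eq. 1); False: IF charge (Eq. 4)
    v: float = 0.0         # membrane potential V[t-1] carried across time steps

    def step(self, current: float) -> int:
        if self.leaky:
            h = self.v + (current - (self.v - self.v_reset)) / self.tau_m  # Eq. 1
        else:
            h = self.v + current                                           # Eq. 4
        spike = 1 if h >= self.v_th else 0                                 # Eq. 2
        self.v = h * (1 - spike) + self.v_reset * spike                    # Eq. 3
        return spike

    def reset(self) -> None:
        self.v = self.v_reset


def run(neuron, currents):
    """Feed a sequence of input currents and collect the emitted spikes."""
    neuron.reset()
    return [neuron.step(i) for i in currents]


lif_spikes = run(SpikingNeuron(leaky=True), [1.5, 1.5, 1.5, 1.5])
if_spikes = run(SpikingNeuron(leaky=False), [0.6, 0.6, 0.6, 0.6])
print(lif_spikes, if_spikes)
```

With these assumed constants, both neurons need two time steps of constant input to cross the threshold, after which the hard reset restarts the integration.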

**Spiking Reversible Block**

**Computation Graph of Spiking Reversible Block** During standard backpropagation training, a single batch is computed with a forward-backward process. In contrast, for a reversible block, this computation becomes a forward-reverse-backward process. The added reverse process uses the output of the block to compute its input in reverse. We can then delete all inputs and intermediate variables after the forward process and save only the output. RevNet (Gomez et al. 2017) and RevRNN (MacKay et al. 2018) implement reversible blocks in the spatial and temporal dimensions, respectively.

For SNNs, as long as the network is designed in the two-residual-stream manner of Gomez et al. (2017), we can establish the reverse process in the spatial dimension. However, in the temporal dimension, reversibility would mean that the input potential of all neurons must be calculated from their output spikes, which is theoretically impossible for the spiking neurons in Eq. 1. Therefore, the spiking reversible block should be reversible along the spatial dimension and consistent along the temporal dimension. We extend the single-batch computation process to forward-reset-reverse-backward. The computation graphs for the forward and reverse processes are shown in Figure 2, where $\mathcal{F}$ and $\mathcal{G}$ can be set as arbitrary spiking modules composed of spiking neurons, convolutional layers, attention mechanisms, etc. Since spiking neurons have different membrane potentials at different time steps, $\mathcal{F}$ and $\mathcal{G}$ vary with time. We use $\mathcal{F}^t$ and $\mathcal{G}^t$ to represent these two modules at time step $t$.

In the forward process, the starting node of the graph is the input node at time step 1, and the end node is the output at time step $T$, where $T$ is the total number of time steps of the SNN. At each time step $t$, the output $Y^t$ is calculated using Eq. 5, as shown by the horizontal arrows in Figure 2a. From time step $t$ to $t+1$, the edges of the computation graph are established through the inherited membrane potentials of all spiking neurons in $\mathcal{F}$ and $\mathcal{G}$, as the red arrows illustrate in Figure 2a.

$$\begin{aligned}
Y_1^t &= X_1^t + \mathcal{F}^t (X_2^t) \\
Y_2^t &= X_2^t + \mathcal{G}^t (Y_1^t)
\end{aligned} \tag{5}$$

Before the reverse process, all spiking neurons are reset by restoring their membrane potentials to the initial state, which is named the reset process.

In the reverse process, the starting node of the graph is the output node at time step 1, and the end node is the input at time step \( T \). For each time step \( t \), the input \( X^t \) is calculated using Eq. 6, as shown by the reversed horizontal arrows in Figure 2b. From time step \( t \) to \( t + 1 \), as in the forward process, the edges of the computation graph are established through the inherited membrane potentials of all spiking neurons in \( \mathcal{F} \) and \( \mathcal{G} \), as the red arrows show in Figure 2b.

\[
\begin{aligned}
X^t_2 &= Y^t_2 - \mathcal{G}^t (Y^t_1) \\
X^t_1 &= Y^t_1 - \mathcal{F}^t (X^t_2)
\end{aligned} \tag{6}
\]
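The forward-reset-reverse cycle can be illustrated with a toy spiking reversible block in plain Python. This is a minimal sketch under stated assumptions, not the authors' code: `ToySpikingModule` (IF neurons followed by a fixed scalar weight) is a hypothetical stand-in for \( \mathcal{F} \) and \( \mathcal{G} \), and the inputs and weights are chosen as multiples of 0.25 so that floating-point addition and subtraction are exact. The reverse pass walks the time steps in the same order as the forward pass, first recovering \( X_2^t \) from \( Y_2^t \) and then \( X_1^t \) from \( Y_1^t \).

```python
class ToySpikingModule:
    """Hypothetical stand-in for F or G: IF neurons followed by a scalar weight."""

    def __init__(self, weight, size, v_th=1.0):
        self.weight = weight
        self.size = size
        self.v_th = v_th
        self.reset()

    def reset(self):
        # Reset process: membrane potentials return to the initial state.
        self.v = [0.0] * self.size

    def __call__(self, x):
        out = []
        for i, xi in enumerate(x):
            h = self.v[i] + xi                  # IF charge
            s = 1.0 if h >= self.v_th else 0.0  # fire
            self.v[i] = h * (1.0 - s)           # hard reset to 0 after a spike
            out.append(self.weight * s)
        return out


def forward(f, g, xs1, xs2):
    ys1, ys2, trace = [], [], []
    for x1, x2 in zip(xs1, xs2):
        y1 = [a + b for a, b in zip(x1, f(x2))]  # Y1 = X1 + F(X2)
        y2 = [a + b for a, b in zip(x2, g(y1))]  # Y2 = X2 + G(Y1)
        ys1.append(y1)
        ys2.append(y2)
        trace.append((tuple(f.v), tuple(g.v)))   # membrane potentials after step t
    return ys1, ys2, trace


def reverse(f, g, ys1, ys2):
    xs1, xs2, trace = [], [], []
    for y1, y2 in zip(ys1, ys2):                 # same temporal order as forward
        x2 = [a - b for a, b in zip(y2, g(y1))]  # X2 = Y2 - G(Y1)
        x1 = [a - b for a, b in zip(y1, f(x2))]  # X1 = Y1 - F(X2)
        xs1.append(x1)
        xs2.append(x2)
        trace.append((tuple(f.v), tuple(g.v)))
    return xs1, xs2, trace


f = ToySpikingModule(0.5, 2)
g = ToySpikingModule(0.25, 2)
xs1 = [[0.5, -0.25], [1.0, 0.75], [0.25, 0.5]]   # T = 3 time steps, 2 features
xs2 = [[0.75, 1.25], [0.5, 0.25], [1.5, -0.5]]

ys1, ys2, fwd_trace = forward(f, g, xs1, xs2)
f.reset()
g.reset()
rec1, rec2, rev_trace = reverse(f, g, ys1, ys2)
print(rec1 == xs1, rec2 == xs2, rev_trace == fwd_trace)
```

Because each module sees the same inputs in the same temporal order in both passes, the recorded membrane potentials match step by step, which is the behaviour the reset process is meant to guarantee. With general floating-point values the reconstruction would only be exact up to rounding.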

**Learning without Caching Intermediate Variables**

During network training, the backward process is essential for updating the network weights. Consider the presynaptic weight \( W_l \) of a spiking neuron in the \( l \)-th layer. Its gradient is calculated as follows:

\[
\frac{\partial L}{\partial W_l} = \sum_t \left( \frac{\partial L}{\partial S^t_l} \frac{\partial S^t_l}{\partial U^t_l} + \frac{\partial L}{\partial U^{t+1}_l} \frac{\partial U^{t+1}_l}{\partial U^t_l} \right) \frac{\partial U^t_l}{\partial W_l} \tag{7}
\]

where \( S^t_l \) and \( U^t_l \) are the output spike (activation) and membrane potential at time step \( t \), which are calculated using the spiking neuron dynamics. The gradient calculation therefore requires the output spikes and membrane potentials at all time steps. In fact, almost all intermediate variables in the forward process are needed in the backward process. Because of the sequential nature of the network, all intermediate variables for all layers at all time steps must be stored. Thus, peak memory usage becomes linearly dependent on the network depth \( D \) and the number of time steps \( T \). Its space complexity is \( O(D \cdot T) \).

For the training of the spiking reversible block, we establish Theorem 1, which states that all intermediate variables in the forward process can be recomputed from the output in the reverse process. Then, only the output \( Y \) needs caching in the forward process. Furthermore, if spiking reversible blocks are sequentially placed, we only need to store the output of the last block. Before the backward process of any block, we can recompute all its intermediate variables from its output. In this process, the peak memory usage is the memory required for a single block, whose space complexity is \( O(T) \). Since directly trained SNNs often have a relatively small \( T \) (such as 4), the peak memory usage during training is much smaller.
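The claim that a chain of reversible blocks only needs the last output can be sketched as follows. This is a toy illustration under assumptions, not the training code of the paper: `ToySpikingModule` and `ToyReversibleBlock` are hypothetical names, each module is an IF neuron layer with a fixed scalar weight, and the values are multiples of 0.25 so that the arithmetic is exact. The backward sweep walks the blocks from last to first, and each block rebuilds its input from its output, so only one block's variables are live at a time.

```python
class ToySpikingModule:
    """Hypothetical stand-in for F or G: IF neurons followed by a scalar weight."""

    def __init__(self, weight, size, v_th=1.0):
        self.weight, self.size, self.v_th = weight, size, v_th
        self.reset()

    def reset(self):
        self.v = [0.0] * self.size

    def __call__(self, x):
        out = []
        for i, xi in enumerate(x):
            h = self.v[i] + xi
            s = 1.0 if h >= self.v_th else 0.0
            self.v[i] = h * (1.0 - s)
            out.append(self.weight * s)
        return out


class ToyReversibleBlock:
    def __init__(self, wf, wg, size):
        self.f = ToySpikingModule(wf, size)
        self.g = ToySpikingModule(wg, size)

    def reset(self):
        self.f.reset()
        self.g.reset()

    def forward(self, xs):
        """xs is a list over time steps of (x1, x2) pairs."""
        self.reset()
        ys = []
        for x1, x2 in xs:
            y1 = [a + b for a, b in zip(x1, self.f(x2))]
            y2 = [a + b for a, b in zip(x2, self.g(y1))]
            ys.append((y1, y2))
        return ys

    def reverse(self, ys):
        """Reset, then rebuild the block input for every time step."""
        self.reset()
        xs = []
        for y1, y2 in ys:
            x2 = [a - b for a, b in zip(y2, self.g(y1))]
            x1 = [a - b for a, b in zip(y1, self.f(x2))]
            xs.append((x1, x2))
        return xs


depth = 6
blocks = [ToyReversibleBlock(0.5, 0.25, 2) for _ in range(depth)]
inputs = [([0.5, -0.25], [0.75, 1.25]),
          ([1.0, 0.75], [0.5, 0.25]),
          ([0.25, 0.5], [1.5, -0.5])]

# Forward sweep: each block's output replaces the previous one, so only the
# output of the last block survives.
act = inputs
for blk in blocks:
    act = blk.forward(act)
final_output = act

# Backward sweep: a real implementation would run each block's backward pass
# right after reconstructing that block's input.
act = final_output
for blk in reversed(blocks):
    act = blk.reverse(act)
reconstructed = act
print(reconstructed == inputs)
```

In this sketch the number of cached activations is independent of `depth`, while the recomputation adds roughly one extra forward pass per block.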

**Theorem 1** Consider a spiking reversible block with \( T \) time steps. If the forward and reverse functions are formulated as Eq. 5 and Eq. 6, and the outputs of the forward process are fed into the reverse process, then \( X^t \), \( Y^t \), and all intermediate variables (including the intermediate activations and membrane potentials) in \( \mathcal{F}^t \) and \( \mathcal{G}^t \) in the forward process are identical to those in the reverse process.

**Proof.** The proof of Theorem 1 is presented in the Appendix (Zhang and Zhang 2023).

**Reversible Spiking Residual Neural Network**

ResNet (He et al. 2016) is one of the most popular deep convolutional neural networks (CNNs), and residual learning is also the best solution for CNN-based SNNs to tackle the gradient degradation problem (Fang et al. 2021a). With the help of our spiking reversible block, we propose the reversible spiking residual neural network, which completes the training of deep SNNs with much less memory usage.

**Basic Block** In ANN ResNet, the parameterized residual function is wrapped around a single residual stream in each block. We adapt it to the spiking reversible block and propose the two-residual-stream architecture in Figure 3. The input \( X \) is partitioned into two halves, \( X_1 \) and \( X_2 \), along the channel dimension. The forward process follows the transformation in Eq. 5 to ensure reversibility. We use two residual functions with the same structure as \( \mathcal{F} \) and \( \mathcal{G} \). To ensure that all operations are spike computations, we adopt the Activation-Conv-BatchNorm paradigm (Hu et al. 2021). Each residual function consists of two sequentially connected units, each comprising a multi-step spiking neuron, a convolutional layer, and batch normalization.
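The channel partition can be sketched with plain Python lists, where the outer list indexes channels. The helper names `split_channels` and `merge_channels` are hypothetical, and merging the two streams back by concatenation is an assumption made for illustration rather than a detail stated in the text.

```python
def split_channels(x):
    """Split a channel-first feature (list of channels) into two equal halves."""
    channels = len(x)
    if channels % 2 != 0:
        raise ValueError("channel count must be even to form two equal streams")
    half = channels // 2
    return x[:half], x[half:]


def merge_channels(x1, x2):
    """Concatenate the two streams along the channel dimension (assumed)."""
    return x1 + x2


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 channels, 2 values each
x1, x2 = split_channels(x)
print(len(x1), len(x2), merge_channels(x1, x2) == x)
```

Because both streams have the same shape, the residual functions on each stream can share the same structure, as described above.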

**Downsample Block** Due to the reversibility of the basic block, the feature dimensions of \( X \) and \( Y \) are identical. Therefore, residual functions \( F \) and \( G \) must be equidimensional in input and output spaces, which means that downsample layers (such as maxpooling or convolution with a stride of 2) cannot appear in spiking reversible blocks. To replace the downsampling basic blocks in ResNet, we set up...

**Network Architecture** The high-level structure of RevSResNet is the same as its non-reversible counterpart MS ResNet (Hu et al. 2021). The first convolution is regarded as the encoding layer which performs the initial downsampling. Then the spiking features propagate through the four stages with basic blocks. We set up a downsample block at the start of the second to fourth stages. The network ends with an average pooling and fully connected layer.

When spiking reversible blocks are sequentially connected (we call this a reversible sequence), we only need to store the output of the last block to complete the training. Apart from the downsample blocks, all stages in RevSResNet are reversible sequences. No matter how the number of blocks in a reversible sequence grows, the memory usage required by intermediate variables does not increase. The detailed architectures of RevSResNet are summarized in Table 1. RevSResNet-N denotes the network with N layers.

**Reversible Spiking Transformer**

Vision transformer has taken the accuracy of computer vision tasks to a new level. Combining our spiking reversible block with the spiking transformer (Zhou et al. 2023), we propose RevSFormer and prove the feasibility of reversible structures in transformer-like SNNs.

**Basic Block** Unlike ResNet, a spiking transformer block has two relatively independent residual functions: spiking self-attention (SSA) and the spiking MLP block (MLP). Each is wrapped in its own residual connection. We therefore take SSA and MLP as \( \mathcal{F} \) and \( \mathcal{G} \), respectively, and propose the basic block of RevSFormer shown in Figure 4. We adopt the same SSA and MLP structure as Spikingformer (Zhou et al. 2023), so our basic block’s computational complexity and number of parameters are consistent with the original spiking transformer block.

**Network Structure** The high-level structure of RevSFormer is the same as its non-reversible counterpart Spikingformer. The network includes a spiking tokenizer, \( L \) basic blocks, and a classification head. The spiking tokenizer computes the patch embedding of the image and projects the embedding into a fixed size with several convolutional and max-pooling layers. The classification head is composed of a spiking neuron and a fully connected layer. It is worth mentioning that all downsampling operations of RevSFormer are placed in the spiking tokenizer. Since there are no other downsampling or irreversible operations between the basic blocks, RevSFormer has only one reversible sequence composed of \( L \) basic blocks. As \( L \) grows, the memory required to store intermediate variables is expected to stay the same. The detailed configurations of RevSFormer are the same as those of Spikingformer. RevSFormer-\( L \)-\( D \) denotes the network with \( L \) blocks and embedding dimension \( D \).

**Experiments**

We evaluate the performance of our reversible structures on static datasets (CIFAR10 and CIFAR100) and neuromorphic datasets (CIFAR10-DVS and DVS128 Gesture). The metrics include parameters, time steps, FLOPS, memory per image, and the top-1 accuracy. The memory per image is measured as the peak GPU memory each image occupies during training. To ensure direct comparability with non-reversible counterparts, we match the model complex-

<table>
<thead>
<tr>
<th>Total layers</th>
<th>\( N = 5 + 4 \sum n_i \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>conv1</td>
<td>\( 3 \times 3, 128 \)</td>
</tr>
<tr>
<td>reversible sequence 1</td>
<td>\( (3 \times 3, 64) \times 2 \times n_1 \)</td>
</tr>
<tr>
<td>reversible sequence 2</td>
<td>\( (3 \times 3, 128)^* \times 2 \times n_2 \)</td>
</tr>
<tr>
<td>reversible sequence 3</td>
<td>\( (3 \times 3, 256)^* \times 2 \times n_3 \)</td>
</tr>
<tr>
<td>reversible sequence 4</td>
<td>\( (3 \times 3, 448)^* \times 2 \times n_4 \)</td>
</tr>
<tr>
<td>average pool, fc, softmax</td>
<td></td>
</tr>
</tbody>
</table>

Table 1: Architectures of RevSResNet. The stride of conv1 is set to 2 for downsampling. \(*\) means that a downsample block is set at the beginning of the reversible sequence. \( N \) represents the total number of layers.
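The layer-count formula in Table 1 can be checked with a short calculation. The function name `total_layers` is hypothetical, and the per-sequence block counts below are illustrative assumptions: the model names only fix the sum of the \( n_i \) (4 for a 21-layer network and 8 for a 37-layer one), not how the blocks are distributed over the four sequences.

```python
def total_layers(blocks_per_sequence):
    """N = 5 + 4 * sum(n_i), the layer-count formula from Table 1."""
    return 5 + 4 * sum(blocks_per_sequence)


# Assumed, evenly distributed block counts; only their sums are implied by N.
n21 = total_layers([1, 1, 1, 1])
n37 = total_layers([2, 2, 2, 2])
print(n21, n37)
```

The factor of 4 is consistent with a basic block holding two residual functions of two convolutional layers each.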

Figure 4: Basic block of RevSFormer. We consider spiking self-attention and MLP block as \( \mathcal{F} \) and \( \mathcal{G} \), respectively.

### Experiment on Static Datasets

CIFAR10 and CIFAR100 each provide 50,000 training and 10,000 test images. On these datasets, we establish two comparisons (MS ResNet18 vs. RevSResNet21, MS ResNet34 vs. RevSResNet37) for ResNet-like structures. For transformer-like structures, the network configuration and model complexity of RevSFormer are identical to those of Spikingformer. Results are shown in Table 2.

From an accuracy perspective, the performance of RevSResNet and RevSFormer is comparable to that of their counterparts with similar complexity. RevSResNet37 achieves 94.77% and 76.34% accuracy on CIFAR10 and CIFAR100, respectively, while RevSFormer-4-384 achieves 95.34% and 79.04% with a time step of 4. In most comparisons in Table 2, RevSResNet and RevSFormer are even slightly better than MS ResNet and Spikingformer, which may be due to the stronger inherent regularization of reversible architectures compared with vanilla networks (Mangalam et al. 2022).

From the memory perspective, our reversible SNNs are much more memory-efficient than vanilla SNNs. On one hand, RevSResNet37 and RevSFormer-4-384 consume 23.58 and 41.74 MB GPU memory per image, which is 3.79× and 3.00× lower than their counterparts. On the other hand, the memory usage does not increase with depth in our networks, which will be further discussed later.
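The reduction factors quoted above follow directly from the memory column of Table 2. The snippet below is a small arithmetic check; the dictionary and variable names are ours, while the numbers are copied from the table.

```python
# Peak GPU memory per image in MB, taken from Table 2.
memory_mb_per_image = {
    "MS ResNet18": 54.83,
    "RevSResNet21": 23.59,
    "MS ResNet34": 89.33,
    "RevSResNet37": 23.58,
    "Spikingformer-2-384": 83.05,
    "RevSFormer-2-384": 41.68,
    "Spikingformer-4-384": 125.06,
    "RevSFormer-4-384": 41.74,
}

# (non-reversible baseline, reversible counterpart) pairs compared in the text.
pairs = [
    ("MS ResNet18", "RevSResNet21"),
    ("MS ResNet34", "RevSResNet37"),
    ("Spikingformer-2-384", "RevSFormer-2-384"),
    ("Spikingformer-4-384", "RevSFormer-4-384"),
]

reductions = {
    rev: round(memory_mb_per_image[base] / memory_mb_per_image[rev], 2)
    for base, rev in pairs
}
for name, factor in reductions.items():
    print(f"{name}: {factor:.2f}x less memory per image")
```

Note that the reversible models' memory stays nearly constant between the shallower and deeper variants, so the reduction factor grows with the depth of the baseline.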

### Experiment on Neuromorphic Datasets

On the neuromorphic datasets, we conduct experiments with two different time steps, 10 and 16. We establish one network comparison (MS ResNet20 vs. RevSResNet24) for ResNet-like structures. For transformer-like structures, the network configurations are identical between the reversible and non-reversible structures. Results are shown in Table 3. The relative changes in accuracy and memory are similar to those on static datasets. Our RevSResNet and RevSFormer achieve memory usage reductions of 2.01× and 1.30×, respectively, and the magnitude of the reduction stays consistent across different time steps. In terms of performance, RevSResNet24 and RevSFormer-2-256 achieve 76.4% and 82.2% accuracy on the CIFAR10-DVS dataset with a time step of 16.

### Ablation Studies

#### Memory Usage vs. Depth

Theoretically, for a reversible sequence, the memory usage required by intermediate variables does not increase with the number of reversible blocks because we only need to save the output of the whole sequence. Thus, for RevSResNet with 4 reversible sequences and RevSFormer with 1 sequence, the memory usage per image should not increase with depth. Figure 1 plots the memory usage for our reversible SNNs and their counterparts. For ResNet-like structures, the relative memory saving increases up to 8.1× as the model goes deeper. For transformer networks, our RevSFormer-16-384 saves 9.1× GPU memory per image. This memory saving is expected to increase further with increasing depth.

#### Memory Usage vs. Time Step

Because an SNN is unfolded over $T$ time steps, the memory it requires is roughly $T$ times that of an ANN. Thus, the GPU memory required per image grows linearly with the total number of time steps $T$. Figure 5 shows the relationship between memory usage and time steps. For each model, the memory usage increases with a certain slope $m$. In our reversible SNNs, intermediate variables in the non-reversible parts (e.g., the downsample layers and the spiking tokenizer) and the output of each reversible sequence still need caching. Thus, memory usage is not decoupled from the number of time steps $T$.

<table>
<thead>
<tr>
<th>Methods</th>
<th>Architecture</th>
<th>Param (M)</th>
<th>Time Step</th>
<th>FLOPS (G)</th>
<th>Memory (MB/img)</th>
<th>CIFAR10 Top-1 Acc</th>
<th>CIFAR100 Top-1 Acc</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hybrid training (Rathi et al. 2020)</td>
<td>VGG-11</td>
<td>9.27</td>
<td>125</td>
<td>-</td>
<td>-</td>
<td>92.22</td>
<td>67.87</td>
</tr>
<tr>
<td>Diet-SNN (Rathi and Roy 2020)</td>
<td>ResNet-20</td>
<td>0.27</td>
<td>10</td>
<td>-</td>
<td>-</td>
<td>92.54</td>
<td>64.07</td>
</tr>
<tr>
<td>STBP (Wu et al. 2018)</td>
<td>CIFARNet</td>
<td>17.54</td>
<td>12</td>
<td>-</td>
<td>-</td>
<td>89.83</td>
<td>-</td>
</tr>
<tr>
<td>STBP NeuNorm (Wu et al. 2019)</td>
<td>CIFARNet</td>
<td>17.54</td>
<td>12</td>
<td>-</td>
<td>-</td>
<td>90.53</td>
<td>-</td>
</tr>
<tr>
<td>TSLS-BP (Zhang and Li 2020)</td>
<td>CIFARNet</td>
<td>17.54</td>
<td>5</td>
<td>-</td>
<td>-</td>
<td>91.41</td>
<td>-</td>
</tr>
<tr>
<td>STBP-tdB (Zheng et al. 2021)</td>
<td>ResNet-19</td>
<td>12.63</td>
<td>4</td>
<td>-</td>
<td>-</td>
<td>92.92</td>
<td>70.86</td>
</tr>
<tr>
<td>TET (Deng et al. 2022)</td>
<td>ResNet-19</td>
<td>12.63</td>
<td>4</td>
<td>-</td>
<td>-</td>
<td>94.44</td>
<td>74.47</td>
</tr>
<tr>
<td>DS-ResNet (Feng et al. 2022)</td>
<td>ResNet20</td>
<td>4.32</td>
<td>4</td>
<td>-</td>
<td>-</td>
<td>94.25</td>
<td>-</td>
</tr>
<tr>
<td>Spikingformer (Zhou et al. 2022)</td>
<td>Spikingformer-4-384</td>
<td>9.32</td>
<td>4</td>
<td>-</td>
<td>-</td>
<td>94.19</td>
<td>77.86</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Methods</th>
<th>Architecture</th>
<th>Param (M)</th>
<th>Time Step</th>
<th>FLOPS (G)</th>
<th>Memory (MB/img)</th>
<th>CIFAR10 Top-1 Acc</th>
<th>CIFAR100 Top-1 Acc</th>
</tr>
</thead>
<tbody>
<tr>
<td>MS ResNet (Hu et al. 2021)</td>
<td>MS ResNet18</td>
<td>11.22</td>
<td>4</td>
<td>2.22</td>
<td>54.83</td>
<td>94.40</td>
<td>75.06</td>
</tr>
<tr>
<td>RevSResNet (ours)</td>
<td>RevSResNet21</td>
<td>11.05</td>
<td>4</td>
<td>2.38</td>
<td>23.59 ↓ 2.32×</td>
<td>94.53</td>
<td>75.46</td>
</tr>
<tr>
<td>MS ResNet (Hu et al. 2021)</td>
<td>MS ResNet34</td>
<td>21.33</td>
<td>4</td>
<td>4.64</td>
<td>89.33</td>
<td>94.69</td>
<td>75.34</td>
</tr>
<tr>
<td>RevSResNet (ours)</td>
<td>RevSResNet37</td>
<td>23.59</td>
<td>4</td>
<td>4.66</td>
<td>23.58 ↓ 3.79×</td>
<td>94.77</td>
<td>76.34</td>
</tr>
<tr>
<td>Spikingformer (Zhou et al. 2023)</td>
<td>Spikingformer-2-384</td>
<td>5.76</td>
<td>4</td>
<td>2.79</td>
<td>83.05</td>
<td>95.12</td>
<td>77.96</td>
</tr>
<tr>
<td>RevSFormer (ours)</td>
<td>RevSFormer-2-384</td>
<td>5.76</td>
<td>4</td>
<td>2.79</td>
<td>41.68 ↓ 1.99×</td>
<td>95.29</td>
<td>78.04</td>
</tr>
<tr>
<td>Spikingformer (Zhou et al. 2023)</td>
<td>Spikingformer-4-384</td>
<td>9.32</td>
<td>4</td>
<td>3.70</td>
<td>125.06</td>
<td>95.35</td>
<td>79.02</td>
</tr>
<tr>
<td>RevSFormer (ours)</td>
<td>RevSFormer-4-384</td>
<td>9.32</td>
<td>4</td>
<td>3.70</td>
<td>41.74 ↓ 3.00×</td>
<td>95.34</td>
<td>79.04</td>
</tr>
</tbody>
</table>

Table 2: Comparison to prior works on static datasets, CIFAR10 and CIFAR100. Note that results of MS ResNet and Spikingformer are based on our implementation for a fair comparison. Bold values denote the memory usage of our reversible SNNs.

Table 3: Comparisons with prior works on neuromorphic datasets, CIFAR10-DVS and DVS128 Gesture. Note that results of MS ResNet and Spikingformer are based on our implementation for a fair comparison. Bold values denote the memory usage of our reversible SNNs.

Figure 5: Relationship between memory and time step.

However, through the reversible architecture, we greatly reduce the slope of memory usage growth, from 28.5 and 20.2 for the non-reversible SNNs to 9.6 and 5.3 for our reversible networks.
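The slope figures above translate into reduction factors with a one-line division each. The snippet assumes that the two slopes of the non-reversible models are listed in the same order as the two slopes of the reversible ones; the text does not state the unit of the slope, so none is assumed here, and the variable names are ours.

```python
# Slopes of memory usage versus time step, as quoted in the text.
slopes = [
    {"non_reversible": 28.5, "reversible": 9.6},
    {"non_reversible": 20.2, "reversible": 5.3},
]

# How many times flatter the reversible curve is for each (assumed) pairing.
slope_reductions = [
    round(pair["non_reversible"] / pair["reversible"], 2) for pair in slopes
]
print(slope_reductions)
```

Under this pairing, each additional time step costs the reversible networks roughly three to four times less memory than their non-reversible counterparts.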

### Conclusion

In this paper, we propose the reversible spiking neural network to reduce the memory cost of intermediate activations and membrane potentials during the training of SNNs. We first extend the reversible architecture along the temporal dimension and propose the reversible spiking block, which can reconstruct the computational graph of the forward pass with a reverse process. On this basis, we present the RevSResNet and RevSFormer models, which are the reversible counterparts of state-of-the-art SNNs. Through experiments on static and neuromorphic datasets, we demonstrate that the memory cost per image of our reversible SNNs does not increase with the network depth. In addition, RevSResNet and RevSFormer achieve comparable accuracies and consume much less GPU memory than their counterparts with roughly identical model complexity and parameters.
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

Acknowledgments

This work was supported by STI 2030-Major Projects 20212ZD0201403, in part by NSFC 62088101 Autonomous Intelligent Unmanned Systems.

References


