A few days ago, Pete Warden, whose work inspired me to get into tinyML, published a blog post titled “One weird trick to shrink convolutional networks for tinyML." In it, he talks about how we can replace a combination of convolutional and pooling layers with a single convolutional layer with a stride of 2. The advantage of this is twofold: firstly, you get the same output size in both cases but no longer need to store the output of the full-resolution convolutional layer, which saves a lot of memory (that intermediate activation is four times the size of the pooled output); and secondly, you perform fewer computes, so inference gets faster as well. However, Pete also points out that this method might result in a drop in accuracy, but with the decrease in resource usage, you can regain that accuracy by changing some other hyperparameters of your model.
Pete’s trick reminded me of some convolutional neural network optimizations that I’ve studied myself, and in this article I would like to share them. In particular I want to dive a bit deeper into three things that Pete talks about: memory, pooling and computes. Let’s start with computes.
Computes
When you perform any operation with floating point numbers, like an addition or multiplication, it is called a FLOP or floating point operation. In convolutional and fully connected neural network layers, we usually perform a multiplication followed by an addition. Since this is a fairly common combination of operations, they are clubbed together as a single MAC or Multiply and Accumulate operation. Depending on the hardware, 1 MAC can be considered to be 2 FLOPs.
Calculating the number of MACs in a layer can give us an idea of how computationally complex a layer is and how long it will take to execute it (more MACs → more complex).
Let’s consider a convolutional layer with 32 filters, each of size 3x3. If we feed it an input of 10x10x3, the output will have a shape of 8x8x32. It’s easy to calculate the number of MACs needed to execute this convolutional layer: to generate the 32 output feature maps of size 8x8, each of our 32 3x3 kernels has to slide over the image 8 times across its width and 8 times across its height. At each position, it performs 3x3x3 MACs (the extra 3 is for the number of input channels). This means that the total number of MACs is the product of the number of kernels, the kernel size, and the output feature map height and width, which in this case is 55,296 MACs:
MACs = N x DK x DK x C x WO x HO
= 32 x 3 x 3 x 3 x 8 x 8
= 55,296
where N is the number of kernels, DK is the kernel width or height, C is the number of input channels, and WO and HO are the output width and height.
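If you want to sanity-check these numbers, a few lines of plain Python are enough (the helper name conv_macs below is mine, just for illustration, not from any framework):

```python
def conv_macs(n_kernels, k, in_channels, out_h, out_w):
    # MACs for a standard convolution: N x DK x DK x C x HO x WO
    return n_kernels * k * k * in_channels * out_h * out_w

print(conv_macs(32, 3, 3, 8, 8))  # 55296
```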
If we increase the stride to 2, then the number of MACs reduces by a factor of 4 since the output width and height is now halved:
MACs = N x DK x DK x C x WO x HO
= 32 x 3 x 3 x 3 x 4 x 4
= 13,824
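With the same helper, the stride-2 case only changes the output size:

```python
print(conv_macs(32, 3, 3, 4, 4))  # 13824, a 4x reduction
```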
That might seem like a lot, but we can do better!
A depthwise separable convolution is a type of convolutional layer where we split a standard convolution into a depthwise convolution followed by a pointwise (1x1) convolution. The input and output shapes of the layer remain the same, but we perform far fewer MACs.
The number of MACs in the depthwise layer is DK x DK x C x WO x HO and for the pointwise layer it is N x C x WO x HO. For the same convolutional layer, our total MACs now becomes:
MACs = DK x DK x C x WO x HO + N x C x WO x HO
= 3 x 3 x 3 x 8 x 8 + 32 x 3 x 8 x 8
= 1,728 + 6,144
= 7,872
That is just 14.2% of the original number of MACs, usually with a much smaller drop in accuracy! In fact, your total savings will be a factor of 1/N + 1/DK².
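Here is the same back-of-the-envelope check for the depthwise separable version (again, the function is just an illustrative sketch):

```python
def depthwise_separable_macs(n_kernels, k, in_channels, out_h, out_w):
    depthwise = k * k * in_channels * out_h * out_w      # one k x k filter per input channel
    pointwise = n_kernels * in_channels * out_h * out_w  # N pointwise (1x1) filters over C channels
    return depthwise + pointwise

macs = depthwise_separable_macs(32, 3, 3, 8, 8)
print(macs)          # 7872
print(macs / 55296)  # ~0.142, i.e. 1/N + 1/DK^2 = 1/32 + 1/9
```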
Memory
Memory is often the biggest bottleneck in tinyML hardware. This is because microcontrollers do not have a lot of it, for instance the Arduino Nano 33 BLE Sense has just 256 KB of SRAM and 1 MB of flash memory. Secondly, reading and writing to memory is costly, both in terms of energy consumption and latency.
In neural networks, there are two things that need to be stored: the weights and the intermediate activations generated when you execute your network. If your hardware/software implementation loads all the weights into memory at initialization, then your maximum memory requirement will be the sum of your model size and the largest activation that is generated. Some implementations may instead choose to load each layer into memory only when it needs to be executed. This reduces your overall memory requirement, but comes at the cost of more energy or latency. Either way, you will want to make your network, its associated code and its activations as small as possible so that they all fit into memory.
Let’s say our convolutional layer from the previous example uses 8 bits for its weights and activations. What would be the memory required to execute it?
Each filter will have DK x DK x C weight values, so for N filters, our total number of weight values will be:
Weights = N x DK x DK x C
= 32 x 3 x 3 x 3
= 864
At 8 bits for each weight, that will take 864 x 8 = 6,912 bits or 864 bytes.
On the other hand, an equivalent depthwise separable convolutional layer will have only DK x DK x C + N x C weight values, or 123 values: only about 14% of the weights of our original convolutional layer.
Our original output shape was 8x8x32, i.e. 2,048 values, which take 2,048 bytes at 8 bits per activation. So, for the original convolutional layer, the weights make up about 30% of the total memory required for that layer (864 out of 864 + 2,048 bytes).
On the other hand, for the depthwise separable convolutional layer, the weights make up only 5.6% of the total layer memory requirement. And while increasing the stride to 2 reduces the activation memory requirement by a factor of 4, the standard convolutional layer’s weights still take the same amount of memory, so their share of the total memory requirement rises to about 62.8%!
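Putting those numbers together (assuming 8-bit weights and activations, so one byte per value; the helper names are again just for illustration):

```python
def conv_weights(n_kernels, k, in_channels):
    return n_kernels * k * k * in_channels                 # N x DK x DK x C

def depthwise_separable_weights(n_kernels, k, in_channels):
    return k * k * in_channels + n_kernels * in_channels   # DK x DK x C + N x C

act_stride1 = 8 * 8 * 32  # 2048 output values -> 2048 bytes
act_stride2 = 4 * 4 * 32  # 512 output values  -> 512 bytes

w_std = conv_weights(32, 3, 3)                 # 864 bytes
w_dws = depthwise_separable_weights(32, 3, 3)  # 123 bytes

print(w_std / (w_std + act_stride1))  # ~0.30  (standard conv, stride 1)
print(w_dws / (w_dws + act_stride1))  # ~0.056 (depthwise separable)
print(w_std / (w_std + act_stride2))  # ~0.63  (standard conv, stride 2)
```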
This of course does not take into account the memory needed to store any intermediate results generated while executing the layer, since that depends on how each layer’s execution flow is implemented in hardware/software. But even then, the number of intermediate values generated while executing a depthwise separable layer is far smaller than that generated by a vanilla convolutional layer.
Depthwise separable convolutions are supported in TFLite Micro as well as many other tinyML frameworks. It should also be easy to swap them in for your existing convolutional layers (since the input and output activation shapes stay the same) and retrain your model to get a sense of how the change affects your accuracy and performance.
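As a rough sketch of what that swap looks like in Keras (the layer sizes here are placeholders that mirror the example above, not a recommended architecture):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 10, 3))
# Standard version: tf.keras.layers.Conv2D(32, (3, 3), activation="relu")
# Depthwise separable drop-in replacement with the same 8x8x32 output shape:
outputs = tf.keras.layers.SeparableConv2D(32, (3, 3), activation="relu")(inputs)

model = tf.keras.Model(inputs, outputs)
model.summary()  # compare the parameter count against the Conv2D version
```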
Pooling
Pooling layers have been shown to either not improve or even degrade CNN accuracy. However, for tinyML applications, I really like them and use them frequently for two reasons: firstly, as a layer, they have no associated memory (since pooling is just an operation with no weights) and secondly, they are a computationally cheap way to downsample activations and hence reduce memory and computes when executing downstream layers. Further, in my experience, any drop in accuracy from using pooling is far outweighed by the decrease in inference latency and memory (though your mileage with accuracy may vary).
Unfortunately, most of the issues with using pooling layers in tinyML models come from our current hardware and software implementations. The biggest one is that we need to save all the activations from the previous layer to memory before we can apply a pooling layer. Depending on how big that activation is, this can be a huge chunk of memory and may even exceed the RAM capacity of tiny microcontrollers like the Arduino Nano.
In his blog, Pete cites this as the main reason for using strides, and I agree with him. However, he also mentions using a tiling architecture, where we process images in small sections, leading to smaller activations. These smaller activations could be stored in accumulators/registers (tiny memories inside the processing unit that are mostly used to store intermediate results), and convolution+pooling layer pairs could be fused into a single layer and executed really efficiently. A few custom hardware designs, like HyNNA, have already implemented this with really good results.
Another simple way to improve pooling performance is to use only average pooling layers. This is because dividing by 4 (for a 2x2 pooling filter) can be done with a 2-bit right shift, which is extremely quick and efficient in hardware. For example, 16 in binary is 10000 and 4 is 100, which is 16 right-shifted by 2 bits.
This again depends on the hardware/software implementation of these operations, which is why we mostly find it in custom-built hardware.
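To make the shift trick concrete, here is the idea in a few lines (a sketch of the arithmetic only, not how any particular framework implements its pooling kernels):

```python
def avg_pool_2x2_shift(a, b, c, d):
    # Average of four 8-bit activations: divide-by-4 replaced with a 2-bit right shift
    return (a + b + c + d) >> 2

print(avg_pool_2x2_shift(16, 16, 16, 16))  # 16
print(avg_pool_2x2_shift(10, 20, 30, 40))  # 25
```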
In short, if you are looking to optimize your CNN models, I would encourage you to experiment with depthwise separable convolutions, since they occupy less space in memory and require far fewer MACs to execute.
Soham Chatterjee is a deep learning researcher with over three years of experience in researching, building, deploying and maintaining computer vision and NLP products. He’s also a course instructor for Udacity’s AI for Edge IoT Nanodegree.