# **Applied Machine Learning Days @ EPFL**

# State of the art in hardware-accelerated neural networks

**Frédéric Pétrot**, Lorena Anghel, Liliana Andrade

Univ. Grenoble Alpes, CNRS, Grenoble INP<sup>\*</sup>, TIMA, F-38000 Grenoble, France

- Atima.imag.fr/sls/people/petrot
- frederic.petrot@univ-grenoble-alpes.fr

<sup>\*</sup>Institute of Engineering Univ. Grenoble Alpes



## The Brain: the Ultimate Autonomous System

- 1.2 to 1.4 kg, 1260 cm<sup>3</sup>
- Consumes between 15 and 30 W
- 86×10<sup>9</sup> neurons, ≈ 10<sup>12</sup> synapses
- (Energy comes from eggs, honey and mushrooms)



J. Vitti and D. Silverman, "Bart the Genius", The Simpsons, 1990

## Human Brain Project (EU Flagship)

- 16,000 neurons per 1 W chip
- 5.375 MW/brain
- €1.2×10<sup>9</sup> from Europe: ≈ 1.4 euro cents per neuron
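
The two derived figures on this slide follow from simple arithmetic on the quoted numbers. The sketch below reproduces them; the constant names are illustrative, and it assumes a brain-scale system is obtained by simply replicating 1 W chips, with no overhead for interconnect or cooling.

```python
# Figures quoted on the slides.
NEURONS_IN_BRAIN = 86e9        # neurons in a human brain
NEURONS_PER_CHIP = 16_000      # neurons emulated per chip
WATTS_PER_CHIP = 1.0           # power drawn by one chip
BUDGET_EUR = 1.2e9             # European funding

# Assumption: chips are replicated with no extra system overhead.
chips_needed = NEURONS_IN_BRAIN / NEURONS_PER_CHIP
power_mw = chips_needed * WATTS_PER_CHIP / 1e6
cents_per_neuron = BUDGET_EUR / NEURONS_IN_BRAIN * 100

print(f"{chips_needed:.0f} chips, {power_mw} MW per brain")
print(f"{cents_per_neuron:.2f} euro cents per neuron")
```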



# **Classical Deep Neural Network (DNN) Topology**



$$u_j = \sum_{i=1}^{n} x_i w_{ij}$$

$$y_j = \operatorname{act}(u_j + b_j)$$
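
The weighted sum and activation above can be sketched in NumPy, with one column of the weight matrix per output neuron. This is a minimal illustration, not code from the talk: the function name `dense_layer`, the example values, and the choice of ReLU as `act` are all assumptions.

```python
import numpy as np

def relu(v):
    """One possible activation function (an assumption for this example)."""
    return np.maximum(v, 0.0)

def dense_layer(x, w, b, act=relu):
    """Fully connected layer: x has n inputs, w is (n, m), b has m biases."""
    u = x @ w              # weighted sum of the inputs for each output neuron
    return act(u + b)      # add the bias, then apply the activation

x = np.array([1.0, 2.0, 3.0])
w = np.array([[0.1, -0.2],
              [0.0,  0.5],
              [0.3,  0.1]])
b = np.array([0.5, -1.0])
y = dense_layer(x, w, b)
print(y)
```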

## Two phases

- Learning, off-line, floating point
- Inference, run-time, floating point

## Weights

- Values "found" during learning
- Values used during inference

## Issues

- Complex learning algorithms: GPU farms
- Limited computing power during inference

# **CNN Models: Accuracy, Operations and Weights**



A. Canziani, E. Culurciello, A. Paszke, "An Analysis of Deep Neural Network Models for Practical Applications", 2017 (EfficientNet-B0/B7 added by myself)

# Our focus? Inference!

## What's the interest, by the way?

## Local computation!

Energy

No router, cloud server, ...

- $\Rightarrow$  Huge constraint in Edge Computing
- $\Rightarrow$  Worse in IoT
- $\Rightarrow$  Transmitting data costs energy

Latency

Immediate response, no dead zone, no network reliability issue, ...

## Privacy

No storage on someone else's servers

Neither wired nor wireless sniffing possible

## Constraints on HW Accelerated Neural Networks

- Accuracy needs depend on the application
- Silicon resources:
  - $\Rightarrow$  Computations to perform
  - $\Rightarrow$  Weights storage and access
- Energy efficiency. Typical constraints:
  - 10-100 µW for wearables,
  - 10-100 mW for phones,
  - 1-10 W for plugged-in devices


## Inference involves a lot of computation...

High number of floating-point (FP) operations

0.5 G ≤ number of FLOPs per inference ≤ 40 G

Floating-point operations are costly in energy and area

(My 4-core i7 PC: ≈ 120 GFLOP/s ⇒ 30 GFLOP/s per core.)
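
To put these numbers in perspective, a best-case latency per inference can be estimated by dividing the operation count by the peak throughput. The sketch below assumes the quoted 120 GFLOP/s is fully sustained, which real workloads rarely achieve; the function name is illustrative.

```python
def best_case_latency_ms(flops_per_inference, peak_flops_per_s=120e9):
    """Lower bound on latency: assumes 100% utilization of the peak rate."""
    return flops_per_inference / peak_flops_per_s * 1e3

for label, flops in [("lightest model", 0.5e9), ("heaviest model", 40e9)]:
    print(f"{label}: {best_case_latency_ms(flops):.1f} ms per inference")
```

Even under this optimistic assumption, the heaviest networks take a third of a second per image on a desktop CPU.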



"Hardware Architectures for Deep Neural Networks", ISCA Tutorial, 2017

### Inference involves a lot of memory access...



| Operation           | Energy (pJ) |
|---------------------|-------------|
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

(Relative energy cost plotted on a logarithmic axis from 1 to 10<sup>4</sup>.)

"Hardware Architectures for DNN", ISCA Tutorial, 2017

- Memory stores millions of (64-bit) weights  $\Rightarrow$  4M (GoogLeNet), 60M (AlexNet), 130M (VGG)
- Memory access becomes the bottleneck
  - ⇒ Each op needs 2 operands and produces a result
- High power consumption follows
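
A rough estimate shows why the weight location matters. The sketch below uses the per-access energies from the table and the weight counts listed above. It assumes one 32-bit read per weight and that each weight is fetched exactly once per inference, which ignores reuse and caching; the names are illustrative.

```python
SRAM_PJ_PER_READ = 5.0     # 32b read from an 8KB on-chip SRAM
DRAM_PJ_PER_READ = 640.0   # 32b read from off-chip DRAM

WEIGHT_COUNTS = {"GoogLeNet": 4e6, "AlexNet": 60e6, "VGG": 130e6}

def fetch_energy_mj(n_weights, pj_per_read):
    """Energy in millijoules to read every weight once (simplifying assumption)."""
    return n_weights * pj_per_read * 1e-12 * 1e3

for name, n in WEIGHT_COUNTS.items():
    dram = fetch_energy_mj(n, DRAM_PJ_PER_READ)
    sram = fetch_energy_mj(n, SRAM_PJ_PER_READ)
    print(f"{name}: {dram:.2f} mJ from DRAM vs {sram:.3f} mJ from SRAM")
```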

## Alternatives: trade FLOPs for (some) accuracy loss

Simplify the operations

- Avoid sigmoid, batch normalization and similar costly operations
- FP arithmetic is not HW friendly
  - $\Rightarrow$  Use data types other than 64-bit floats

## Alternatives: trade bytes for (some) accuracy loss

- Use "small" data types
- Integrate many memory cuts with processing elements and use them wisely
- Integrate computation into the memory itself



K. Usher, "The Dwarf in the Dirt", Bones, 2009 (Energy comes from donuts + beer)







- Exploit weight sparsity to optimize memory usage and weight placement
- Use low-precision/high-efficiency computation along with on-chip memory storage of the weights
- Integrate computation inside the memory itself, directly where the data is stored

## Quantization levels and accuracy...



Kees Vissers, "A Framework for Reduced Precision Neural Networks on FPGAs", MPSOC, 2017



## Custom hardware for sparse matrix-vector multiplication

## **Deep Compression Technique**

## **Reduces storage requirements**

- Dedicated sparse matrix/vector representation
  - ⇒ Eliminates redundant connections
- Quantizes weights

## **Quantization of AlexNet weights**

- 256 shared weights (Conv layers)  $\Rightarrow$  8 bits
- 35x reduction (240MB  $\Rightarrow$  6.9MB)

## Weights stored into on-chip SRAM

 $\Rightarrow$  5 pJ/access (vs. 640 pJ/access off-chip DRAM)
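
The weight-sharing step can be illustrated with a small 1-D k-means: each weight is replaced by a short index into a codebook of shared values. This is only a sketch of that one step (pruning and entropy coding are left out), and the function name `share_weights`, the linear centroid initialization, the iteration count, and the random test matrix are assumptions made for the example.

```python
import numpy as np

def share_weights(weights, bits, iters=10):
    """Cluster weights into 2**bits shared values; return indices and codebook."""
    flat = weights.ravel().astype(np.float64)
    k = 2 ** bits
    centroids = np.linspace(flat.min(), flat.max(), k)  # assumed linear init
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = flat[idx == c]
            if members.size:                 # leave empty clusters untouched
                centroids[c] = members.mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape).astype(np.uint8), centroids

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in for a real layer
idx, codebook = share_weights(w, bits=4)
w_hat = codebook[idx]                              # reconstructed weights
print("distinct values after sharing:", np.unique(w_hat).size)
print("mean absolute error:", float(np.abs(w - w_hat).mean()))
```

Only the indices (4 bits each in this example) and the small codebook need to be stored, instead of one 32-bit float per weight.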



# High weights quantization with floating-point activations

(Figure: block diagram with Processor, ANN, INST and DATA memories, Interconnect.)

## Acceleration using Low-Precision (ternary) weights

- Only balanced ternary weights {-1, 0, +1} are used
- Floating-point accumulations are kept
- Multipliers are not needed

## Most of the FP operations operate on zero values



## **Demonstrated highest accuracy**

(Figure axis: non-zero fraction.)

- $\Rightarrow$  93% on the ImageNet object classification challenge
- $\Rightarrow$  Divides the number of FP operations by 3
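
Why ternary weights remove the multipliers, and why zeros save work, can be seen in a few lines. The sketch below is an illustration with made-up values, not the accelerator's datapath: a weight of +1 adds the input, -1 subtracts it, and 0 skips the accumulation entirely.

```python
def ternary_dot(x, w_ternary):
    """Dot product with weights in {-1, 0, +1} using only add and subtract."""
    acc = 0.0
    fp_ops = 0
    for xi, wi in zip(x, w_ternary):
        if wi == 1:
            acc += xi
            fp_ops += 1
        elif wi == -1:
            acc -= xi
            fp_ops += 1
        # wi == 0: nothing to accumulate, the operation is skipped
    return acc, fp_ops

x = [0.5, -1.25, 2.0, 3.0, 0.75, -0.5]   # hypothetical activations
w = [1, 0, -1, 0, 0, 1]                  # hypothetical ternary weights
acc, fp_ops = ternary_dot(x, w)
print(acc, fp_ops)
```

With half of the weights at zero in this example, only three of the six accumulations are actually performed.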

# **Exploit full-quantization**



## YodaNN: VLSI Implementation of binary-weights CNN Accelerator

## **Based on BinaryConnect approach**

- Binary weights  $\in \{-1, +1\}$
- 2's complement and multiplexers instead of multipliers
- Still full-fledged adders: 12-bit activations
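
The "2's complement and multiplexer" idea can be modelled on integers. In this sketch a weight bit selects between a 12-bit activation and its two's complement negation. The encoding (bit 1 for +1, bit 0 for -1), the helper names, and the use of unbounded Python integers for the accumulator are assumptions; the actual YodaNN datapath is not described in the slides.

```python
BITS = 12
MASK = (1 << BITS) - 1

def negate_12bit(word):
    """Two's complement negation of a 12-bit word: invert the bits, add one."""
    return (~word + 1) & MASK

def to_signed(word):
    """Interpret a 12-bit word as a signed integer."""
    return word - (1 << BITS) if word & (1 << (BITS - 1)) else word

def binary_weight_product(activation, weight_bit):
    """Multiplexer: pass the activation for +1, its negation for -1."""
    return activation if weight_bit else negate_12bit(activation)

activations = [5, 100, 2047]   # hypothetical 12-bit activations
weight_bits = [1, 0, 0]        # assumed encoding: 1 -> +1, 0 -> -1
total = sum(to_signed(binary_weight_product(a, b))
            for a, b in zip(activations, weight_bits))
print(total)
```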

## Large on-chip weights storage thanks to their size

Latch-based standard cell memory

## **Flexible accelerator**

- 7 kernel sizes supported
- $\Rightarrow$  61.2 TOP/s/W at 0.6 V





## FINN: Framework for building FPGA<sup>\*</sup> accelerators

Mapping binarized neural networks to hardware

All values  $\in \{-1, +1\}$

- Binary input activation
- Binary synapse weights
- Binary output activation

## Weights kept in on-chip memory

- $\Rightarrow$  Zynq-7000 FPGA technology
- $\Rightarrow$  80.1% accuracy for CIFAR-10
- $\Rightarrow$  Total system power 25W



- Dot-product between input vector and row of synaptic weight matrix
- Compares result to a threshold
- Produces single-bit output
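
The three steps above map naturally to bitwise operations. The sketch below uses the common XNOR-popcount formulation for binarized dot products; the bit encoding (1 for +1, 0 for -1), the function name, and the example values are assumptions for illustration.

```python
def bnn_neuron(act_bits, weight_bits, n, threshold):
    """Binarized neuron on n-bit words; returns (output bit, dot product)."""
    mask = (1 << n) - 1
    agree = ~(act_bits ^ weight_bits) & mask     # XNOR: 1 where signs match
    popcount = bin(agree).count("1")
    dot = 2 * popcount - n                       # matches minus mismatches
    return (1 if dot >= threshold else 0), dot

# Inputs +1,-1,+1,+1 and weights +1,+1,-1,+1 under the assumed encoding.
out_bit, dot = bnn_neuron(0b1011, 0b1101, n=4, threshold=0)
print(out_bit, dot)
```

Each dot product then costs one XNOR, one population count and one comparison, with no multiplier and no wide adder.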

Y. Umuroglu et al., "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference", FPGA, 2017

<sup>\*</sup>Field-Programmable Gate-Array: fine-grain reconfigurable hardware technology.

# Ternary weights and ternary activations





## FPGA Architecture for Ternary Neural Networks (TNN)

- Large-scale, VGG-like ternary CNN pipeline (NN-64 or NN-128)
- Neuron layer (NL) → memory (ternary weights) + neurons
- Ternarization layer (TL)  $\rightarrow$  ternary activations  $\in \{-1, 0, +1\}$
- $\Rightarrow$  Error rate 13.29% for CIFAR-10 (vs. 19.9% in FINN)
- $\Rightarrow$  Virtex-7 FPGA technology (VC709, Laaaaaaarge FPGA)
- $\Rightarrow$  1.62 TOP/s/W (vs 0.69 TOP/s/W in FINN)
- $\Rightarrow$  Throughput > 60k fps
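
A ternarization layer can be pictured as a comparison against two thresholds. The sketch below is a generic illustration: the threshold values and the function name are hypothetical, and the slides do not say how the thresholds are obtained in the cited design.

```python
def ternarize(value, low, high):
    """Map an accumulated neuron output to a ternary activation."""
    if value < low:
        return -1
    if value > high:
        return 1
    return 0

neuron_outputs = [-7, -1, 0, 2, 9]          # hypothetical integer sums
activations = [ternarize(v, low=-2, high=2) for v in neuron_outputs]
print(activations)
```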

A. Prost-Boucle et al., "High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression", ACM TRETS, 2018

# Computations using spikes, not bits



## TrueNorth: Integrated Chip for Spiking Neural Networks





Romain Brette, Computing with spikes

- Neurons communicate sending spikes
- Data encoded according to frequency, time, and spatial distribution of spikes
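
Frequency coding can be illustrated with a non-leaky integrate-and-fire neuron: the stronger the input, the more often the membrane potential crosses the threshold. This is a textbook-style sketch with assumed parameters, not the neuron model implemented in TrueNorth.

```python
def spike_count(input_current, steps=100, threshold=1.0):
    """Count the spikes emitted over a number of time steps."""
    potential = 0.0
    spikes = 0
    for _ in range(steps):
        potential += input_current     # integrate the input
        if potential >= threshold:     # fire when the threshold is reached
            spikes += 1
            potential = 0.0            # reset after the spike
    return spikes

for current in (0.25, 0.5):
    print(f"input {current}: {spike_count(current)} spikes in 100 steps")
```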

## Non-Von Neumann architecture

- ⇒ 4096 neuromorphic cores
- ⇒ 1 million digital neurons
- ⇒ 256 million synapses
- ⇒ 46 GSOP/s/W at 65 mW
- Asynchronous logic implementation
- Neuromorphic core = 256 neurons (PE) + 64k synapses (memory)
- Memory and computation physically close to each other
  - Reduction of power consumption

F. Akopyan et al., "Truenorth: Design and Tool Flow of a 65mW 1 million Neuron Programmable Neurosynaptic Chip", TCAD, 2015

# **Emerging Processing-In-Memory (PIM) Approaches**



## MAC operations using Non Volatile Memory (NVM)



## Computation accelerated by NVM arrays

- Synaptic weights are **not** stored in external memories
  - $\Rightarrow$  Zero transfers between memory and processing elements
  - $\Rightarrow$  Reduction of energy consumption

## Arrays of resistive RAM devices

- Resistances vary according to voltages
- No CMOS access devices but complex peripheral circuitry
- Analog, intrinsically approximate, computations



## Convolution Kernel in $12 \times 12$ Array



## From receptive field to feature maps...

- Receptive field = row voltages
- Convolution kernel = column of resistive devices
- Convolution operation = column current
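
In an idealized crossbar, each device contributes a current equal to its conductance times the row voltage (Ohm's law), and the currents on a column add up (Kirchhoff's current law). The sketch below models this as a vector-matrix product. The values are made up, and wire resistance, device non-linearity, and the handling of negative weights are ignored.

```python
import numpy as np

def column_currents(row_voltages, conductances):
    """I_j = sum_i V_i * G_ij for an ideal crossbar (rows x columns)."""
    return row_voltages @ conductances

v = np.array([0.2, 0.0, 0.1])        # receptive field as row voltages (V)
g = np.array([[1e-6, 2e-6],          # one kernel per column, in siemens
              [5e-6, 1e-6],
              [3e-6, 0.0]])
currents = column_currents(v, g)     # one convolution result per column (A)
print(currents)
```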

Interactive protocol programs kernels

## Single demonstration on digits of MNIST

L. Gao et al., "Demonstration of Convolution Kernel Operation on Resistive Cross-Point Array", IEEE Electron Device Letters, 2016

P. Chi et al., "PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory", ISCA, 2016

## $\Rightarrow$ Artificial synapse using NVM



- Modeling synapses between neurons
- Input and output potentials fired between neurons (spikes)
- Synaptic connections are potentiated or depressed

## Electronic Synapses Modeled by Phase Change Memory (PCM) Devices

- Programmed in different states (conductances)
- Compatible with CMOS components
- Scalable to nanometric dimensions





# **PCM as Synaptic Element**

## 500x661 PCM Crossbar Array

## Large scale implementation

- 3-layer perceptron
- 916 neurons
- 164865 synaptic connections
- ⇒ Accuracy: 82% (MNIST)
- ⇒ Low-power: at least 120x (vs. GPU)



S. B. Eryilmaz et al., "Experimental Demonstration of Array-Level Learning with Phase Change Synaptic Devices", IEDM, 2013

G.W. Burr et al., "Experimental Demonstration and Tolerancing of a Large-Scale NN (165000 Synapses) using PCM as the Synaptic Weight Element", 2015

G.W. Burr et al., "Large-Scale Neural Networks Implemented with NVM as the Synaptic Weight Element: Comparative Performance Analysis", IEDM, 2015

# **Take Away**

## Classical digital (CMOS) architectures

- Quantization and compression are the way to go
- A few bits for weights and activations are enough in many cases
- Learning needs to be adapted to the bit-width
- Networks and HW architectures are available today!







## Emerging NVM approaches

- Few large-scale experiments with NVM crossbar array implementations
- Comparing energy gains is difficult
- The flexibility of NVM to match a given ANN architecture is questionable
- Promising ongoing research subject
