2<sup>nd</sup> Symposium on AI Center, December 8, 2018, The University of Aizu

# Artificial Intelligence Chips: From Data Centers to Edge and IoT Computing



### Abderazek Ben Abdallah Adaptive Systems Laboratory benab@u-aizu.ac.jp

### AI Hardware is ... everywhere

### Self-driving Car



Bottom Image source: edition.cnn.com

#### Smart Robots



Image source: roboticsbusinessreview.com

#### **Machine Translation**



Bottom Image source: missqt.com

#### Gaming



Bottom Image Source: newatlas.com



### AI Hardware is ... everywhere

### **Brain implant allows paralysed monkey to walk**

There really is a kind of intelligence inside the spinal cord. We are not just talking about reflexes that automatically activate muscles. In the spinal cord there are networks of neurons able to take their own decisions

-Grégoire Courtine-

Neuroscientist, Federal Institute of Technology, Lausanne

#### PARALYSED PRIMATES WALK

A wireless implant bypasses spinal-cord injuries in monkeys, enabling them to move their legs.





Nature volume 539, pages 284–288 (10 November 2016)

#### AI Revenue & GDP Growth Rate in 2035: Baseline Growth vs. AI Scenario



Artificial Intelligence Revenue, World Markets: 2016-2025

Annual growth rates in 2035 of gross value added (a close approximation of GDP), comparing baseline growth in 2035 to an artificial intelligence scenario where AI has been absorbed into the economy. Source: Accenture and Frontier Economics



Source: https://semiengineering.com/what-does-an-ai-chip-look-like/

#### AI Revenue & GDP Growth Rate in 2035: Baseline Growth vs. AI Scenario



Governments are competing to establish advanced AI research, seeing AI as a path to greater economic power and influence.



Source: CBINSIGHTS 2018

# Agenda

- Fundamental Trends
- AI: The 4<sup>th</sup> Industrial Revolution
- Survey of AI Hardware
  - Cloud AI Hardware, Chips
  - Mobile AI Chips
  - Edge and IoT AI Chips
  - Healthcare AI Chips
- Conclusions

### Moore's law is no longer providing more Compute



Source: Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Communications of the ACM, September 2018, Vol. 61 No. 9, Pages 50-55.

### Moore's law is no longer providing more compute



**Dennard scaling**: As transistors get smaller, their power density stays constant, so that power consumption stays in proportion with area: both voltage and current scale (downward) with length (WP).

Source: Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Communications of the ACM, September 2018, Vol. 61 No. 9, Pages 50-55.
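
The Dennard scaling definition can be made concrete with the classical scaling argument. As a sketch, assume a scaling factor $\kappa > 1$ (a symbol introduced here for illustration) by which linear dimensions and supply voltage shrink, and take dynamic power as $P = C V^2 f$:

$$
L \to \frac{L}{\kappa},\quad V \to \frac{V}{\kappa},\quad C \to \frac{C}{\kappa},\quad f \to \kappa f
$$

$$
P' = \frac{C}{\kappa}\left(\frac{V}{\kappa}\right)^{2}(\kappa f) = \frac{C V^{2} f}{\kappa^{2}} = \frac{P}{\kappa^{2}},
\qquad
A' = \frac{A}{\kappa^{2}}
\;\Rightarrow\;
\frac{P'}{A'} = \frac{P}{A}
$$

Under these assumptions power density is unchanged after a shrink. The argument holds only while voltage keeps scaling with length; when it no longer does, power density rises with each generation.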

### **Technology Transformation**

# Massive amounts of data are generated

### A new style of IT emerging



Source: https://practicalanalytics.files.wordpress.com/2012/10/newstyleofit.jpg


### **DNN Compute Requirements Are Steadily Growing**

| Metrics            | LeNet-5 | AlexNet  | VGG-16   | GoogLeNet (v1) | ResNet-50 |
|--------------------|---------|----------|----------|----------------|-----------|
| Top-5 error (%)    | n/a     | 16.4     | 7.4      | 6.7            | 5.3       |
| Input Size         | 28x28   | 227x227  | 224x224  | 224x224        | 224x224   |
| # of CONV Layers   | 2       | 5        | 16       | 21 (depth)     | 49        |
| Filter Sizes       | 5       | 3, 5, 11 | 3        | 1, 3, 5, 7     | 1, 3, 7   |
| # of Channels      | 1, 6    | 3 - 256  | 3 - 512  | 3 - 1024       | 3 - 2048  |
| # of Filters       | 6, 16   | 96 - 384 | 64 - 512 | 64 - 384       | 64 - 2048 |
| Stride             | 1       | 1, 4     | 1        | 1, 2           | 1, 2      |
| CONV: # of Weights | 2.6k    | 2.3M     | 14.7M    | 6.0M           | 23.5M     |
| CONV: # of MACs    | 283k    | 666M     | 15.3G    | 1.43G          | 3.86G     |
| # of FC Layers     | 2       | 3        | 3        | 1              | 1         |
| FC: # of Weights   | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| FC: # of MACs      | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| Total Weights      | 60k     | 61M      | 138M     | 7M             | 25.5M     |
| Total MACs         | 341k    | 724M     | 15.5G    | 1.43G          | 3.9G      |

Source: Joel Emer, ISCA Tutorial, 2017
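
The MAC counts in the table follow from layer shapes. The sketch below shows the arithmetic for one convolutional and one fully connected layer. The function names are hypothetical, and the example layer (227x227x3 input, 96 filters of 11x11, stride 4, no padding) is the commonly cited first AlexNet layer, assumed here rather than taken from the slide.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial size of a square convolution output."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_macs(in_size, in_channels, kernel, num_filters, stride=1, padding=0):
    """MACs for one square CONV layer: one k*k*C_in dot product per output value."""
    out = conv_output_size(in_size, kernel, stride, padding)
    return out * out * num_filters * kernel * kernel * in_channels


def fc_macs(n_inputs, n_outputs):
    """MACs for one fully connected layer: one MAC per weight."""
    return n_inputs * n_outputs


if __name__ == "__main__":
    # Assumed first AlexNet layer: 227x227x3 input, 96 filters, 11x11, stride 4.
    print(conv_output_size(227, 11, stride=4))
    print(conv_macs(227, 3, 11, 96, stride=4))
```

Because a fully connected layer uses each weight exactly once per input, its MAC count equals its weight count, which is why the two FC rows in the table carry identical numbers.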

### What does it mean?



### **Current State of the Art in Neural Algorithms HW Computing**



### CPUs, GPUs, FPGAs or ASICs?

The only tricky part is getting them to do AI computation quickly and efficiently.



### Hardware: Flexibility vs Efficiency

Deployment alternatives for deep neural networks (DNNs) and examples of their implementations. (Image courtesy of Microsoft.)

# Four Main Factors in Promoting AI/AI HW



### Hardware & Data Enable DNNs

AI model performance scales with dataset size and the number of model parameters, thus necessitating more compute.



Source: Dally, NIPS 2016 Workshop on Efficient Methods for Deep Neural Networks

### AI HW is inspired by Nature – Biological neuron

AI chips and systems are inspired by biology → parallel computation.

- Number of neurons: ~10<sup>11</sup>
- Number of synapses: ~10<sup>15</sup>
- Power consumption: ~20 W
- Operating frequency: 10~100 Hz
- Works in parallel: 10<sup>6</sup> parallelism vs. <10<sup>1</sup> for a PC (VN, von Neumann)
- Faster than current computers: e.g. simulating 5 s of brain activity takes ~500 s on a state-of-the-art supercomputer
- Synapse operation in the brain: 0.1~1 fJ/op, i.e. 1,000~10,000 TOPS/W = 1~10 POPS/W
- Latest digital DL processors: ~10 TOPS/W
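
The efficiency figures above are two views of the same quantity: energy per operation and operations per second per watt are reciprocals. A small sketch of the conversion, with hypothetical helper names:

```python
def tops_per_watt(joules_per_op):
    """Convert energy per operation (J/op) to tera-operations per second per watt."""
    ops_per_joule = 1.0 / joules_per_op  # 1 W = 1 J/s, so ops/J equals (ops/s)/W
    return ops_per_joule / 1e12


def joules_per_op(tops_w):
    """Inverse conversion: TOPS/W back to energy per operation in joules."""
    return 1.0 / (tops_w * 1e12)


if __name__ == "__main__":
    femto = 1e-15
    print(tops_per_watt(1.0 * femto))    # synapse at 1 fJ/op
    print(tops_per_watt(0.1 * femto))    # synapse at 0.1 fJ/op
    print(joules_per_op(10.0) / femto)   # digital processor at 10 TOPS/W, in fJ/op
```

On these numbers a ~10 TOPS/W digital processor spends about 100 fJ per operation, roughly two to three orders of magnitude more than the 0.1~1 fJ quoted for a synapse.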

#### ...there are many topologies for mimicking the brain functions




### **Different approaches to AI Chips**




### **Current AI Chip = Accelerator/Co-processor**



### **Accelerator Characteristics**








### ...Deep Learning is considered a sophisticated "rocket" of Machine Learning!!



- "Deep Learning" means using a neural network with <u>several layers of nodes</u> between input and output.
- The series of layers between input and output performs feature identification and processing in a series of stages, just as our brains seem to.

### **Example 1: Character Recognition on FPGA**



| Memory     | DSP Block | Power Consumption |
|------------|-----------|-------------------|
| 4,956 (1%) | 54 (77%)  | 286.84 mW         |



### **Example 2: Handwritten Digit Recognition on FPGA**



### **Example of Neural Network**



Different parameters define different functions.

### **Matrix Operation**



### **Neural Network**



 $\mathbf{y} = f(\mathbf{x})$ 

Parallel computing techniques are needed to speed up <u>matrix operations</u>
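
To illustrate why matrix operations dominate, the sketch below writes one fully connected layer of $\mathbf{y} = f(\mathbf{x})$ as a matrix-vector product plus a bias, followed by an activation. The names `dense_forward` and `relu` are hypothetical, and ReLU is an assumed activation, not one named on the slide.

```python
def relu(z):
    """Rectified linear unit, an assumed choice of activation."""
    return z if z > 0 else 0


def dense_forward(weights, bias, x):
    """One fully connected layer: y_i = relu(sum_j W[i][j] * x[j] + b[i])."""
    outputs = []
    for row, b_i in zip(weights, bias):
        # Each output neuron is an independent dot product, so all rows
        # can be computed in parallel on suitable hardware.
        z = sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        outputs.append(relu(z))
    return outputs


if __name__ == "__main__":
    W = [[1, -1], [2, 0]]
    b = [0, 1]
    print(dense_forward(W, b, [1, 2]))
```

Every row of the weight matrix is processed independently of the others, which is the parallelism that GPUs and dedicated accelerators exploit.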










# **DL is Computationally Expensive**

- The two phases of NN are called *training* (or learning) and *inference* (or prediction), and they refer to development versus production.
- The developer chooses the number of layers and the type of NN, and training determines the weights.
- Virtually all training today is in floating point.
- A step called *quantization* transforms floating-point numbers into narrow integers (often just 8 bits), which are usually good enough for inference.
- 8-bit integer multiplies can use 6X less energy and 6X less area than IEEE 754 16-bit floating-point multiplies, and the advantage for integer addition is 13X in energy and 38X in area [Dal16].

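
The quantization step described above can be sketched as follows. This is a minimal symmetric, per-tensor scheme chosen for illustration. Production toolchains differ in detail, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [qi * scale for qi in q]


if __name__ == "__main__":
    weights = [1.27, -0.64, 0.0]
    q, scale = quantize_int8(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

The rounding error per value is at most half a quantization step (`scale / 2`), which is the precision given up in exchange for the cheaper integer arithmetic.
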
#### A more biological version: LIF/SRM Model



## A more biological Model: Molecular Basis





## **Electronic device vs chemical device**





- Delivers the concentration difference of K+, Na+
- Action potential ~70 mV
  - Extremely low voltage operation
  - Noise problem
  - Multiple signal input/integration
- Spatial and temporal multiplexing → active sharing of the interconnect
- Chemical computing, extremely low operating voltage (<100 mV) → low power

#### **Hodgkin-Huxley Model**



## **Action Potential (Synapse) Storage**

The electrical resistance of a memristor is not constant but depends on the history of the current that has previously flowed through the device.



Voltage pulses can be applied to a memristor to change its resistance, just as spikes can be applied to a synapse to change its weight.

#### Wiring via AER – Address-Event Representation




## Spike-timing-dependent plasticity (STDP)



- Adjusts the strength of connections between neurons in the brain.
  - Adjusts the connection strengths based on the relative timing of a particular neuron's output and input action potentials.
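
A common way to express this timing dependence is the pair-based exponential STDP rule, sketched below. The amplitudes and time constants are illustrative assumptions, not values from the talk, and `stdp_delta_w` is a hypothetical name.

```python
import math


def stdp_delta_w(dt_ms, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one spike pair, with dt_ms = t_post - t_pre in milliseconds.

    Pre-before-post (dt > 0) strengthens the synapse; post-before-pre
    (dt < 0) weakens it. The effect decays exponentially with |dt|.
    """
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0


if __name__ == "__main__":
    for dt in (-50, -5, 5, 50):
        print(dt, stdp_delta_w(dt))
```

Spike pairs that are close in time change the weight the most, and the sign of the change depends only on which neuron fired first.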

## **Big Corps AI Chips**




[source: medium.com]

assistant Echo to make Alexa faster and smarter.

## There are two AI Chip Models: ANN and SNN

- The output of an ANN chip depends only on the current stimuli; the output of an SNN also depends on previous stimuli.
- The SNN/neuromorphic chip operates on biology-inspired principles to improve performance and increase energy efficiency.
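
The difference in how the two models treat past stimuli can be seen in a few lines. The sketch below contrasts a stateless threshold unit with a simplified discrete-time leaky integrate-and-fire (LIF) neuron. The leak factor, threshold, and function names are assumptions made for illustration.

```python
def stateless_unit(inputs, threshold=1.0):
    """ANN-style unit: each output depends only on the current input."""
    return [1 if x >= threshold else 0 for x in inputs]


def lif_run(inputs, leak=0.9, threshold=1.0):
    """Simplified LIF neuron: the membrane potential carries history forward."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = leak * v + x          # decay the old potential, add the new input
        if v >= threshold:
            spikes.append(1)      # fire and reset
            v = 0.0
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    print(stateless_unit([0.6, 0.6]), lif_run([0.6, 0.6]))
    print(stateless_unit([0.0, 0.6]), lif_run([0.0, 0.6]))
```

The second input is 0.6 in both runs, yet the LIF neuron fires only when the previous step has already charged its membrane potential, while the stateless unit never fires.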



## **Training & Inference**



## **Neuromorphic/SNN AI-Chips**

- Neuromorphic sensors: electronic models of retinas and cochleas.
- **Smart sensors**: tracking chips; motion, pressure, and auditory classification and localization sensors.
- Models of specific systems: e.g. lamprey spinal cord for swimming, electric fish lateral line.
- Pattern generators for locomotion or rhythmic behavior.
- Large-scale multi-core/chip systems for investigating models of neuronal computation and synaptic plasticity.







Neurogrid (Stanford) TrueNorth (IBM)





BrainScaleS/HBP (Heidelberg, Lausanne) SpiNNaker (Manchester)

#### Example: Loihi AI Chip, a 60-mm<sup>2</sup> chip fabricated in Intel's 14-nm process



M. Davies et al., "Loihi: A Neuromorphic Manycore Processor with On-Chip Learning," IEEE Micro, vol. 38, no. 1, pp. 82-99, January/February 2018

# **Cloud AI-Chips**

## **Accelerating enterprise AI with DL Cloud**



#### **Custom ASIC: Tensor Processing Unit (TPU)**

The **TPU**, deployed in datacenters since 2015, accelerates the inference phase of neural networks (NNs).



Floor Plan of TPU die



**TPU Printed Circuit Board** 

Source: TensorFlow.org



Source: In-datacenter Performance Analysis of a Tensor Processing Unit Jouppi et al, ISCA, 6/2017



Google's first Tensor Processing Unit (TPU) on a printed circuit board (left); TPUs deployed in a Google datacenter (right)

Source: <u>cloud.google.com</u>

- The TPU board can perform 92 TeraOps/s (TOPS). It is **15 to 30 times faster than CPUs and GPUs** tasked with the same work, with a 30- to 80-fold improvement in TOPS/W.
- The software used for comparison of systems was the TensorFlow framework.

#### **Experience Cloud TPU: https://github.com/tensorflow/tpu https://cloud.google.com/tpu/docs**

Source: In-datacenter Performance Analysis of a Tensor Processing Unit Jouppi et al, ISCA, 6/2017

#### **TPU is based on the Systolic Array Idea**

The matrix unit uses systolic execution to save energy by reducing reads and writes of the **Unified Buffer.** 



**Benefit:** Maximizes the computation done on each data element brought from memory.
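
The reuse idea can be illustrated with a small cycle-by-cycle simulation. This is a sketch of a generic output-stationary systolic array, picked because it is easy to follow. It is not a model of the TPU's actual matrix unit, and `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(A, B):
    """Simulate an M x N grid of PEs computing A (M x K) times B (K x N).

    Rows of A enter from the left and columns of B from the top, each skewed
    by one cycle per row/column. Every cycle, each PE multiplies the two
    operands passing through it, adds the product to its own accumulator,
    and forwards the operands to its right and lower neighbours.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]
    a_reg = [[0] * N for _ in range(M)]   # operands moving left to right
    b_reg = [[0] * N for _ in range(M)]   # operands moving top to bottom
    cycles = K + M + N - 2
    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:                      # left edge: fetch from memory
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:                           # otherwise reuse the neighbour's operand
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                      # top edge: fetch from memory
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
                acc[i][j] += new_a[i][j] * new_b[i][j]
        a_reg, b_reg = new_a, new_b
    return acc, cycles


if __name__ == "__main__":
    result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, cycles)
```

Each element of `A` and `B` is read from memory once at the array edge and then used by every PE it passes through, so the number of memory reads grows with the matrix size rather than with the number of multiply-accumulates.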

Floor plan of the TPU die (labels recovered from the figure): Host Interface 2%; Accumulators (4K x 256 x 32b = 4 MiB) 6%; Control 2%; Activation Pipeline 6%; DRAM ports (DDR3) 3% each; PCIe Interface 3%; Misc. I/O 1%.

#### Systolic data flow of the Matrix Multiply Unit.


Software has the illusion that each 256B input is read at once and instantly updates one location of each of 256 accumulator RAMs.

Similar to blood flow: heart → many cells → heart. Memory: heart; data: blood; PEs: cells.

Figure 1. Basic principle of a systolic system.

(H.T. Kung, "Why systolic architectures?", IEEE Computer, 1982)

Source: In-datacenter Performance Analysis of a Tensor Processing Unit Jouppi et al, ISCA, 6/2017

#### Systolic arrays for DNN acceleration (Ex. TPU)



Ref. Azghadi 2020, IEEE Trans. on Biomedical Circuits and Systems

#### **NN Training Works with Low-precision FP**

#### **fp32: Single-precision IEEE Floating Point Format**



#### **fp16: Half-precision IEEE Floating Point Format**

| Sign: 1 bit | Exponent: 5 bits | Mantissa (Significand): 10 bits |
|-------------|------------------|---------------------------------|
| S           | E E E E E        | M M M M M M M M M M             |

Range: 10^-8 to 65504

#### **bfloat16: Brain Floating Point Format**

| Sign: 1 bit | Exponent: 8 bits | Mantissa (Significand): 7 bits |
|-------------|------------------|--------------------------------|
| S           | E E E E E E E E  | M M M M M M M                  |

Range: (10<sup>-45</sup>) to (10<sup>38</sup>)

- bfloat16 represents the same range of numbers as fp32, just at a much lower precision.
- It turns out that we don't need all that precision for NN training, but we do actually need all the range.
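
Because bfloat16 keeps the sign and all 8 exponent bits of fp32, a simple conversion is to keep the upper 16 bits of the fp32 encoding. The sketch below uses plain truncation for clarity. Real hardware typically rounds instead, and the function names are hypothetical.

```python
import struct


def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern of x by truncating its fp32 encoding."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16          # keep sign, 8 exponent bits, top 7 mantissa bits


def from_bfloat16_bits(bits16):
    """Expand a bfloat16 pattern back to a Python float (zero-filled mantissa)."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


def bfloat16_round_trip(x):
    return from_bfloat16_bits(to_bfloat16_bits(x))


if __name__ == "__main__":
    print(bfloat16_round_trip(3.14159))   # precision is lost
    print(bfloat16_round_trip(1e38))      # but the magnitude survives
```

A value such as 1e38 is far beyond the fp16 maximum of 65504, yet it survives the bfloat16 round trip with an error of under one percent, which is the range-over-precision trade described above.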

#### NN Training Works with Low-precision FP

- One technique exploited by the new chips is using **low-precision**, often fixed-point, data of **eight bits** or even fewer, especially for inference.
- A major open question for hardware accelerators is how far precision can be pushed down without losing classification accuracy.
- Results from **Google, Intel, and others** show that such low-precision computations can be very powerful when the data is prepared correctly, which also opens opportunities for novel electronics.

#### What are the differences between the three TPUs?



## **TPU Performance on three Popular NNs**

- Multi-Layer Perceptrons (MLP)
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)

| Name  | LOC  | FC layers | Conv layers | Vector layers | Pool layers | Total layers | Nonlinear function | Weights | TPU Ops / Weight Byte | TPU Batch Size | % of Deployed TPUs in July 2016 (per model pair) |
|-------|------|-----------|-------------|---------------|-------------|--------------|--------------------|---------|-----------------------|----------------|--------------------------------------------------|
| MLP0  | 100  | 5         |             |               |             | 5            | ReLU               | 20M     | 200                   | 200            | 61%                                              |
| MLP1  | 1000 | 4         |             |               |             | 4            | ReLU               | 5M      | 168                   | 168            |                                                  |
| LSTM0 | 1000 | 24        |             | 34            |             | 58           | sigmoid, tanh      | 52M     | 64                    | 64             | 29%                                              |
| LSTM1 | 1500 | 37        |             | 19            |             | 56           | sigmoid, tanh      | 34M     | 96                    | 96             |                                                  |
| CNN0  | 1000 |           | 16          |               |             | 16           | ReLU               | 8M      | 2888                  | 8              | 5%                                               |
| CNN1  | 1000 | 4         | 72          |               | 13          | 89           | ReLU               | 100M    | 1750                  | 32             |                                                  |

#### Tensor Processing Unit (TPU) with MLP, CNN, RNN

| Model                    | Die mm<sup>2</sup> | Die nm | Die MHz | Die TDP | Die Measured Idle | Die Measured Busy | TOPS/s 8b | TOPS/s FP | GB/s | On-Chip Memory | Dies per Server | Server DRAM Size            | Server TDP | Server Measured Idle | Server Measured Busy |
|--------------------------|--------------------|--------|---------|---------|-------------------|-------------------|-----------|-----------|------|----------------|-----------------|-----------------------------|------------|----------------------|----------------------|
| Haswell E5-2699 v3       | 662                | 22     | 2300    | 145W    | 41W               | 145W              | 2.6       | 1.3       | 51   | 51 MiB         | 2               | 256 GiB                     | 504W       | 159W                 | 455W                 |
| NVIDIA K80 (2 dies/card) | 561                | 28     | 560     | 150W    | 25W               | 98W               |           | 2.8       | 160  | 8 MiB          | 8               | 256 GiB (host) + 12 GiB x 8 | 1838W      | 357W                 | 991W                 |
| TPU                      | NA*                | 28     | 700     | 75W     | 28W               | 40W               | 92        |           | 34   | 28 MiB         | 4               | 256 GiB (host) + 8 GiB x 4  | 861W       | 290W                 | 384W                 |

Benchmarked servers use Haswell CPUs, K80 GPUs, and TPUs. Haswell has 18 cores, and the K80 has 13 SMX processors.

Source: In-datacenter Performance Analysis of a Tensor Processing Unit, Jouppi et al, ISCA, 6/2017
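
As a rough back-of-the-envelope check, peak throughput per die can be divided by die TDP using the figures from the table. This is only a sketch: it uses peak numbers rather than measured workload performance, and it mixes 8-bit TOPS (Haswell, TPU) with floating-point TOPS (K80), so it should not be read as the performance/Watt metric reported in the cited paper. The dictionary and function names are hypothetical.

```python
# Peak throughput (TOPS) and die TDP (W), copied from the table.
# The K80 entry is floating-point TOPS; the other two are 8-bit TOPS.
DIES = {
    "Haswell E5-2699 v3": {"peak_tops": 2.6, "tdp_w": 145.0},
    "NVIDIA K80": {"peak_tops": 2.8, "tdp_w": 150.0},
    "TPU": {"peak_tops": 92.0, "tdp_w": 75.0},
}


def peak_tops_per_watt(peak_tops, tdp_w):
    """Peak tera-operations per second per watt of TDP."""
    return peak_tops / tdp_w


def ratio_vs(name, baseline):
    """How many times higher the peak TOPS/W of `name` is than `baseline`."""
    return peak_tops_per_watt(**DIES[name]) / peak_tops_per_watt(**DIES[baseline])


if __name__ == "__main__":
    for name, die in DIES.items():
        print(name, round(peak_tops_per_watt(**die), 4))
    print(round(ratio_vs("TPU", "Haswell E5-2699 v3"), 1))
    print(round(ratio_vs("TPU", "NVIDIA K80"), 1))
```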

#### **TPU Relative Performance/Watt**



Quantifying the performance of the TPU, our first machine learning chip: https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html


#### **NVIDIA's Volta GPU is Specially Designed for AI**

- NVIDIA's Volta GPU is specially designed for ML, and it offers 100 TFLOPS of DL performance, according to the company.
- GPUs were built for graphics workloads and *evolved* for high performance computing and AI workloads
- While GPUs are used extensively for training, they're not really needed for inference



NVIDIA's Volta GPU architecture is specially designed for AI. (Image courtesy of NVIDIA.)

#### The HGX-2, announced at NVIDIA GTC May 2018

Multi-precision computing platform for scientific computing (high precision) and AI workloads (low precision).



#### **NVIDIA's GPU Performance**

30x Higher Throughput than CPU Server on Deep Learning Inference



#### At Facebook, for example, the primary use case of GPUs is offline training rather than serving real-time data to users

#### **Offline training uses a mix of GPUs and CPUs**

| Service              | Resource                  | Training Frequency | Training Duration |
|----------------------|---------------------------|--------------------|-------------------|
| News Feed            | Dual-Socket CPUs          | Daily              | Many Hours        |
| Facer                | GPUs + Single-Socket CPUs | Every N Photos     | Few Seconds       |
| Lumos                | GPUs                      | Multi-Monthly      | Many Hours        |
| Search               | Vertical Dependent        | Hourly             | Few Hours         |
| Language Translation | GPUs                      | Weekly             | Days              |
| Sigma                | Dual-Socket CPUs          | Sub-Daily          | Few Hours         |
| Speech Recognition   | GPUs                      | Weekly             | Many Hours        |

Table II: Frequency, duration, and resources used by offline training for various workloads.

#### However, online inference is CPU-heavy

| Services             | Relative Capacity | Compute           | Memory |
|----------------------|-------------------|-------------------|--------|
| News Feed            | 100X              | Dual-Socket CPU   | High   |
| Facer                | 10X               | Single-Socket CPU | Low    |
| Lumos                | 10X               | Single-Socket CPU | Low    |
| Search               | 10X               | Dual-Socket CPU   | High   |
| Language Translation | 1X                | Dual-Socket CPU   | High   |
| Sigma                | 1X                | Dual-Socket CPU   | High   |
| Speech Recognition   | 1X                | Dual-Socket CPU   | High   |

Table III: Resource requirements of online inference workloads. (Source: Facebook Research)

#### **GPUs & ASICs Renting Cost Per Hour**



## **Mobile AI-Chips**

- Much of the data captured by the smartphone, including images, video, and sound, is unstructured.
- Training and inference are two vital components of AI on smartphones.
- Unlike structured data (information with a degree of organization), unstructured data makes compilation a time- and energy-consuming task.
- Huawei's Kirin 970 chipset comes with its own **neural processing unit (NPU).**
- Huawei has its own APIs that developers need to use to tap the power of the "neural" hardware.
- Google has its own mobile AI framework, TensorFlow Lite.



| Block (HiSilicon AI) | Detail                       |
|----------------------|------------------------------|
| 8-Core CPU           | up to 2.4GHz                 |
| 12-Core GPU          | Mali G72MP12                 |
| Kirin NPU            | 1.92T FP16 OPS               |
| Image DSP            | 512bit SIMD                  |
| Global-Mode Modem    | 1.2Gbps@LTE Cat18            |
| Dual Camera ISP      | with face & motion detection |
| 4K Video             | HDR10                        |
| HiFi Audio           | 32bit / 384k                 |
| LPDDR 4X             |                              |
| UFS 2.1              |                              |
| i7 Sensor Processor  |                              |
| Security Engine      | inSE & TEE                   |

Source: Huawei, 2017

Huawei Kirin 970

#### **Summary of Mobile AI Chips**

|                | System-on-chip (SoC)  | A11 Bionic                        | A12 Bionic                       | Kirin 970                | Kirin 980                                          |
|----------------|-----------------------|-----------------------------------|----------------------------------|--------------------------|----------------------------------------------------|
| Design         | Supplier              | Apple                             | Apple                            | HiSilicon                | HiSilicon                                          |
|                | Released date         |                                   | 9.12.2018                        |                          | 8.31.2018                                          |
|                | 64 Bit                | Yes                               | Yes                              | Yes                      | Yes                                                |
|                | Manufacturing process | 10 nm TSMC                        | 7 nm TSMC                        | 10 nm TSMC               | 7 nm TSMC                                          |
|                | Transistors           | 4.3 billion                       | 6.9 billion                      | 5.5 billion              | 6.9 billion                                        |
| CPU            | CPU Cores             | 2+4                               | 2+4                              | 4+4                      | 2+2+4                                              |
|                | Performance CPU       | Monsoon                           | New CPU × 2 + 15% performance    | Cortex-A73 × 2           | Cortex-A76 (2.6GHz) × 2 + Cortex-A76 (1.92GHz) × 2 |
|                | Efficiency CPU        | Mistral × 4                       | New CPU × 4 + 50% efficiency     | Cortex-A53 × 4           | Cortex-A55 × 4                                     |
|                | Max Clock (GHz)       | 2.4                               | N/A                              | 2.4                      | 2.6                                                |
| GPU            | GPU                   | Internally-designed GPU           | Internally-designed GPU          | Mali-G72 MP12            | Mali-G76                                           |
|                | GPU Cores             | 3                                 | 4                                | 12                       | 10                                                 |
| AI Accelerator | AI Processor          | 2-core Neural Engine              | 8-core Neural Engine             | NPU                      | Dual NPU                                           |
|                | Performance           | 600 billion operations per second | 5 trillion operations per second | 2005 pictures per minute | 4500 pictures per minute                           |
| Memory         | RAM Interface         | LPDDR4X                           | LPDDR4X                          | LPDDR4X                  | LPDDR4X                                            |
|                | RAM Frequency (MHz)   | N/A                               | N/A                              | 1833                     | 2133                                               |
|                | Max Bandwidth (GB/s)  | N/A                               | N/A                              | 29.9                     | 34.1                                               |

# **Edge and IoT AI-Chips**

~Processing Real-Time Data~

# **Edge Computing: Edge AI Chip**

• The need for no latency, higher security, faster computing, and less dependence on connectivity will drive the adoption of devices that

An on-device approach helps reduce latency for critical applications, lowers dependence on the cloud, and better manages the massive data generated by IoT devices.



Illustration of an Edge Computing Architecture

### **Examples of Edge AI Applications**



Source: CBINSIGHTS 2018

### **Examples of Edge AI Applications**



**Combining a 4K sensor with HDR and Intelligent Imaging.** Uses on-device vision processing to watch for motion, distinguish family members, and send alerts only if someone is not recognized or doesn't fit pre-defined parameters.



https://nest.com/cameras/nest-cam-iq-indoor/overview/

### Apple, Intel, and Google Edge AI-Chips

- <u>Apple</u> released its A11 chip with a "neural engine" for iPhone 8 and X. Apple claims it can perform machine learning tasks at up to 600B operations per second.
  - It powers new iPhone features like FaceID, which scans a user's face with an invisible spray of light, without uploading or storing any user data (or their face) in the cloud.
- <u>Intel</u> released an on-device vision processing chip called Myriad X (initially developed by Movidius, which Intel acquired in 2016).
  - Myriad X promises to take on-device deep learning beyond smartphones to devices like baby monitors and drones.
- <u>Google</u> proposed a similar concept with its "federated learning" approach, where some of the machine learning "training" can happen on your device. It is testing the feature in **Google keyboard**.
- AI on the edge reduces latency, but unlike the cloud, the edge has storage and processing constraints.
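
The "federated learning" idea in the list above can be illustrated with a small sketch. This is a toy version of federated averaging on a one-parameter linear model, not Google's implementation; the function names (`local_update`, `federated_average`), the learning rate, and the sample data are all hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately on one device


def local_update(weight: float, data: List[Sample],
                 lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for y ~ weight * x on one device's data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2.0 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_average(global_weight: float, devices: List[List[Sample]]) -> float:
    """One round: each device trains locally, then the server averages the results."""
    total = sum(len(d) for d in devices)
    updates = [(local_update(global_weight, d), len(d)) for d in devices]
    # Average weighted by local sample count; raw samples never leave a device.
    return sum(w * n for w, n in updates) / total


if __name__ == "__main__":
    # Hypothetical devices, each holding private samples of y = 3x.
    devices = [
        [(1.0, 3.0), (2.0, 6.0)],
        [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
    ]
    w = 0.0
    for _ in range(50):
        w = federated_average(w, devices)
    print(round(w, 3))
```

Only the locally trained weight is sent to the server in each round, which is what keeps the raw data on the device.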

# **Healthcare AI-Chips**

## **Healthcare AI-Chips**





SpiNNaker CPU

#### **Applications/Research Areas**

- Neuroscience: neuroinformatics; brain simulation
- Medicine: medical informatics; early diagnosis; personalized treatment
- Future computing: interactive supercomputing; neuromorphic computing

#### SpiNNaker-1 machine



- Many-core system
- 0.5 (1.0) million ARM cores
- Real-time simulator

#### BrainScaleS-1 machine



- Physical model system
- 4M neurons, 1B plastic synapses
- Accelerated emulator

#### SpiNNaker-2 prototype



- 144 Cortex M4F per chip
- 36 GIPS/Watt per chip
- x10 with constant power

#### BrainScaleS-2 prototype



- On-chip plasticity processor
- Flexible hybrid plasticity
- Active dendritic spatial structure

https://www.humanbrainproject.eu/en/

## **Healthcare AI-Chips**

### **The Human Brain Project**

An EU ICT Flagship project (€1B budget) with 80 partner institutes, led by Henry Markram, EPFL



The basic idea of the Human Brain Project: from science to infrastructures to science and innovation



https://www.humanbrainproject.eu/en/

... our work: Homeostatic Neuromorphic System

\*This is outside the scope of this talk.

### Our work - Homeostatic Neuromorphic System

#### Architecture: Spike Packet Format

| Field | Type   | [Fault_flag] | XYZ<sub>s</sub> | Timestamp | Neuron ID |
|-------|--------|--------------|-----------------|-----------|-----------|
| Width | 2 bits | 3 bits       | 9 bits          | 6 bits    | 8 bits    |

- Type: the packet header, indicating whether the packet carries system configuration ('00') or a spike ('11').
- [Fault\_flag]: used only by the fault-tolerant multicast routing algorithm.
- *XYZ<sub>s</sub>*: the address of the source neuron tile, used for spike routing.
- Timestamp: the time at which the spike was generated; in a spiking neural network, spike timing encodes the information.
- Neuron ID: the identifier of the pre-synaptic neuron.
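
The packet layout above can be sketched as a pair of pack/unpack helpers. The field widths come from the packet format; everything else is an assumption made for illustration: the MSB-first field ordering, the helper names `pack` and `unpack`, and treating XYZ<sub>s</sub> as one opaque 9-bit value.

```python
from typing import Dict, List, Tuple

# (field name, width in bits), listed from most to least significant.
# Widths follow the packet format; the MSB-first ordering is an assumption.
FIELDS: List[Tuple[str, int]] = [
    ("type", 2),
    ("fault_flag", 3),
    ("xyz_s", 9),
    ("timestamp", 6),
    ("neuron_id", 8),
]
PACKET_BITS = sum(width for _, width in FIELDS)  # 2 + 3 + 9 + 6 + 8 = 28

TYPE_CONFIG = 0b00  # system configuration packet
TYPE_SPIKE = 0b11   # spike packet


def pack(values: Dict[str, int]) -> int:
    """Pack the named fields into a single integer word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> Dict[str, int]:
    """Split an integer word back into its named fields."""
    values = {}
    shift = PACKET_BITS
    for name, width in FIELDS:
        shift -= width
        values[name] = (word >> shift) & ((1 << width) - 1)
    return values


if __name__ == "__main__":
    spike = {"type": TYPE_SPIKE, "fault_flag": 0, "xyz_s": 0b001010011,
             "timestamp": 17, "neuron_id": 200}
    word = pack(spike)
    print(f"{word:0{PACKET_BITS}b}")
    print(unpack(word) == spike)
```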

Table 5: Power consumption of the KMCR and FTSP-KMCR under the benchmarks.

| System                | KMCR (Inv. Pen.) | KMCR (Wis.) | FTSP-KMCR (Inv. Pen.) | FTSP-KMCR (Wis.) |
|-----------------------|------------------|-------------|-----------------------|------------------|
| Area (mm<sup>2</sup>) | 0.102            | 0.346       | 0.108                 | 0.365            |
| Power (mW)            | 10.13            | 34.20       | 10.64                 | 35.92            |

#### Table 6: MC-3DR Hardware Complexity Evaluation and Comparison.

| System                                        | Topology    | Area (mm<sup>2</sup>) | Power (mW) |
|-----------------------------------------------|-------------|-----------------------|------------|
| EMBRACE router [Carrillo2012], 90nm           | 2D Mesh     | 0.056                 | 1.72       |
| HANA tile router [Liu2016], 90nm              | 2D Mesh     | 0.156                 | 28.12      |
| H-NoC cluster router [Carrillo2012HNoC], 65nm | Star-Mesh   | 0.022                 | 1.19       |
| Clos-NoC spine switch [Hojabr2017], 45nm      | Custom Clos | 0.076                 | -          |
| Clos-NoC leaf switch [Hojabr2017], 45nm       | Custom Clos | 0.061                 | -          |
| MC-3DR router, 45nm (this work)               | 3D Mesh     | 0.031                 | 1.66       |



#### 3DNoC-SNN system architecture high-level view.

#### Architecture: Spiking Neural Processing Core



5-bit synapse register format

- Input type [0]
- Synaptic strength

32-bit neuron register format

- Membrane potential [0:7]
- Threshold [8:15]
- Leaky value [16:23]
- Reset value [24:31]

Figure 12: Spiking Neuron Processing Core (SNPC) architecture
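
To make the 32-bit neuron register format concrete, here is a minimal sketch of how such a register could be packed and used in a leaky integrate-and-fire update. It is not the SNPC hardware logic: treating bit 0 as the least significant bit, treating every field as an unsigned 8-bit value, saturating the membrane potential, and the update rule in `lif_step` are all assumptions, and the names `NeuronRegister` and `lif_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NeuronRegister:
    membrane_potential: int  # bits [0:7]
    threshold: int           # bits [8:15]
    leak: int                # bits [16:23], the "leaky value"
    reset: int               # bits [24:31]

    def to_word(self) -> int:
        """Pack the four 8-bit fields into one 32-bit word (bit 0 = LSB, assumed)."""
        fields = (self.membrane_potential, self.threshold, self.leak, self.reset)
        word = 0
        for i, value in enumerate(fields):
            if not 0 <= value <= 0xFF:
                raise ValueError("each field must fit in 8 bits")
            word |= value << (8 * i)
        return word

    @classmethod
    def from_word(cls, word: int) -> "NeuronRegister":
        """Unpack a 32-bit word into its four 8-bit fields."""
        return cls(*((word >> (8 * i)) & 0xFF for i in range(4)))


def lif_step(reg: NeuronRegister, synaptic_input: int) -> Tuple[NeuronRegister, bool]:
    """Integrate the input, apply the leak, and fire if the threshold is reached."""
    v = reg.membrane_potential + synaptic_input - reg.leak
    v = max(0, min(v, 0xFF))  # saturate to the 8-bit range (assumption)
    fired = v >= reg.threshold
    if fired:
        v = reg.reset
    return NeuronRegister(v, reg.threshold, reg.leak, reg.reset), fired


if __name__ == "__main__":
    reg = NeuronRegister(membrane_potential=0, threshold=10, leak=1, reset=0)
    for step in range(4):
        reg, fired = lif_step(reg, synaptic_input=4)
        print(step, reg.membrane_potential, fired)
```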

### Our work - Homeostatic Neuromorphic System

Average spike latency under varying injection rates



#### Figure 15: Average packet latency evaluation result

- The H. Vu, Yuichi Okuyama, Abderazek Ben Abdallah, "Comprehensive Analytic Performance Assessment and K-means based Multicast Routing Algorithms and Architecture for 3D-NoC of Spiking Neurons," ACM Journal on Emerging Technologies in Computing Systems (JETC), Special Issue on Hardware and Algorithms for Learning On-a-chip for Energy-Constrained On-Chip Machine Learning, Vol. 15, No. 4, Article 34, October 2019. doi: 10.1145/3340963

- The H. Vu, Ogbodo Mark Ikechukwu, and Abderazek Ben Abdallah, "Fault-tolerant Spike Routing Algorithm and Architecture for Three Dimension NoC-Based Neuromorphic Systems", *IEEE Access*, vol. 7, pp. 90436-90452, 2019.

### Agenda

- Fundamental Trends
- AI: The 4<sup>th</sup> Industrial Revolution
- Survey of AI Hardware
  - Cloud AI Hardware, Chips
  - Mobile AI Chips
  - Edge and IoT AI Chips
  - Healthcare AI Chips
- Conclusions

### Conclusions

- DNNs are a key component in the AI revolution.
- Efficient processing of DNNs is an important area of research, with many promising opportunities for innovation at various levels of hardware design, including algorithm co-design.
- It is important to consider a comprehensive set of metrics when evaluating different DNN solutions: accuracy, speed, energy, and cost.

### Conclusions

### **Memory access in AI-Chip is the bottleneck**

Worst case: ALL memory R/W are DRAM accesses
 Ex. AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required
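
The 2896M figure can be reproduced with a short calculation. The per-MAC breakdown used here (three reads for the weight, the input activation, and the partial sum, plus one write of the updated partial sum) is the usual worst-case accounting in the DNN accelerator literature and is stated as an assumption; the function name is hypothetical.

```python
# Worst-case memory accesses per multiply-accumulate (MAC), assuming no on-chip reuse:
# read weight, read input activation, read partial sum, write updated partial sum.
READS_PER_MAC = 3
WRITES_PER_MAC = 1


def worst_case_dram_accesses(num_macs: int) -> int:
    """Number of DRAM accesses if every operand access goes to DRAM."""
    return num_macs * (READS_PER_MAC + WRITES_PER_MAC)


if __name__ == "__main__":
    alexnet_macs = 724_000_000  # MAC count quoted on the slide
    accesses = worst_case_dram_accesses(alexnet_macs)
    print(accesses // 1_000_000, "M DRAM accesses")  # prints "2896 M DRAM accesses"
```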

### <u>Possible HW/SW techniques to cope with the memory access problem:</u>

#### **\***Advanced Storage Technology

- Embedded DRAM (eDRAM)  $\rightarrow$  Increase on-chip storage capacity
- 3D Stacked DRAM  $\rightarrow$  Increase memory bandwidth
- Use memristors as programmable weights (resistance)

#### **\***Reduce size of operands for storage/compute

- Floating point  $\rightarrow$  Fixed point
- Bit-width reduction
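
As an illustration of the floating-point to fixed-point conversion listed above, the following sketch quantizes values to a signed fixed-point format with saturation. The 8-bit total width, the 4 fractional bits, and the function names are illustrative assumptions, not parameters of any specific chip.

```python
from typing import List


def quantize(values: List[float], total_bits: int = 8, frac_bits: int = 4) -> List[int]:
    """Convert floats to signed fixed-point integer codes, saturating on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize(codes: List[int], frac_bits: int = 4) -> List[float]:
    """Map fixed-point integer codes back to floats."""
    return [c / (1 << frac_bits) for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.25, 0.3, 100.0]
    codes = quantize(weights)
    print(codes)
    print(dequantize(codes))
```

Storing each weight in 8 bits instead of a 32-bit float cuts the storage and memory traffic for weights by a factor of four, at the cost of rounding and saturation error.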

#### **\***Reduce number of operations for storage/compute

- Network pruning; compact network architectures
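
Network pruning can take many forms; one common variant is magnitude-based pruning, sketched below on a flat list of weights. The function name and the sparsity parameter are hypothetical, and real pruning flows typically retrain the network afterwards to recover accuracy.

```python
from typing import List


def prune_by_magnitude(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by |w|, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2]
    print(prune_by_magnitude(weights, sparsity=0.5))
```

Zeroed weights need neither storage nor a MAC operation when the hardware can skip them, which reduces both memory accesses and compute.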

### References

- 1. Dally, W. February 9, 2016. High Performance Hardware for Machine Learning, Cadence ENN Summit
- 2. Michael Alba, The Great Debate of AI Architecture, April 2018 [www.engineering.com].
- 3. [Ros15a] Ross, J., Jouppi, N., Phelps, A., Young, C., Norrie, T., Thorson, G., Luu, D., 2015. Neural Network Processor, Patent Application No. 62/164,931.
- 4. Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang, Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, ACM International Symposium on FPGA, 2016
- 5. Dally, NIPS'2016 workshop on Efficient Methods for Deep Neural Networks
- 6. https://itcafe.hu/dl/cnt/2017-12/142233/idc white paper.pdf
- 7. Top AI Trends to watch in 2018, CBINSIGHTS, 2018
- 8. What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial for Beginners | Simplilearn https://www.youtube.com/watch?v=E8n\_k6HNAgs
- 9. TensorFlow in 5 Minutes (tutorial) <u>https://www.youtube.com/watch?v=2FmcHiLCwTU</u>
- 10. Hardware Architectures for Deep Neural Networks, ISCA Tutorial June 24, 2017, http://eyeriss.mit.edu/tutorial.html
- 11. Quantifying the performance of the TPU, our first machine learning chip: <u>https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html</u>
- 12. https://streamable.com/
- 13. Abderazek Ben Abdallah, "Neuro-inspired Computing Systems & Applications", Keynote Speech, 2018 International Conference on Intelligent Autonomous Systems (ICoIAS'2018), March 1-3, 2018, Singapore.[slides.pdf]
- 14. The H. Vu, Ryunosuke Murakami, Yuichi Okuyama, Abderazek Ben Abdallah, "Efficient Optimization and Hardware Acceleration of CNNs towards the Design of a Scalable Neuro-inspired Architecture in Hardware", Proc. of the IEEE International Conference on Big Data and Smart Computing (BigComp-2018), pp. 326-332, January 15-18, 2018, Shanghai, China. [paper.pdf] [slides.pdf]
- 15. Ryunosuke Murakami, Yuichi Okuyama, Abderazek Ben Abdallah, "Animal Recognition and Identification with Deep Convolutional Neural Networks for Farm Monitoring", Information Processing Society Tohoku Branch Conference, Feb. 10, 2018 [slides.pdf]
- 16. Yuji Murakami, Yuichi Okuyama, Abderazek Ben Abdallah, "SRAM Based Neural Network System for Traffic-Light Recognition in Autonomous Vehicles", Information Processing Society Tohoku Branch Conference, Feb. 10, 2018. [slides.pdf]
- 17. Kanta Suzuki, Yuichi Okuyama, Abderazek Ben Abdallah, "Hardware Design of a Leaky Integrate and Fire Neuron Core Towards the Design of a Low-power Neuro-inspired Spike-based Multicore SoC", Information Processing Society Tohoku Branch Conference, Feb. 10, 2018. [slides.pdf]
- 18. Wulfram Gerstner and Werner M. Kistler, Spiking Neuron Models: Single Neurons, Populations, Plasticity, Cambridge University Press, 2002