# Albert Gural

BSEE, California Institute of Technology, 2016

MSEE, Stanford University, 2018

**Email:** agural (AT) stanford (DOT) edu

$[hdcolor #8c1515$]

# Hardware-Algorithm Co-design for Emerging Machine Learning Accelerators

$[/hdcolor$]

Deep neural networks (DNNs) have recently seen a resurgence in popularity due to the increased availability of data and compute capability. These advances allow DNNs to tackle previously intractable real-world decision problems. To sustain this progress, and to make it usable in practice (for example, inference on edge devices), we need to keep improving the underlying compute capabilities. However, rather than focusing on compute hardware in isolation from the algorithms that run on it, a co-design approach that draws on knowledge of both domains can lead to better designs.

For applications involving small microcontrollers, a key hardware constraint is available memory. To maximize DNN performance, it is important to design algorithms that are as memory-efficient as is practical. For example, in [1], we show that memory-optimized convolutions for deep CNNs can be made to fit in the 2 KB SRAM of an Arduino, achieving state-of-the-art accuracy on a small image classification task (MNIST).

For latency-critical applications, such as self-driving cars, we instead want to minimize inference compute time on specialized hardware (currently, GPUs or FPGAs). An understanding of hardware limitations suggests useful properties for inference algorithms: fixed-point calculations with symmetric uniform quantization, per-tensor scale factors, and power-of-two scaling. In [2], we show improved methods of training popular DNNs under these difficult constraints, thereby enabling efficient hardware inference.
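To make these constraints concrete, here is a minimal sketch of symmetric uniform quantization with a single power-of-two scale per tensor. It is an illustration only: the function names `quantize_pow2` and `dequantize_pow2` are hypothetical, and the scale is chosen from the tensor's maximum absolute value, which is a simple calibration assumption and not the training method of [2].

```python
import math


def quantize_pow2(values, bits=8):
    """Quantize a tensor (flat list of floats) to signed integers.

    Symmetric: the integer range is [-qmax, qmax], centered on zero.
    Per-tensor: one scale is shared by every element.
    Power-of-two: the scale is 2**exp, so rescaling is a bit shift in hardware.
    Returns the integer codes and the exponent of the scale.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Assumption: pick the smallest power-of-two scale that covers max_abs.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    # Round to the nearest integer code, then clip to the symmetric range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, exp


def dequantize_pow2(codes, exp):
    """Map integer codes back to real values using the scale 2**exp."""
    return [c * 2.0 ** exp for c in codes]


if __name__ == "__main__":
    codes, exp = quantize_pow2([0.5, -1.0, 0.25], bits=8)
    print(codes, exp)
    print(dequantize_pow2(codes, exp))
```

Because the scale is a power of two, the multiplication in `dequantize_pow2` corresponds to a shift on fixed-point hardware, which is the reason this constraint is attractive for inference.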

For edge applications with large DNNs, DNN weight movement begins to dominate energy costs. In-memory compute (IMC) offers an elegant solution: computations are performed where the weights are stored, so almost no weight movement is required. However, as DNN sizes grow, the chip area required to store these weights becomes a problem. One potential solution is to use emerging nonvolatile memory (NVM) such as resistive RAM (RRAM), which promises high spatial density. To use RRAM, however, we need to understand its non-idealities and their effects on DNN accelerators designed around it.

[1] Gural, Albert, and Boris Murmann. “Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications.” International Conference on Machine Learning. 2019.

[2] Jain, Sambhav R., et al. “Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware.” arXiv preprint arXiv:1903.08066 (2019).