paper showed that inference occurs by
interpolating between, or extrapolating out from, points on the fitted surface. In the data center, large numbers of inference tasks can be running at the same time. Many of these inference tasks, such as speech recognition, also need to run in real time.
Speech as a bellwether
Due to its explosive growth, speech
recognition is becoming a defining
benchmark for machine learning in the
data center. It is clearly a computationally
intensive real-time task (Figure 1). In May 2016, one in five searches made in the U.S. on Android mobile apps were voice searches. In June 2015, Siri handled more than 1 billion spoken requests per week. In May 2016, one in four searches performed on the Windows 10 taskbar were voice searches, according to Bing. Mary Meeker, partner at Kleiner Perkins Caufield & Byers, suggested that by 2020 half of all web searches will be made using voice and image search (Figure 2). The significant uptick in usage is widely attributed to the increased accuracy of speech recognition.
The accelerator approach
The current accelerator approach to training and inference encompasses graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and custom-designed chips.
NVIDIA is the leader in GPU-accelerated machine learning. NVIDIA's Ian Buck has observed that “AI is out-innovating the central processing unit (CPU) roadmap by up to 10x a year in some cases.” This
explains why NVIDIA has made such
significant investments in accelerators for
both training and inference (Figure 3).
For training, NVIDIA created the
P100 Pascal GPU and sells systems such
as the DGX-1 supercomputer to speed
training workloads. Performance is
delineated according to precision from
double-precision (64-bit) down to FP16
(16-bit floating-point arithmetic).
The benefit of lower-precision FP16 arithmetic is twofold: 2x the number of arithmetic operations can be performed per unit time, and the effective bandwidth of the memory subsystem is doubled because 2x as many data items can be brought out of memory per cache-line fetch. For inference, NVIDIA takes this reduced precision further, down to the resolution of 8-bit (INT8) arithmetic. The benefit is that 4x the number of INT8 operations can be processed per unit time, while also making more effective use of memory bandwidth.
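To make the arithmetic concrete, the short sketch below (an illustrative calculation only, not NVIDIA code; the 64-byte cache-line size is an assumption) counts how many values of each precision fit in one cache-line fetch and the corresponding peak-throughput multiplier relative to FP32.

```python
import numpy as np

CACHE_LINE_BYTES = 64  # assumed cache-line / memory-transaction size

# Bytes per element for the precisions discussed above.
precisions = {
    "FP32": np.dtype(np.float32).itemsize,  # 4 bytes
    "FP16": np.dtype(np.float16).itemsize,  # 2 bytes
    "INT8": np.dtype(np.int8).itemsize,     # 1 byte
}

for name, nbytes in precisions.items():
    values_per_fetch = CACHE_LINE_BYTES // nbytes   # data brought in per cache-line fetch
    ops_multiplier = precisions["FP32"] // nbytes    # peak ops per unit time vs. FP32
    print(f"{name}: {values_per_fetch} values per fetch, {ops_multiplier}x ops vs. FP32")
```

The printed multipliers reproduce the 2x (FP16) and 4x (INT8) figures cited above.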
To exploit this 4x-or-greater performance opportunity, NVIDIA created the Tesla P4 and P40 GPUs plus the TensorRT
high-performance inference engine.
The TensorRT library has been designed
to migrate trained neural networks to the
P4 and P40 GPUs and to utilize INT8
arithmetic. The challenge with INT8
arithmetic lies in the discretization of the
multidimensional surface used for inference.
Is such low-resolution arithmetic accurate enough for inference? Said another way,
is the INT8 arithmetic accurate enough to
represent the fitted multidimensional surface,
or do discretization artifacts destroy the
accuracy? To address this concern, TensorRT
has the capability to run the INT8 network
on a data set (say, the training set) so the
user can evaluate any changes in accuracy or
behavior (Figure 4).
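As a rough illustration of the discretization question (a generic symmetric-quantization sketch, not TensorRT's actual calibration procedure; the weight tensor here is synthetic), one can quantize FP32 values to INT8 with a per-tensor scale factor and measure the error the rounding introduces:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an FP32 tensor to INT8 (illustrative only)."""
    scale = np.max(np.abs(x)) / 127.0  # map the largest magnitude to +/-127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

# Synthetic stand-in for trained layer weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale))
print(f"max quantization error: {error.max():.6f}, mean: {error.mean():.6f}")
```

Whether errors of this size matter can only be judged end to end, which is why the ability to run the INT8 network over a data set and compare its behavior against the original is important.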
The NVIDIA Tesla P4 is a half-height card designed to fit into high-density “scale out” server systems. Figure 5 shows the size of a P4 card relative to a pencil, as well as a Tesla P4 inside a high-density tray. Each card delivers 5.5 TeraFLOPS of peak single-precision performance and 22 TOPS (trillion operations per second) of peak INT8 performance.
The NVIDIA Tesla P40 is designed for the
highest throughput for “scale up” servers.
The card delivers 12 TeraFLOPS of peak
single-precision performance and 48 TOPS
of peak INT8 performance.
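These figures are consistent with the 4x INT8-to-FP32 ratio noted earlier; a trivial check using the numbers quoted above (illustrative only):

```python
# Peak single-precision TFLOPS and peak INT8 TOPS quoted in the text.
cards = {"Tesla P4": (5.5, 22), "Tesla P40": (12, 48)}

for name, (fp32_tflops, int8_tops) in cards.items():
    ratio = int8_tops / fp32_tflops  # INT8 peak relative to FP32 peak
    print(f"{name}: {int8_tops} TOPS / {fp32_tflops} TFLOPS = {ratio:.0f}x")
```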
FPGAs and specialized processors
The application of FPGAs and custom-designed chips to machine learning is still in its nascent stages. With GPU, CPU, and
FPGA technologies in active development,
Figure 2: Source: 2016 Internet Trends Report from Kleiner Perkins Caufield & Byers.
Figure 3: Training compared to a Kepler GPU in 2013 using Caffe; inference comparing img/sec/watt to a CPU (Intel E5-2697v4) using AlexNet. (Source: NVIDIA)