Google’s dedicated TensorFlow processor, or TPU, crushes Intel, Nvidia in inference workloads

A few years ago, Google began working on its own custom software for machine learning and artificial intelligence workloads, dubbed TensorFlow. Last year, the company announced that it had designed its own tensor processing unit (TPU), an ASIC designed for high throughput of low-precision arithmetic. Now, Google has released some performance data for its TPU and how it compares against Intel’s Haswell CPUs and Nvidia’s K80 (Kepler-based) data center dual GPU.

Before we dive into the data, we need to talk about the workloads Google is discussing. All of Google’s benchmarks measure inference performance rather than initial neural network training. Nvidia has a graphic that summarizes the differences between the two:



Teaching a neural network what to recognize and how to recognize it is called training, and these workloads are still typically run on CPUs or GPUs. Inference refers to the neural network’s ability to apply what it learned from training. Google makes it clear that it is only interested in low-latency operations, and that it has imposed strict responsiveness standards on the benchmarks we’ll discuss below.

Google’s TPU design and benchmarks

The first part of Google’s paper discusses the various types of deep neural networks it deploys and the specific benchmarks it uses, and offers a diagram of the TPU’s physical layout, pictured below. The TPU is specifically designed for 8-bit integer workloads, and it prioritizes consistently low latency over raw throughput (both CPUs and GPUs tend to prioritize throughput over latency, particularly GPUs).
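
That 8-bit focus is worth unpacking: before a network can run on integer hardware, its trained floating-point weights have to be quantized down to integers. The snippet below is a minimal sketch of one common scheme (symmetric linear quantization); it is illustrative only, not Google's actual method:

```python
# Illustrative sketch (not Google's exact scheme): symmetric linear
# quantization of float weights to signed 8-bit integers, the kind of
# low-precision representation the TPU's integer units operate on.

def quantize(weights, num_bits=8):
    """Map floats onto signed integers in [-127, 127] plus a scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in q_weights]

weights = [0.12, -0.5, 0.33, 0.9]
q, scale = quantize(weights)
approx = dequantize(q, scale)
# Each recovered value lands within half a quantization step of the original,
# which is the accuracy trade-off inference accelerators accept for speed.
```

The payoff is that an 8-bit multiply-accumulate costs far less silicon and energy than a 32-bit floating-point one, which is exactly the trade the TPU makes.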



Google writes (PDF): “Rather than be tightly integrated with a CPU, to reduce the chances of delaying deployment, the TPU was designed to be a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify design and debugging, the host server sends TPU instructions for it to execute rather than fetching them itself. Hence, the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU.”



Each TPU also has an off-chip 8GiB DRAM pool, which Google calls Weight Memory, while intermediate results are held in a 24MiB pool of on-chip memory (that’s the Unified Buffer in the diagram above). The TPU has a four-stage pipeline and executes CISC instructions, with some instructions taking thousands of clock cycles to execute, versus the typical RISC pipeline of one clock cycle per pipeline stage. The table below shows how the E5-2699v3 (Haswell), Nvidia K80, and TPU compare against one another across various metrics.



Before we hit the benchmark results, there are a few things we need to note. First, Turbo mode and GPU Boost were disabled for both the Haswell CPU and the Nvidia GPUs, not to artificially tilt the scales in favor of the TPU, but because Google’s data centers prioritize dense packing over raw performance. Higher turbo clock rates on the v3 Xeon depend on not using AVX, which Google’s neural networks all tend to use. As for Nvidia’s K80, the test server in question deployed four K80 cards with two GPUs per card, for a total of eight GPUs. Packed that tightly, the only way to take advantage of the GPU’s boost clock without causing an overheat would have been to remove two of the K80 cards. Since the clock frequency boost isn’t nearly as potent as doubling the total number of GPUs in the server, Google leaves boost disabled in these server configurations.

Google’s benchmark figures all use the roofline performance model. The advantage of this model is that it creates an intuitive picture of overall performance. The flat roofline represents theoretical peak performance, while the various data points show real-world results.

In this case, the Y-axis is integer operations per second, while the “Operational Intensity” X-axis corresponds to integer operations per byte of weights read (emphasis Google’s). The gap between an application’s observed performance and the curve directly above it shows how much additional performance could be gained if the benchmark were better optimized for the architecture in question, while data points on the slanted portion of the roofline indicate that the benchmark is running into memory bandwidth limitations. The slideshow below shows Google’s results in various benchmarks for its CPU, GPU, and TPU tests. As always, each slide can be clicked to open a larger image in a new window.
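
The roofline model itself is simple enough to express in a few lines: attainable throughput is capped either by peak compute or by memory bandwidth times operational intensity, whichever is lower. A sketch with illustrative numbers (roughly the TPU figures from Google’s comparison table, but treat them as placeholders rather than exact values):

```python
# Sketch of the roofline model used in the paper's figures. Attainable
# throughput is the lesser of the compute ceiling (the flat roof) and
# bandwidth * intensity (the slanted part of the roof).

def roofline(peak_ops, peak_bandwidth, operational_intensity):
    """Attainable ops/sec for a kernel at a given operational intensity
    (operations per byte of weights read)."""
    return min(peak_ops, peak_bandwidth * operational_intensity)

PEAK_OPS = 92e12   # 8-bit integer ops/sec (illustrative compute ceiling)
BANDWIDTH = 34e9   # bytes/sec of weight-memory bandwidth (illustrative)

# A low-intensity kernel sits on the slanted, bandwidth-bound slope;
# a high-intensity kernel hits the flat compute ceiling instead.
low = roofline(PEAK_OPS, BANDWIDTH, 100)      # bandwidth-bound
high = roofline(PEAK_OPS, BANDWIDTH, 10_000)  # compute-bound
```

Plotting that function on log-log axes produces exactly the angled-roof shape in Google’s slides, with each benchmark appearing as a point at its measured intensity and throughput.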

Google’s TPU isn’t just a high-performance engine; it offers significantly improved performance per watt as well, both in the original TPU and in the improved variants Google has modeled (TPU’).



The chief limiting factor standing between Google’s TPU and higher performance is memory bandwidth. Google’s models show TPU performance improving 3x if memory bandwidth is increased 4x over current designs. No other set of improvements, including clock rate increases, larger accumulators, or a combination of multiple factors, has much of an impact on performance.
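
As a back-of-envelope illustration of why 4x the bandwidth yields only about 3x the performance (illustrative numbers, not Google’s actual model): a kernel on the slanted, bandwidth-bound part of the roofline benefits from extra bandwidth only until it reaches the flat compute ceiling.

```python
# Rough illustration of the bandwidth ceiling. Attainable throughput is
# min(compute peak, bandwidth * intensity); quadrupling bandwidth helps
# only until the kernel hits the flat compute roof.

PEAK = 92e12     # ops/sec compute ceiling (illustrative)
BW = 34e9        # bytes/sec weight-memory bandwidth (illustrative)
INTENSITY = 900  # ops per byte, chosen so the boosted case hits the roof

base = min(PEAK, BW * INTENSITY)           # bandwidth-bound baseline
boosted = min(PEAK, 4 * BW * INTENSITY)    # same kernel, 4x the bandwidth
speedup = boosted / base                   # ~3x, capped by the compute roof
```

With these placeholder figures, the boosted case is clipped by the compute ceiling, so the gain lands near 3x rather than the full 4x, which matches the general shape of Google’s result.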

The final section of Google’s paper is devoted to dispelling various fallacies and correcting misunderstandings, many of which relate to the choice of the K80 GPU. One section is particularly worth quoting:

Fallacy: CPU and GPU results would be comparable to the TPU if we used them more efficiently or compared against newer versions.

We originally had 8-bit results for just one DNN on the CPU, due to the significant work to use AVX2 integer support efficiently. The benefit was ~3.5X. It was less confusing (and space) to present all CPU results in floating point, rather than having one exception, with its own roofline. If all DNNs had similar speedup, performance/Watt ratio would drop from 41-83X to 12-24X. The new 16-nm, 1.5GHz, 250W P40 datacenter GPU can perform 47 Tera 8-bit ops/sec, but was unavailable in early 2015, so isn’t contemporary with our three platforms. We also can’t know the fraction of P40 peak delivered within our rigid time bounds. If we compared newer chips, Section 7 shows that we could triple performance of the 28-nm, 0.7GHz, 40W TPU just by using the K80’s GDDR5 memory (at a cost of an additional 10W).

This sort of announcement isn’t the kind of thing Nvidia is going to be happy to hear. To be clear, Google’s TPU results today apply to inference workloads, not the initial task of training the neural network; that training is still done on GPUs. But, with respect to both Nvidia and AMD, we’ve also seen this kind of cycle play out before. Once upon a time, CPUs were the unquestioned kings of cryptocurrency mining. Then, as difficulty rose, GPUs became dominant, thanks to vastly higher hash rates. In the long run, however, custom ASICs took over the market.

Both AMD and Nvidia have recently added (Nvidia) or announced (AMD) support for 8-bit operations to improve total GPU throughput in deep learning and AI workloads, but it will take significant improvements over and above these steps to blunt the advantage ASICs would possess if they start moving into these markets. That’s not to say we expect custom ASIC designs to own the market; Google and Microsoft may be able to afford to build their own custom hardware, but most customers won’t have the funds or expertise to take that on.