Public Beta: Google’s Scalable 100+ PetaFLOP Supercomputers For Machine Learning: Cloud Tensor Processing Unit (TPU) Pods

(Image: a full Cloud TPU v3 Pod)

To accelerate the largest-scale machine learning (ML) applications deployed today and enable rapid development of the ML applications of tomorrow, Google created custom silicon chips called Tensor Processing Units (TPUs). When assembled into multi-rack ML supercomputers called Cloud TPU Pods, these TPUs can complete ML workloads in minutes or hours that previously took days or weeks on other systems. Today, for the first time, Google Cloud TPU v2 Pods and Cloud TPU v3 Pods are publicly available in beta to help ML researchers, engineers, and data scientists iterate faster and train more capable machine learning models.

Delivering business value

Google Cloud is committed to providing a full spectrum of ML accelerators, including both Cloud GPUs and Cloud TPUs. Cloud TPUs offer highly competitive performance and cost, often training cutting-edge deep learning models faster while delivering significant savings. If your ML team is building complex models and training on large datasets, we recommend that you evaluate Cloud TPUs whenever you require:

Shorter time to insights—iterate faster while training large ML models
Higher accuracy—train more accurate models using larger datasets (millions of labeled examples; terabytes or petabytes of data)
Frequent model updates—retrain a model daily or weekly as new data comes in
Rapid prototyping—start quickly with our optimized, open-source reference models in image segmentation, object detection, language processing, and other major application domains

While some custom silicon chips can only perform a single function, TPUs are fully programmable, which means that Cloud TPU Pods can accelerate a wide range of state-of-the-art ML workloads, including many of the most popular deep learning models. For example, a Cloud TPU v3 Pod can train ResNet-50 (image classification) from scratch on the ImageNet dataset in just two minutes or BERT (NLP) in just 76 minutes.

Cloud TPU customers see significant speed-ups in workloads spanning visual product search, financial modeling, energy production, and other areas. For example, a recent case study describes how Recursion Pharmaceuticals iteratively tests the viability of synthesized molecules to treat rare illnesses: a model that took over 24 hours to train on their on-prem cluster trained in only 15 minutes on a Cloud TPU Pod.

What’s in a Cloud TPU Pod

A single Cloud TPU Pod can include more than 1,000 individual TPU chips, which are connected by an ultra-fast, two-dimensional toroidal mesh network, as illustrated below. The TPU software stack uses this mesh network to enable many racks of machines to be programmed as a single, giant ML supercomputer via a variety of flexible, high-level APIs.
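To make the "two-dimensional toroidal mesh" idea concrete, here is a minimal Python sketch of how neighbors and hop counts work on a 2D torus, where each edge of the grid wraps around to the opposite edge. The 32 x 32 grid size and the function names `torus_neighbors` and `torus_hops` are assumptions chosen for illustration; this post does not state the actual dimensions or wiring of a Cloud TPU Pod.

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of chip (x, y) on a width x height 2D torus."""
    return [
        ((x - 1) % width, y),   # west, wrapping around the left edge
        ((x + 1) % width, y),   # east, wrapping around the right edge
        (x, (y - 1) % height),  # north, wrapping around the top edge
        (x, (y + 1) % height),  # south, wrapping around the bottom edge
    ]


def torus_hops(a, b, width, height):
    """Minimum number of link hops between chips a and b on the torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # In each dimension, traffic can go either way around the ring.
    return min(dx, width - dx) + min(dy, height - dy)


# Hypothetical 32 x 32 grid (1,024 chips); the real layout is not given here.
WIDTH, HEIGHT = 32, 32

corner_neighbors = torus_neighbors(0, 0, WIDTH, HEIGHT)
farthest = max(
    torus_hops((0, 0), (x, y), WIDTH, HEIGHT)
    for x in range(WIDTH)
    for y in range(HEIGHT)
)
print(corner_neighbors)
print(farthest)
```

Because of the wraparound links, even a corner chip has four neighbors, and the worst-case distance in this hypothetical grid is 32 hops rather than the 62 hops a non-wrapping 32 x 32 mesh would need.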

The latest-generation Cloud TPU v3 Pods are liquid-cooled for maximum performance, and each one delivers more than 100 petaFLOPs of computing power. In terms of raw mathematical operations per second, a Cloud TPU v3 Pod is comparable to a top-5 supercomputer worldwide (though it operates at lower numerical precision).

It’s also possible to use smaller sections of Cloud TPU Pods called “slices.” We often see ML teams develop their initial models on individual Cloud TPU devices (which are generally available) and then expand to progressively larger Cloud TPU Pod slices via both data parallelism and model parallelism to achieve greater training speed and model scale.
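As a rough illustration of the data parallelism mentioned above, the following Python sketch splits one batch into equal shards, computes a gradient on each shard as a separate worker would, and averages the results. It is a toy example under stated assumptions, not Cloud TPU code: the one-parameter model, the data, and the names `gradient` and `data_parallel_gradient` are all hypothetical, and the averaging step stands in for the all-reduce that real hardware performs over its interconnect.

```python
def gradient(w, xs, ys):
    """Mean-squared-error gradient for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute one gradient per shard, average."""
    shard = len(xs) // num_workers
    assert shard * num_workers == len(xs), "batch must divide evenly across workers"
    grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # On real hardware this averaging step would be an all-reduce across devices.
    return sum(grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # toy data generated from y = 2x
w = 0.5                     # current (deliberately wrong) parameter value

full = gradient(w, xs, ys)
sharded = data_parallel_gradient(w, xs, ys, num_workers=4)
print(full, sharded)
```

With equal-sized shards, the averaged per-shard gradient matches the full-batch gradient, which is why adding workers can speed up each training step without changing the update. Model parallelism, by contrast, splits the model itself across devices and is not shown here.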

You can learn more about the underlying architecture of TPUs in this blog post or this interactive website, and you can learn more about individual Cloud TPU devices and Cloud TPU Pod slices here.

Getting started

It’s easy and fun to try out a Cloud TPU in your browser right now via this interactive Colab that enables you to apply a pre-trained Mask R-CNN image segmentation model to an image of your choice. You can learn more about image segmentation on Cloud TPUs in this recent blog post.

Next, we recommend working through our Cloud TPU Quickstart and then experimenting with one of the optimized and open-source Cloud TPU reference models listed below. We carefully optimized these models to save you time and effort, and they demonstrate a variety of Cloud TPU best practices. Benchmarking one of our official reference models on a public dataset on larger and larger pod slices is a great way to get a sense of Cloud TPU performance at scale.
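When benchmarking on progressively larger slices, one simple way to summarize the results is to compute speed-up and parallel efficiency relative to the smallest slice. The Python sketch below shows that calculation. The function name `scaling_report`, the slice sizes, and the throughput values are placeholders for illustration only; they are not measured Cloud TPU results, and you would substitute your own numbers.

```python
def scaling_report(throughput_by_cores):
    """Compute speed-up and parallel efficiency relative to the smallest slice.

    throughput_by_cores maps a slice size (number of cores) to the measured
    throughput on that slice (for example, images per second).
    """
    base_cores = min(throughput_by_cores)
    base_throughput = throughput_by_cores[base_cores]
    report = {}
    for cores in sorted(throughput_by_cores):
        speedup = throughput_by_cores[cores] / base_throughput
        ideal = cores / base_cores  # perfect linear scaling
        report[cores] = {"speedup": speedup, "efficiency": speedup / ideal}
    return report


# Placeholder numbers for illustration only; NOT measured Cloud TPU results.
example = {8: 1000.0, 32: 3800.0, 128: 14000.0}
report = scaling_report(example)

for cores, row in report.items():
    print(f"{cores:>4} cores: {row['speedup']:.2f}x speed-up, "
          f"{row['efficiency']:.1%} efficiency")
```

An efficiency close to 100% means throughput is growing almost linearly with slice size; a falling value suggests a bottleneck, such as the input pipeline, that is worth profiling.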

Image classification

ResNet (tutorial, code, blog post)
AmoebaNet-D (tutorial, code)
Inception (tutorial, code)

Mobile image classification

MnasNet (tutorial, code, blog post)
MobileNet (code)

Object detection

RetinaNet (tutorial, code, blog post)
TensorFlow Object Detection API (blog post, tutorial)

Image segmentation

Mask R-CNN (tutorial, code, blog post, interactive Colab)
DeepLab (tutorial, code, blog post, interactive Colab)

Natural language processing

BERT (code, interactive Colab)
Transformer (tutorial, Tensor2Tensor docs)
Mesh TensorFlow (paper, code)
QANet (code)
Transformer-XL (code)

Speech recognition

ASR Transformer (tutorial)
Lingvo (code)

Generative Adversarial Networks

Compare GAN library, including a reimplementation of BigGAN (blog post, paper, code)
DCGAN (code)

[Google Sales Pitch:]

After you work with one of the above reference models on Cloud TPU, our performance guide, profiling tools guide, and troubleshooting guide can give you in-depth technical information to help you create and optimize machine learning models on your own using high-level TensorFlow APIs. Once you’re ready to request a Cloud TPU Pod or Cloud TPU Pod slices to accelerate your own ML workloads, please contact a Google Cloud sales representative.

from: https://cloud.google.com/blog/products/ai-machine-learning/googles-scalable-supercomputers-for-machine-learning-cloud-tpu-pods-are-now-publicly-available-in-beta