CMP304-R1: AWS infrastructure for large-scale training at Facebook AI

AWS re:Invent 2019 - Podcast by AWS

In this session, the Facebook AI team discusses its major machine learning models and workloads and the infrastructure challenges it faced with large-scale distributed training. The team shares how it tested these ML workloads on AWS infrastructure and the results of that benchmarking. We then discuss how the depth and breadth of AWS infrastructure for ML workloads across compute, networking, and storage helps address large-scale ML challenges. Specifically, we dive deep into the AWS machine learning stack to show how to choose the right Amazon EC2 platform for your ML workload while leveraging 100 Gbps networking and high-performance file systems to scale efficiently from a single GPU to hundreds or thousands.
