Microsoft Research Webinar | ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed

Microsoft Research Webinar Series


Available on-demand. Register now.

ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed

The latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of their cost, training time, and the difficulty of integrating them into existing code. With the goal of advancing large-model training by improving scale, speed, cost, and usability for model developers across the world, Microsoft open-sourced the DeepSpeed library in February 2020.

In this webinar, the DeepSpeed team will discuss what DeepSpeed is, how to use it with your existing PyTorch models, and the advancements in the ZeRO optimizer that are central to training models with 100–200 billion parameters and beyond. The team will also present a deep dive into how they set the world record for fastest BERT training.

DeepSpeed can efficiently train models with 100–200 billion parameters up to 10 times faster than the state of the art by using a memory optimization system called ZeRO (Zero Redundancy Optimizer). ZeRO is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), one of the largest publicly known language models, at 17 billion parameters.
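As a rough illustration of what enabling ZeRO looks like in practice, the sketch below writes a minimal DeepSpeed configuration file. The key names follow DeepSpeed's JSON config schema, but the specific values are illustrative assumptions, not settings from the webinar.

```python
import json

# Minimal DeepSpeed configuration sketch (values are illustrative).
# "zero_optimization.stage" controls how aggressively ZeRO partitions
# training state across data-parallel GPUs: stage 1 partitions optimizer
# states, and stage 2 additionally partitions gradients.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},           # mixed-precision training
    "zero_optimization": {"stage": 1},   # partition optimizer states
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

With a config like this in hand, integrating DeepSpeed into an existing PyTorch training script typically amounts to wrapping the model with `deepspeed.initialize(model=model, model_parameters=model.parameters(), config="ds_config.json")` and replacing `loss.backward()` and `optimizer.step()` with the returned engine's `engine.backward(loss)` and `engine.step()`.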

DeepSpeed recently set the record for fastest BERT training: 44 minutes on 1,024 NVIDIA V100 GPUs. This is a 34% improvement over the best published result, and it comes not from throwing excessive hardware at the problem but from improved software efficiency. DeepSpeed attains a staggering 64 teraflops of single-GPU performance on an NVIDIA V100, which is over 50% of the hardware peak.
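The "over 50% of peak" figure can be checked against NVIDIA's published specification for the V100, which lists a peak of 125 teraflops of mixed-precision (Tensor Core) throughput:

```python
# NVIDIA's V100 datasheet lists 125 TFLOPS of peak mixed-precision
# (Tensor Core) throughput; 64 TFLOPS is the single-GPU figure
# reported for DeepSpeed's BERT training run.
V100_PEAK_TFLOPS = 125
achieved_tflops = 64

fraction_of_peak = achieved_tflops / V100_PEAK_TFLOPS
print(f"{fraction_of_peak:.1%} of hardware peak")  # 51.2% of hardware peak
```

Sustaining half of peak throughput on real training workloads is unusual, since memory bandwidth and kernel launch overheads typically keep achieved utilization well below the datasheet number.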

Together, you will explore:

  • DeepSpeed features, optimizations for speed and scale, and a roadmap for the future
  • How to use DeepSpeed to train your own model and other popular models like BERT and GPT-2
  • A deep dive into the technology behind the ZeRO optimizer and upcoming features
  • How we achieved the world record for BERT training using this technology

The DeepSpeed team is a group of systems researchers and engineers who are enthusiastic about performance optimization of large-scale systems. Presenters in this webinar include Principal Research Manager Yuxiong He and researchers Samyam Rajbhandari, Jeff Rasley, and Tunji Ruwase.

*This on-demand webinar features a previously recorded Q&A session and open captioning.