Microsoft Research Webinar | Domain-specific language model pretraining for biomedical natural language processing

Microsoft Research Webinar Series

Register for the webinar

Complete the form below and receive an email with a link to the presentation

*required fields


Available on-demand. Register now.

Domain-specific language model pretraining for biomedical natural language processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and web text. Biomedical text is very different from general-domain text, yet biomedical NLP has been relatively underexplored. A prevailing assumption is that even domain-specific pretraining benefits from starting with general-domain language models.

In this webinar, Microsoft researchers Hoifung Poon, Senior Director of Biomedical NLP, and Jianfeng Gao, Distinguished Scientist, will challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models.

You will begin by learning how biomedical text differs from general-domain text and how biomedical NLP poses substantial challenges not present in mainstream NLP. You will also learn about the two paradigms for domain-specific language model pretraining and see how pretraining from scratch significantly outperforms mixed-domain pretraining on a wide range of biomedical NLP tasks. Finally, you will learn about BLURB, our comprehensive benchmark and leaderboard created specifically for biomedical NLP, and see how our biomedical language model, PubMedBERT, sets a new state of the art.
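One intuition behind pretraining from scratch is that it lets the model learn an in-domain vocabulary: a subword vocabulary derived from general-domain text tends to shatter biomedical terms into many fragments. The toy sketch below (not the actual PubMedBERT tokenizer, and with hypothetical vocabularies chosen purely for illustration) shows the effect with a WordPiece-style greedy longest-match tokenizer:

```python
def wordpiece_tokenize(word, vocab):
    """WordPiece-style greedy longest-match-first subword tokenization.

    Repeatedly takes the longest prefix of the remaining characters
    that appears in `vocab`; non-initial pieces carry a "##" prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no subword matched at this position
        start = end
    return pieces

# Hypothetical vocabularies for illustration only.
general_vocab = {"ace", "##ty", "##l", "##cho", "##line", "the", "cell"}
biomed_vocab = {"acetylcholine", "the", "cell"}

# A general-domain vocabulary shatters the biomedical term into fragments;
# an in-domain vocabulary keeps it as a single meaningful token.
print(wordpiece_tokenize("acetylcholine", general_vocab))
print(wordpiece_tokenize("acetylcholine", biomed_vocab))
```

A single-token representation gives the model one coherent embedding for the concept, rather than forcing it to reassemble meaning from arbitrary fragments.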

Together, you'll explore:

  • How biomedical NLP differs from mainstream NLP
  • A shift in approach to pretraining language models for specialized domains
  • BLURB: a comprehensive benchmark and leaderboard for biomedical NLP
  • PubMedBERT: the state-of-the-art biomedical language model pretrained from scratch on biomedical text

Hoifung Poon is the Senior Director of Biomedical NLP at Microsoft Health Futures and an affiliated professor at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of structuring medical data for precision medicine. He has given tutorials on this topic at top conferences. His research spans a wide range of problems in machine learning and natural language processing, and his prior work has been recognized with Best Paper Awards from many premier venues. He received his PhD in Computer Science and Engineering from the University of Washington, specializing in machine learning and NLP.

Jianfeng Gao is a Distinguished Scientist at Microsoft Research and the Partner Research Manager of the Deep Learning (DL) group at Microsoft Research AI. He leads the development of AI systems for natural language processing, web search, vision language understanding, dialogue, and business applications. He is an IEEE Fellow and has received awards at top AI conferences. From 2014 to 2017, he was Partner Research Manager at the Deep Learning Technology Center at Microsoft Research, Redmond, where he led research on deep learning for text and image processing. From 2006 to 2014, he was Principal Researcher in the Natural Language Processing Group at Microsoft Research, Redmond, where he worked on web search, query understanding and reformulation, ads prediction, and statistical machine translation. He has worked in various roles for Microsoft Research in natural language and beyond since 2000.

*This on-demand webinar features a previously recorded Q&A session and open captioning.