Yiwen Shao

Senior Research Scientist · Tencent AI Lab, Bellevue, WA

I lead Speech & Audio Understanding research at Tencent AI Lab, advised by Dr. Dong Yu. My focus is on audio foundation models, including audio representation learning (audio encoders/tokenizers) and Large Audio-Language Models (LALMs).

Previously, I was a Ph.D. student at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University, advised by Dr. Daniel Povey and Dr. Sanjeev Khudanpur, where I worked primarily on Automatic Speech Recognition (ASR) and related topics. I received my Bachelor's degree from the School of Electronic Science & Engineering at Southeast University in 2017.

I am passionate about open-source software and believe in making research accessible to the broader community. Here are some of the research toolkits and systems I have developed or contributed to:

  • Auden (2025–present) – Comprehensive toolkit for audio foundation models. (Lead Developer)
  • IBM Adversarial Robustness Toolbox (2021) – Library for adversarial machine learning. (Contributor, ASR modules)
  • PyChain (2020) – PyTorch implementation of lattice-free MMI (LF-MMI) for end-to-end ASR with fully parallelized CUDA kernels; see the usage sketch after this list. (Lead Developer)
  • Espresso (2019) – End-to-end neural speech recognition toolkit built on Fairseq. (Contributor)
  • Kaldi (2017) – Speech recognition toolkit with extensive training recipes. (Contributor)
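
For a flavor of how an LF-MMI-style criterion plugs into ordinary PyTorch training, here is a minimal sketch in the spirit of PyChain. The ChainLoss criterion and the den_graph/num_graphs arguments are hypothetical placeholders for illustration, not PyChain's verbatim API; everything else is standard PyTorch.

    # Minimal sketch: wiring an LF-MMI-style criterion into a PyTorch
    # training step, in the spirit of PyChain. `ChainLoss`, `den_graph`,
    # and `num_graphs` below are hypothetical placeholders, not
    # PyChain's exact API.
    import torch
    import torch.nn as nn

    class ToyAcousticModel(nn.Module):
        """Tiny frame-level model: acoustic features -> pdf-id logits."""
        def __init__(self, feat_dim=40, hidden=256, num_pdfs=2000):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, num_pdfs),
            )

        def forward(self, feats):      # feats: (batch, time, feat_dim)
            return self.net(feats)     # logits: (batch, time, num_pdfs)

    model = ToyAcousticModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # criterion = ChainLoss(den_graph)       # shared denominator FST
    feats = torch.randn(8, 100, 40)          # dummy batch of features
    logits = model(feats)
    # loss = criterion(logits, num_graphs)   # per-utterance numerator FSTs
    # loss.backward()
    # optimizer.step()

In the real toolkit, the criterion runs numerator and denominator forward-backward passes over FSTs on the GPU, which is what the fully parallelized CUDA kernels provide.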

Research Interests

  • Speech & Audio
  • Large Audio-Language Models
  • Representation Learning

Recent Works

* equal contribution, † corresponding author or industry lead

SPEECH & MULTIMODAL LLMS
AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning
Yiwen Shao*†, Wei Liu*, Jiahong Li*, Tianzi Wang, Kun Wei, Meng Yu, Dong Yu
Self-generated, instruction-free tuning eliminates the need for task-specific data, achieving SOTA on VoiceBench and AIR-Bench with minimal training.
TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding
Mingyue Huo, Yiwen Shao, Yuheng Zhang
LLM-based framework with temporal anchor grounding for joint multi-speaker ASR and diarization.
Efficient Scaling for LLM-based ASR
Bingshen Mu, Yiwen Shao, Kun Wei, Dong Yu, Lei Xie
The EFIN (Encoder First Integration) training paradigm achieves more efficient and effective LLM-based ASR.
Advancing multi-talker ASR performance with large language models
Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu
Leveraging large language models to advance multi-talker ASR performance.
AUDIO ENCODERS/TOKENIZERS
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
Xuanru Zhou, Yiwen Shao, Wei-Cheng Tseng, Dong Yu
Data-centric, strong-supervision audio pretraining with high-fidelity captions and unified tags across speech, music, and environmental sounds.
TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
Wei Liu*, Jiahong Li*, Yiwen Shao, Dong Yu
Jointly trains ASR, translation, and contrastive objectives on a shared encoder, yielding strong cross-lingual semantic representations.
Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu
Multi-task training yields a voice encoder that balances speaker identity and paralinguistic information; Auden-Voice also integrates well with LLMs.
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation (arXiv:2511.16757)
Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu
The 10.7M-caption CaptionStew corpus enables the first contrastive vs. captioning study for general-purpose audio-language representations.
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Dual-token speech modeling with LLMs for unified speech understanding and generation across tasks.
ASR
Efficient Multilingual ASR Finetuning via LoRA Language Experts
Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian
LoRA language experts, combined via fusion and distillation, improve multilingual ASR by 10–15% relative over standard fine-tuning.
VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen
Cost-effective ASR training pipeline for low-resource languages using multi-iteration ASR-biased self-supervised learning.

Selected Publications

* equal contribution

ASR
Spatialemb: Extract and Encode Spatial Information for 1-Stage Multi-Channel Multi-Speaker ASR on Arbitrary Microphone Arrays
Yiwen Shao, Yong Xu, Sanjeev Khudanpur, Dong Yu
Novel spatial embedding approach for single-stage multi-channel multi-speaker ASR that works with arbitrary microphone array configurations.
Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment
Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur
Leverages target speaker's solo segments for improved multi-channel multi-speaker ASR through speaker-conditioned modeling.
RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios
Yiwen Shao, Shi-Xiong Zhang, Dong Yu
Novel spatial feature based on room impulse response for multi-channel multi-speaker ASR, achieving improved target speaker recognition.
Multi-channel multi-speaker ASR using 3D spatial feature
Yiwen Shao*, Shi-Xiong Zhang*, Dong Yu
Uses 3D spatial features from a depth camera to enable end-to-end multi-speaker ASR via an All-In-One model, achieving 31% CERR over 1D directional approaches.
PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur
A PyTorch-native implementation of the LF-MMI loss with fully parallelized CUDA kernels. Integrated into Espresso and a pioneering work toward next-gen Kaldi.

Experience

Senior Research Scientist, Tencent AI Lab (Bellevue)
2024–Present
Supervisor: Dong Yu
Working on speech and multimodal large language models, including ASR, speech LLMs, and large-scale audio pretraining.
Research Intern, Tencent AI Lab (Bellevue)
2021–2024 (multiple terms)
Supervisors: Shi-Xiong Zhang, Dong Yu
Developed audio-visual models for multi-channel multi-speaker speech recognition.
Research Assistant, CLSP, JHU
2017–2024
Worked on Automatic Speech Recognition (ASR) and related topics.

Contact

I am happy to discuss research collaborations, internships, or anything related to speech, audio, and multimodal AI.