Yiwen Shao

Senior Research Scientist · Tencent AI Lab, Bellevue, WA

I lead Speech & Audio Understanding research at Tencent AI Lab, advised by Dr. Dong Yu. My focus is on audio foundation models, including audio representation learning (audio encoders/tokenizers) and Large Audio-Language Models (LALMs).

Previously, I was a Ph.D. student at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University, advised by Dr. Daniel Povey and Dr. Sanjeev Khudanpur, where I worked primarily on Automatic Speech Recognition (ASR) and related topics. I received my Bachelor's degree from the School of Electronic Science & Engineering at Southeast University in 2017.

I am passionate about open-source software and believe in making research accessible to the broader community. Here are some of the research toolkits and systems I have developed or contributed to:

  • Auden (2025–present) – Comprehensive toolkit for audio foundation models. (Lead Developer)
  • IBM Adversarial Robustness Toolbox (2021) – Library for adversarial machine learning. (Contributor, ASR modules)
  • PyChain (2020) – PyTorch implementation of lattice-free MMI (LF-MMI) for end-to-end ASR with fully parallelized CUDA kernels; see the usage sketch after this list. (Lead Developer)
  • Espresso (2019) – End-to-end neural speech recognition toolkit built on Fairseq. (Contributor)
  • Kaldi (2017) – Speech recognition toolkit with extensive training recipes. (Contributor)
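
For a flavor of how an LF-MMI-style criterion plugs into ordinary PyTorch training, here is a minimal sketch in the spirit of PyChain. The ChainLoss criterion and the den_graph/num_graphs arguments are hypothetical placeholders for illustration, not PyChain's verbatim API; everything else is standard PyTorch.

    # Minimal sketch: wiring an LF-MMI-style criterion into a PyTorch
    # training step, in the spirit of PyChain. `ChainLoss`, `den_graph`,
    # and `num_graphs` below are hypothetical placeholders, not
    # PyChain's exact API.
    import torch
    import torch.nn as nn

    class ToyAcousticModel(nn.Module):
        """Tiny frame-level model: acoustic features -> pdf-id logits."""
        def __init__(self, feat_dim=40, hidden=256, num_pdfs=2000):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, num_pdfs),
            )

        def forward(self, feats):      # feats: (batch, time, feat_dim)
            return self.net(feats)     # logits: (batch, time, num_pdfs)

    model = ToyAcousticModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # criterion = ChainLoss(den_graph)       # shared denominator FST
    feats = torch.randn(8, 100, 40)          # dummy batch of features
    logits = model(feats)
    # loss = criterion(logits, num_graphs)   # per-utterance numerator FSTs
    # loss.backward()
    # optimizer.step()

In the real toolkit, the criterion runs numerator and denominator forward-backward passes over FSTs on the GPU, which is what the fully parallelized CUDA kernels provide.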

Research Interests

  • Speech & Audio
  • Large Audio-Language Models
  • Representation Learning

Recent Works

* equal contribution, † corresponding author or industry lead

SPEECH & MULTIMODAL LLMS
AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning
Yiwen Shao*†, Wei Liu*, Jiahong Li*, Tianzi Wang, Kun Wei, Meng Yu, Dong Yu
Self-generated, instruction-free tuning eliminates the need for task-specific data, achieving SOTA on VoiceBench and AIR-Bench with minimal training.
TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding
Mingyue Huo, Yiwen Shao, Yuheng Zhang
LLM-based framework with temporal anchor grounding for joint multi-speaker ASR and diarization.
Efficient Scaling for LLM-based ASR
Bingshen Mu, Yiwen Shao, Kun Wei, Dong Yu, Lei Xie
The EFIN (Encoder First Integration) training paradigm achieves more efficient and effective LLM-based ASR.
Advancing multi-talker ASR performance with large language models
Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu
Leveraging large language models to advance multi-talker ASR performance.
AUDIO ENCODERS/TOKENIZERS
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
Xuanru Zhou, Yiwen Shao, Wei-Cheng Tseng, Dong Yu
Data-centric, strong-supervision audio pretraining with high-fidelity captions and unified tags across speech, music, and environmental sounds.
TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
Wei Liu*, Jiahong Li*, Yiwen Shao, Dong Yu
Jointly trains ASR, translation, and contrastive objectives on a shared encoder, yielding strong cross-lingual semantic representations.
Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu
Multi-task training yields a voice encoder that balances speaker identity and paralinguistic information; Auden-Voice also integrates well with LLMs.
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation (arXiv:2511.16757)
Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu
The 10.7M-caption CaptionStew corpus enables the first contrastive vs. captioning study for general-purpose audio-language representations.
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Dual-token speech modeling with LLMs for unified speech understanding and generation across tasks.
ASR
Efficient Multilingual ASR Finetuning via LoRA Language Experts
Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian
LoRA language experts, combined via fusion and distillation, improve multilingual ASR by 10–15% relative over standard fine-tuning.
VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen
Cost-effective ASR training pipeline for low-resource languages using multi-iteration ASR-biased self-supervised learning.

Selected Publications

* equal contribution

ASR
Spatialemb: Extract and Encode Spatial Information for 1-Stage Multi-Channel Multi-Speaker ASR on Arbitrary Microphone Arrays
Yiwen Shao, Yong Xu, Sanjeev Khudanpur, Dong Yu
Novel spatial embedding approach for single-stage multi-channel multi-speaker ASR that works with arbitrary microphone array configurations.
Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment
Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur
Leverages target speaker's solo segments for improved multi-channel multi-speaker ASR through speaker-conditioned modeling.
RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios
Yiwen Shao, Shi-Xiong Zhang, Dong Yu
Novel spatial feature based on room impulse response for multi-channel multi-speaker ASR, achieving improved target speaker recognition.
Multi-channel multi-speaker ASR using 3D spatial feature
Yiwen Shao*, Shi-Xiong Zhang*, Dong Yu
Uses 3D spatial features from a depth camera to enable end-to-end multi-speaker ASR via an All-In-One model, achieving 31% CERR over 1D directional approaches.
PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur
A PyTorch-native implementation of the LF-MMI loss with fully parallelized CUDA kernels. Integrated into Espresso and a pioneering work toward next-gen Kaldi.

Experience

Senior Research Scientist, Tencent AI Lab (Bellevue)
2024–Present
Supervisor: Dong Yu
Working on speech and multimodal large language models, including ASR, speech LLMs, and large-scale audio pretraining.
Research Intern, Tencent AI Lab (Bellevue)
2021–2024 (multiple terms)
Supervisors: Shi-Xiong Zhang, Dong Yu
Developed audio-visual models for multi-channel multi-speaker speech recognition.
Research Assistant, CLSP, JHU
2017–2024
Worked on Automatic Speech Recognition (ASR) and related topics.

Contact

I am happy to discuss research collaborations, internships, or anything related to speech, audio, and multimodal AI.