About me
Shu-wen Yang is a final-year Ph.D. candidate in computer science at National Taiwan University (NTU), advised by Prof. Hung-yi Lee and Prof. Lin-shan Lee. He is looking for a full-time research scientist position. His research interest lies in representation learning for general speech encoders. He has published over 10 papers in top speech/audio conferences and journals, including Interspeech, ICASSP, TASLP, and ICML. His research has accumulated over 2,260 citations and an h-index of 14 on Google Scholar. He co-organized the SUPERB benchmark and challenge, now adopted by over 40 institutions. He also co-created the S3PRL speech toolkit, which has earned over 2,400 stars on GitHub and is used by more than 150 open-source projects. He gave tutorials on speech representations at NAACL 2022, ICASSP 2022, and Interspeech 2022. He co-organized the SUPERB Challenge @ IEEE SLT 2022 and the SPARKS Workshop @ IEEE ASRU 2023. He received the Google Ph.D. Fellowship in 2024. Finally, he enjoys playing the piano in his free time, under the guidance of Yiin Bin Yang. (See my hobbies)
(My Curriculum Vitae)
My primary research direction is speech representation learning, a field that has recently gone by various names. My recent efforts concentrate on self-supervised learning, representation generalizability, and audio generation.
Self-Supervised Learning (SSL): Learning speech representations from unlabeled data. We discovered that speech SSL techniques lead to representations with strong task generalizability beyond Automatic Speech Recognition (ASR). We also explored their use across a broad spectrum of real-life speech applications, which marked the beginning of the era of speech foundation models (SFMs).
Representation Generalizability: Benchmarking the task and domain generalizability of SFMs. I think deeply about the purpose and methodology of building a correct, solidly grounded benchmark, especially its role in guiding future model development.
Audio Generation: Most existing speech and audio language models rely on discrete tokens, yet the waveform is continuous in nature. We replace discrete tokens with continuous ones and achieve SOTA-level text-to-audio generation, rivaling leading diffusion models while being much faster at inference and streamable. (See the sketch below.)
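To make the discrete-versus-continuous contrast concrete, here is a toy sketch. It is an illustration only, not the architecture from our paper: the plain regression loss stands in for the richer continuous-token objectives used in practice, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy contrast between two output heads (illustration only):
# a discrete-token LM classifies over codebook indices, while a
# continuous-token LM regresses the next latent frame directly.
hidden_dim, codebook_size, latent_dim, batch = 512, 1024, 128, 4
h = torch.randn(batch, hidden_dim)  # hypothetical decoder states

# Discrete: cross-entropy over a fixed codebook of token indices.
discrete_head = nn.Linear(hidden_dim, codebook_size)
ce_loss = F.cross_entropy(discrete_head(h),
                          torch.randint(0, codebook_size, (batch,)))

# Continuous: regress the continuous-valued token (plain MSE here).
continuous_head = nn.Linear(hidden_dim, latent_dim)
mse_loss = F.mse_loss(continuous_head(h), torch.randn(batch, latent_dim))
```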
Selected Projects
I coordinated, as the research and engineering lead, the initial version of SUPERB (Speech processing Universal PERformance Benchmark). Its SFM evaluation paradigm (sketched below) has influenced numerous works, including follow-up benchmarks like SUPERB-SG, SUPERB-prosody, ML-SUPERB, and Dynamic-SUPERB, the development of SFMs such as UniSpeech-SAT and WavLM, and the compression of SFMs, including DistilHuBERT, LightHuBERT, and ARMHuBERT.
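For readers unfamiliar with the paradigm, here is a minimal sketch, simplified from the paper (the class name and shapes are my own): the SSL upstream stays frozen, and each downstream task trains only a learnable weighted sum over its layer features plus a lightweight prediction head.

```python
import torch
import torch.nn as nn

class WeightedSumProbe(nn.Module):
    """Task-specific probe over a frozen SSL upstream (simplified sketch)."""

    def __init__(self, num_layers, hidden_dim, num_classes):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_dim, num_classes)  # lightweight head

    def forward(self, hidden_states):
        # hidden_states: per-layer features from the frozen upstream,
        # each of shape (batch, time, hidden_dim).
        weights = torch.softmax(self.layer_weights, dim=0)
        fused = sum(w * h for w, h in zip(weights, hidden_states))
        return self.head(fused.mean(dim=1))  # utterance-level logits

# Example: 13 layers of 768-dim features, a 10-class utterance task.
probe = WeightedSumProbe(num_layers=13, hidden_dim=768, num_classes=10)
logits = probe([torch.randn(2, 100, 768) for _ in range(13)])
```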
I also co-founded the S3PRL Toolkit with Andy T. Liu (NTU) in 2019, with support and advice from Hung-yi Lee (NTU). Over the years, I have collaborated with over 40 contributors, to whom I extend my sincere thanks; the major contributors are highlighted in the Change Log. The toolkit supports pre-training of several classical SSL methods, benchmarking on numerous downstream tasks, and the most comprehensive collection of pre-trained SSL models for tracking research history. It is widely used by the community, including toolkits like ESPnet and S3PRL-VC, and numerous open-source projects.
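As a taste of the toolkit, here is a minimal sketch of loading a pre-trained upstream and extracting layer-wise features, based on the hub interface documented in the README (exact model names and keys may vary across versions):

```python
import torch
import s3prl.hub as hub

# Load a pre-trained upstream by name, e.g., HuBERT.
model = getattr(hub, "hubert")()
model.eval()

# Inputs are a list of 1-D waveforms sampled at 16 kHz.
wavs = [torch.randn(16000 * 2) for _ in range(2)]
with torch.no_grad():
    hidden_states = model(wavs)["hidden_states"]  # per-layer features
```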
I am always open to collaborations built on dense, deep discussion, where I can learn from new explorations and intense debates, regardless of co-authorship. If you are interested in collaborating, please reach me by email: leo19941227@gmail.com.
Selected Publications
SUPERB: Speech processing Universal PERformance Benchmark
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee
in Interspeech, 2021
arxiv / video / website / code
A Large-Scale Evaluation of Speech Foundation Models
Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee
in IEEE/ACM Transactions on Audio Speech and Language Processing, 2024
arxiv (preferred) / ieee / code
Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
Shu-wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harsha Sundar, Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang
in ICML, 2025
arxiv (coming soon)