About me

Hi! I am Leo, a Ph.D. candidate in computer science at National Taiwan University (NTU), advised by Hung-yi Lee and Lin-shan Lee in the Speech Processing and Machine Learning (SPML) Group. I am fortunate to work with several thoughtful and respected senior researchers, including Shinji Watanabe (CMU) and Abdelrahman Mohamed (Meta/Rembrand). I also learn a lot from, and am continuously motivated by, several strong peers, including Jiatong Shi (CMU), Wen-Chin Huang (Nagoya), and Andy T. Liu (NTU). Finally, I enjoy playing the piano in my free time, under the guidance of Yiin Bin Yang. (See my hobbies)

(My Curriculum Vitae)


My research is dedicated to developing a human-level perception system that comprehensively understands speech, from acoustics to linguistics, and its interplay with other modalities such as audio, vision, and natural language. My primary focus is representation learning, a field that has gone by various names recently. My recent efforts concentrate on self-supervised learning, representation generalizability, and efficient pre-training.

  • Self-Supervised Learning (SSL): Learning speech representations from unlabeled data. We found that speech SSL techniques yield representations that generalize well to tasks beyond Automatic Speech Recognition (ASR). We also explore their use across a broad spectrum of real-life speech applications, which marks the beginning of the era of speech foundation models (SFMs).

  • Representation Generalizability: Benchmarking the task and domain generalizability of SFMs. I think deeply about the purpose and methods of creating a correct and solidly grounded benchmark, especially regarding its important role in guiding future model development.

  • Efficient Pre-training: All existing SFMs require industrial-level computing, which leaves further research effectively monopolized by large corporations. I am currently working on how to pre-train SFMs efficiently with the computing resources available in academia.


I coordinated, as the research and engineering lead, the initial version of SUPERB (Speech processing Universal PERformance Benchmark). The speech foundation model (SFM) paradigm it proposed has influenced numerous works, including later benchmarks such as SUPERB-SG, SUPERB-prosody, ML-SUPERB, and Dynamic-SUPERB. This influence extends to the development of SFMs, such as UniSpeech-SAT and WavLM, and to the compression of SFMs, including DistilHuBERT, LightHuBERT, and ARMHuBERT.

I also co-founded the S3PRL Toolkit with Andy T. Liu (NTU) in 2019, with support and advice from Hung-yi Lee (NTU). Over the years, I have collaborated with more than 40 contributors, to whom I extend my sincere thanks. The major contributors are highlighted in the Change Log. The toolkit supports the pre-training of several classical SSL methods, supports benchmarking on numerous downstream tasks, and offers the most comprehensive collection of pre-trained SSL models for tracking research history. It is widely used by the community, including in toolkits like ESPnet and S3PRL-VC and in numerous open-source projects.

I am always open to collaborations involving dense and deep discussions, where I can learn from new explorations and intense debates, regardless of co-authorship. If you are interested in collaborating, please reach me by email at leo19941227@gmail.com.

Selected Publications
