Yu Zhang, Wenxiang Guo, Changhao Pan, Dongyu Yao, Zhiyuan Zhu, Ziyue Jiang, Yuhan Wang, Tao Jin, Zhou Zhao | Zhejiang University
PyTorch implementation of TCSinger 2 (ACL 2025): Customizable Multilingual Zero-shot Singing Voice Synthesis.
Visit our demo page for audio samples.
- 2025.07: We released the code of TCSinger 2!
- 2025.07: We released the code of STARS!
- 2025.05: TCSinger 2 is accepted by ACL 2025!
- We present TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.
- We introduce the Blurred Boundary Content Encoder for robust modeling and smooth transitions of phoneme and note boundaries.
- We design the Custom Audio Encoder using contrastive learning to extract styles from various prompts, while the Flow-based Custom Transformer with Cus-MOE and F0 enhances synthesis quality and style modeling.
- Experimental results show that TCSinger 2 outperforms baseline models in subjective and objective metrics across multiple tasks: zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.
We provide an example of how you can train your own model and infer with TCSinger 2.
To try on your own dataset, clone this repo on your local machine with NVIDIA GPU + CUDA cuDNN and follow the instructions below.
A suitable conda environment named `tcsinger2` can be created and activated with:
conda create -n tcsinger2 python=3.10
conda activate tcsinger2
conda install --yes --file requirements.txt
By default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`. You can specify which GPUs to use by setting the `CUDA_DEVICES_AVAILABLE` environment variable before running the training module.
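A quick way to check how many GPUs a run will parallelize over, and what the restriction looks like if set from Python rather than the shell. This is only a sketch: it assumes the training entry point reads `CUDA_DEVICES_AVAILABLE` at startup, as described above, and the `"0,1"` selection is a made-up example.

```python
import os
import torch

# By default, training parallelizes over this many visible GPUs.
print(torch.cuda.device_count())

# Illustrative: restrict the run to GPUs 0 and 1. Normally you would
# export this in the shell before launching the training module.
os.environ["CUDA_DEVICES_AVAILABLE"] = "0,1"
```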
- Collect your own singing dataset (e.g., GTSinger), and feel free to add extra data annotated with alignment tools such as STARS.
- Place `metadata.json` (fields: `ph`, `word`, `item_name`, `ph_durs`, `wav_fn`, `singer`, `ep_pitches`, `ep_notedurs`, `ep_types`, `emotion`, `singing_method`, `technique`) and `phone_set.json` (the complete phoneme list) in the desired folder, and update the paths in `preprocess/preprocess.py`. (A reference `metadata.json` is provided in GTSinger; an illustrative entry is also sketched below.) Write the `singer` attribute as a description specifying the performer's gender and vocal range, and write the `technique` attribute either as a concise listing of skills or as a natural-language account that conveys their sequential order.
- Extract F0 for each `.wav` and save it as `*_f0.npy`, e.g., with RMVPE (a sketch follows these steps).
- Download HiFi-GAN as the vocoder into `useful_ckpts/hifigan` and FLAN-T5 into `useful_ckpts/flan-t5-large`.
- Preprocess the dataset:
export PYTHONPATH=.
python preprocess/preprocess.py
Tip: You may also convert your dataset directly to a `.csv` instead of using `metadata.json`.
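For reference, here is a hedged sketch of what a single `metadata.json` entry might look like. The field names follow the list above, but every value is illustrative rather than taken from GTSinger:

```python
# One illustrative metadata entry; all values below are made up.
entry = {
    "item_name": "singer1#song1#0000",            # hypothetical unique utterance ID
    "wav_fn": "data/raw/singer1/song1/0000.wav",  # hypothetical path to the audio file
    "ph": ["t", "a", "o"],                        # phoneme sequence
    "word": ["tao"],                              # word sequence
    "ph_durs": [0.12, 0.30, 0.25],                # per-phoneme durations (seconds, assumed)
    "ep_pitches": [62, 64, 64],                   # note pitches (assumed MIDI numbers)
    "ep_notedurs": [0.42, 0.25, 0.25],            # note durations
    "ep_types": [1, 1, 2],                        # note types (e.g., slur markers, assumed)
    "singer": "a female singer with a soprano vocal range",  # free-text description
    "emotion": "happy",
    "singing_method": "bel canto",
    "technique": "mixed voice, then vibrato",     # skill list or natural-language account
}
```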
- Compute mel-spectrograms:
python preprocess/mel_spec_48k.py --tsv_path data/new/data.tsv --num_gpus 1 --max_duration 20
- Post-process:
python preprocess/postprocess_data.py
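As promised above, a minimal sketch of the F0-extraction step. It uses `librosa.pyin` as a stand-in extractor (the steps above suggest RMVPE), and the file path is hypothetical:

```python
import numpy as np
import librosa

# Hypothetical input file; in practice, loop over every wav_fn in metadata.json.
wav_fn = "data/raw/singer1/song1/0000.wav"

# Load at the file's native sampling rate.
y, sr = librosa.load(wav_fn, sr=None)

# pyin returns frame-wise F0 in Hz, with NaN for unvoiced frames.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr,
)
f0 = np.nan_to_num(f0)  # zero out unvoiced frames

# Save next to the wav as *_f0.npy.
np.save(wav_fn.replace(".wav", "_f0.npy"), f0)
```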
- Train the VAE module and duration predictor:
python main.py --base configs/ae_singing.yaml -t --gpus 0,1,2,3,4,5,6,7
- Train the main TCSinger 2 model:
python main.py --base configs/tcsinger2.yaml -t --gpus 0,1,2,3,4,5,6,7
Notes
- Adjust the compression ratio in the config files (and related scripts).
- Change the padding length in the dataloader as needed.
- To train the Custom Audio Encoder, format data as in `ldm/data/joinaudiodataset_con.py`, set the trained VAE path in `ae_con.yaml`, and proceed with training.
python scripts/test_sing.py
Replace the checkpoint path and the CFG (classifier-free guidance) coefficient as required. For speech inputs, modify the VAE accordingly.
This implementation uses parts of the code from the following GitHub repos, as described in our code: Make-An-Audio-3, TCSinger, and Lumina-T2X.
If you find this code useful in your research, please cite our work:
@article{zhang2025tcsinger,
title={TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis},
author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Yao, Dongyu and Zhu, Zhiyuan and Jiang, Ziyue and Wang, Yuhan and Jin, Tao and Zhao, Zhou},
journal={arXiv preprint arXiv:2505.14910},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's singing without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you may be in violation of copyright laws.