Open MusicLM

PyTorch implementation of MusicLM, a SOTA text-to-music model published by Google, with a few modifications. We use CLAP as a replacement for MuLan, Encodec as a replacement for SoundStream, and MERT as a replacement for w2v-BERT.

(Figures: diagram of MusicLM; diagram of CLAP)

Why CLAP?

CLAP is a joint audio-text model trained on LAION-Audio-630K. Similar to MuLan, it consists of an audio tower and a text tower that project their respective media onto a shared latent space (512 dimensions in CLAP vs 128 dimensions in MuLan).

MuLan was trained on 50 million text-music pairs. Unfortunately I don't have the data to replicate this, so I'm relying on CLAP's pretrained checkpoints to come close. CLAP was trained on 2.6 million total text-audio pairs from LAION-Audio-630K (~633k text-audio pairs) and AudioSet (2 million samples with captions generated by a keyword-to-caption model). Although this is a fraction of the data used to train MuLan, we have successfully used CLAP to generate diverse music samples, which you can listen to here (keep in mind these are very early results). In the event that CLAP's latent space is not expressive enough for music generation, we can train CLAP on music or substitute it with @lucidrains' MuLan implementation once it is trained.
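
As a quick sanity check of the shared latent space, the pretrained checkpoints can be exercised directly with the laion_clap package. A minimal sketch (the prompt and audio path are placeholders):

import laion_clap

# load pretrained CLAP weights (downloaded on the first call)
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# both towers project into the same 512-d latent space
text_embed = model.get_text_embedding(["a calm piano melody"])
audio_embed = model.get_audio_embedding_from_filelist(x=["sample.wav"])
print(text_embed.shape, audio_embed.shape)  # (1, 512) (1, 512)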

Why Encodec?

SoundStream and Encodec are both neural audio codecs that encode any waveform into a sequence of acoustic tokens, which can then be decoded back into a waveform resembling the original. Modeling these intermediate tokens thus becomes a seq2seq task. Encodec was released by Facebook with publicly available pretrained checkpoints, whereas SoundStream's are not public.
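
To illustrate (this is Encodec's public API, not this repo's internal wiring), a waveform can be round-tripped through its acoustic tokens:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# load the pretrained 24 kHz codec; the bandwidth controls how many quantizers are used
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# resample/remix an arbitrary waveform into the codec's expected format
wav, sr = torchaudio.load("sample.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)
    # acoustic tokens: (batch, n_quantizers, time)
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
    # decode the tokens back into a waveform resembling the original
    recon = model.decode(encoded_frames)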

Differences from @lucidrains' implementation

  • Autoregressively models the CLAP/MuLan conditioning signal by passing it into the transformers as discrete tokens, as mentioned in section 3.1 of the paper; musiclm-pytorch instead conditions on it with cross attention (see the sketch after this list).
  • TokenConditionedTransformer can support variable token sequences, which makes it easy to do further experimentation (e.g. combining multiple conditioning signals, stereo waveform generation, etc.)
  • Uses existing open source models instead of training MuLan and SoundStream.
  • Some modifications to increase the chance of successfully training the model.
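
To make the first point concrete, here is a schematic sketch of conditioning by token concatenation. The tensor names and shapes are illustrative, not the actual TokenConditionedTransformer interface:

import torch

# illustrative token ids for one training example
clap_tokens = torch.randint(0, 1024, (1, 12))       # discrete CLAP RVQ tokens (conditioning)
semantic_tokens = torch.randint(0, 1024, (1, 250))  # k-means semantic tokens (target)

# rather than cross-attending to a conditioning embedding, prepend the
# conditioning tokens and model the combined sequence autoregressively
sequence = torch.cat([clap_tokens, semantic_tokens], dim=-1)  # shape (1, 262)

# training computes the next-token loss only on the target positions;
# inference prompts the transformer with clap_tokens and samples the rest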

End Goal

The goal of this project is to replicate the results of MusicLM as quickly as possible without necessarily sticking to the architecture in the paper. For those looking for a more true-to-form implementation, check out musiclm-pytorch.

We also seek to gain a better understanding of CLAP's latent space.

Join us on Discord if you'd like to get involved!

Usage

Install

conda env create -f environment.yaml
conda activate open-musiclm

Configs

A "model config" contains information about the model architecture such as the number of layers, number of quantizers, target audio lengths for each stage, etc. It is used to instantiate the model during training and inference.

A "training config" contains hyperparameters for training the model. It is used to instantiate the trainer classes during training.

See the ./configs directory for example configs.

Training

CLAP RVQ

The first step is to train the residual vector quantizer that maps continuous CLAP embeds to a discrete token sequence.

python ./scripts/train_clap_rvq.py \
    --results_folder ./results/clap_rvq \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \ # path to model config
    --training_config ./configs/training/train_musiclm_fma.json # path to training config
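
The script trains this quantizer for you, but the underlying idea can be sketched with @lucidrains' vector-quantize-pytorch package (the hyperparameters below are illustrative, not the ones from the config):

import torch
from vector_quantize_pytorch import ResidualVQ

# residual VQ: each quantizer encodes the residual left over by the previous one
rvq = ResidualVQ(
    dim=512,            # CLAP embeddings live in a 512-d latent space
    num_quantizers=12,  # illustrative; the real value comes from the model config
    codebook_size=1024,
)

clap_embeds = torch.randn(8, 1, 512)  # (batch, seq, dim): one continuous CLAP embedding each
quantized, indices, commit_loss = rvq(clap_embeds)
# indices: (8, 1, 12), one discrete token per quantizer, ready for the transformer stages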

Hubert K-means

Next, we learn a K-means layer that we use to quantize our MERT embeddings into semantic tokens.

python ./scripts/train_hubert_kmeans.py \
    --results_folder ./results/hubert_kmeans \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json
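
Conceptually, this step just fits k-means over frame-level MERT embeddings and then uses cluster ids as semantic tokens. A sketch with scikit-learn (the embedding dimension and cluster count are illustrative):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# stand-in for frame-level MERT embeddings gathered from the training set
embeddings = np.random.randn(100_000, 768).astype(np.float32)

kmeans = MiniBatchKMeans(n_clusters=1024, batch_size=4096, n_init="auto")
kmeans.fit(embeddings)

# at training/inference time, each embedding frame becomes its cluster id
semantic_tokens = kmeans.predict(embeddings[:10])  # ids in [0, 1024)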

Semantic Stage + Coarse Stage + Fine Stage

Once we have a working K-means and RVQ, we can train the semantic, coarse and fine stages. These stages can be trained concurrently.

python ./scripts/train_semantic_stage.py \
    --results_folder ./results/semantic \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans
python ./scripts/train_coarse_stage.py \
    --results_folder ./results/coarse \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans
python ./scripts/train_fine_stage.py \
    --results_folder ./results/fine \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans

Preprocessing

In the above case, we use CLAP, Hubert and Encodec to generate CLAP, semantic and acoustic tokens on the fly during training. However, these models take up GPU memory, and recomputing the same tokens across multiple runs on the same data is wasteful. We can instead compute the tokens ahead of time and iterate over them during training.

To do this, fill in the data_preprocessor_cfg field in the config and set use_preprocessed_data to True in the trainer configs (look at train_fma_preprocess.json for inspiration). Then run the following to preprocess the dataset, followed by your training script.

python ./scripts/preprocess_data.py \
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_fma_preprocess.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans

Inference

Generate multiple samples and use CLAP to select the best ones:

python scripts/infer_top_match.py \
    "your text prompt" \
    --num_samples 4 \                               # number of samples to generate
    --num_top_matches 1 \                           # number of top matches to return
    --semantic_path PATH_TO_SEMANTIC_CHECKPOINT \   # path to previously trained semantic stage
    --coarse_path PATH_TO_COARSE_CHECKPOINT \       # path to previously trained coarse stage
    --fine_path PATH_TO_FINE_CHECKPOINT \           # path to previously trained fine stage
    --rvq_path PATH_TO_RVQ_CHECKPOINT \             # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT \       # path to previously trained kmeans
    --model_config ./configs/model/musiclm_small.json \
    --duration 4
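
Under the hood, top-match selection is CLAP reranking: embed the prompt and every generated clip, then keep the clips most similar to the prompt. A minimal sketch of the idea (file names are placeholders, and this is not the script's actual code):

import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

prompt = "your text prompt"
candidates = ["gen_0.wav", "gen_1.wav", "gen_2.wav", "gen_3.wav"]

text_embed = model.get_text_embedding([prompt])
audio_embeds = model.get_audio_embedding_from_filelist(x=candidates)

# rank candidates by cosine similarity to the prompt and keep the best
sims = (audio_embeds @ text_embed[0]) / (
    np.linalg.norm(audio_embeds, axis=1) * np.linalg.norm(text_embed[0]))
print(candidates[int(np.argmax(sims))])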

Generate samples for various test prompts:

python scripts/infer.py \
    --semantic_path PATH_TO_SEMANTIC_CHECKPOINT \   # path to previously trained semantic stage
    --coarse_path PATH_TO_COARSE_CHECKPOINT \       # path to previously trained coarse stage
    --fine_path PATH_TO_FINE_CHECKPOINT \           # path to previously trained fine stage
    --rvq_path PATH_TO_RVQ_CHECKPOINT \             # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT \       # path to previously trained kmeans
    --model_config ./configs/model/musiclm_small.json \
    --duration 4

You can use the --return_coarse_wave flag to skip the fine stage and reconstruct audio from coarse tokens alone.

Checkpoints

You can download experimental checkpoints for the musiclm_large_small_context model here. To fine-tune the model, call the train scripts with the --fine_tune_from flag.
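
For example, to continue training the semantic stage from a downloaded checkpoint (the paths are placeholders; use the model config that matches the checkpoint):

python ./scripts/train_semantic_stage.py \
    --results_folder ./results/semantic_finetune \
    --model_config PATH_TO_MATCHING_MODEL_CONFIG \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT \
    --fine_tune_from PATH_TO_SEMANTIC_CHECKPOINT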

Thank you

Citations

@inproceedings{Agostinelli2023MusicLMGM,
    title     = {MusicLM: Generating Music From Text},
    author    = {Andrea Agostinelli and Timo I. Denk and Zal{\'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank},
    year      = {2023}
}
@article{wu2022large,
  title     = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author    = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  journal   = {arXiv preprint arXiv:2211.06687},
  year      = {2022},
}
@article{defossez2022highfi,
  title     = {High Fidelity Neural Audio Compression},
  author    = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal   = {arXiv preprint arXiv:2210.13438},
  year      = {2022}
}
@misc{li2023mert,
  title     = {MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
  author    = {Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
  year      = {2023},
  eprint    = {2306.00107},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD}
}