
FMA-rank

ModelScope community. Updated 2025-11-27, indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/disco-eth/FMA-rank
Description:
# What is FMA-rank?

FMA is a music dataset from the Free Music Archive, containing over 8000 hours of Creative Commons-licensed music from 107k tracks across 16k artists and 15k albums. It was created in 2017 by [Defferrard et al.](https://arxiv.org/abs/1612.01840) in collaboration with [Free Music Archive](https://freemusicarchive.org/).

FMA contains a lot of good music and a lot of bad music, so the question is: can we rank the samples in FMA?

FMA-rank is a CLAP-based (Contrastive Language-Audio Pretraining) statistical ranking of each sample in FMA. We calculate the log-likelihood of each sample in FMA belonging to an estimated Gaussian in the CLAP latent space; using these values, we can rank and filter FMA. Higher log-likelihood values are better.

# Quickstart

Download any FMA split from the official GitHub repository, https://github.com/mdeff/fma. Extract the FMA folder from the downloaded zip and set `fma_root_dir` to the path of that folder. Run the following code snippet to load and filter the FMA samples according to the given percentages. The snippet returns a Hugging Face audio dataset.

```Python
from datasets import load_dataset, Dataset, Audio
import os

# provide location of fma folder
fma_root_dir = "/path/to/fma/folder"

# provide percentage of fma dataset to use
# for whole dataset, use start_percentage=0 and end_percentage=100
# for worst 20% of dataset, use start_percentage=0 and end_percentage=20
# for best 20% of dataset, use the following values:
start_percentage = 80
end_percentage = 100

# load fma_rank.csv from huggingface and sort from lowest to highest
csv_loaded = load_dataset("disco-eth/FMA-rank")
fma_item_list = csv_loaded["train"]
fma_sorted_list = sorted(fma_item_list, key=lambda d: d['CLAP-log-likelihood'])


def parse_fma_audio_folder(fma_root_dir):
    valid_fma_ids = []
    subfolders = os.listdir(fma_root_dir)
    for subfolder in subfolders:
        subfolder_path = os.path.join(fma_root_dir, subfolder)
        if os.path.isdir(subfolder_path):
            music_files = os.listdir(subfolder_path)
            for music_file in music_files:
                if ".mp3" not in music_file:
                    continue
                else:
                    fma_id = music_file.split('.')[0]
                    valid_fma_ids.append(fma_id)
    return valid_fma_ids


# select the existing files according to the provided fma folder
valid_fma_ids = parse_fma_audio_folder(fma_root_dir)
df_dict = {"id": [], "score": [], "audio": []}
for fma_item in fma_sorted_list:
    this_id = f"{fma_item['id']:06d}"
    if this_id in valid_fma_ids:
        df_dict["id"].append(this_id)
        df_dict["score"].append(fma_item["CLAP-log-likelihood"])
        df_dict["audio"].append(os.path.join(fma_root_dir, this_id[:3], this_id + ".mp3"))

# filter the fma dataset according to the percentage defined above
i_start = int(start_percentage * len(df_dict["id"]) / 100)
i_end = int(end_percentage * len(df_dict["id"]) / 100)
df_dict_filtered = {
    "id": df_dict["id"][i_start:i_end],
    "score": df_dict["score"][i_start:i_end],
    "audio": df_dict["audio"][i_start:i_end],
}

# get final dataset
audio_dataset = Dataset.from_dict(df_dict_filtered).cast_column("audio", Audio())

"""
Dataset({
    features: ['id', 'score', 'audio'],
    num_rows: 1599
})
"""
```
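
To make the scoring idea concrete, here is a minimal sketch of how a per-sample log-likelihood under an estimated Gaussian could be computed and used for ranking. It is an illustration under stated assumptions, not the authors' pipeline: the card only says "an estimated Gaussian in the CLAP latent space", so the single full-covariance Gaussian, the regularization term `reg`, and the function name `gaussian_log_likelihood` are all hypothetical choices, and random vectors stand in for real CLAP embeddings.

```python
import numpy as np


def gaussian_log_likelihood(embeddings, reg=1e-6):
    """Log-density of each row under one Gaussian fitted to all rows.

    Assumption: a single full-covariance Gaussian; `reg` is a small ridge
    added to the covariance diagonal so that it stays invertible.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n_samples, dim = x.shape
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False)) + reg * np.eye(dim)

    centered = x - mean
    # Squared Mahalanobis distance of every sample, via a linear solve
    solved = np.linalg.solve(cov, centered.T).T
    mahalanobis_sq = np.einsum("ij,ij->i", centered, solved)

    # Log-determinant computed stably with slogdet
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + mahalanobis_sq)


# Stand-in data: 200 random 8-dimensional vectors instead of CLAP embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 8))
fake_embeddings[0] += 10.0  # push sample 0 far away from the rest

scores = gaussian_log_likelihood(fake_embeddings)
ranking = np.argsort(scores)  # indices sorted from lowest to highest score
```

In this toy setup the displaced sample receives the lowest log-likelihood and lands at the front of `ranking`, which mirrors how the Quickstart snippet sorts tracks from lowest to highest `CLAP-log-likelihood` before slicing by percentage.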

Provider:
maas
Created:
2025-05-21