FMA-rank

Name: FMA-rank
Creator: maas
Published: 2025-11-27 16:34:33
License: No description available

ModelScope community; updated 2025-11-27; indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/disco-eth/FMA-rank

Description:

# What is FMA-rank?

FMA is a music dataset from the Free Music Archive, containing over 8000 hours of Creative Commons-licensed music from 107k tracks across 16k artists and 15k albums. It was created in 2017 by [Defferrard et al.](https://arxiv.org/abs/1612.01840) in collaboration with [Free Music Archive](https://freemusicarchive.org/).

FMA contains a lot of good music and a lot of bad music, so the question is: can we rank the samples in FMA?

FMA-rank is a CLAP-based statistical ranking of each sample in FMA. We calculate the log-likelihood of each sample in FMA under a Gaussian estimated in the CLAP latent space; these values can then be used to rank and filter FMA. Higher log-likelihood values are better.

# Quickstart

Download any FMA split from the official GitHub repository, https://github.com/mdeff/fma. Extract the FMA folder from the downloaded zip and set `fma_root_dir` to its path. Run the following code snippet to load the FMA samples and keep only those that fall inside the chosen percentage range of the ranking. The snippet returns a Hugging Face audio dataset.
```Python
from datasets import load_dataset, Dataset, Audio
import os

# provide location of fma folder
fma_root_dir = "/path/to/fma/folder"

# provide percentage of fma dataset to use
# for whole dataset, use start_percentage=0 and end_percentage=100
# for worst 20% of dataset, use start_percentage=0 and end_percentage=20
# for best 20% of dataset, use the following values:
start_percentage = 80
end_percentage = 100

# load fma_rank.csv from huggingface and sort from lowest to highest
csv_loaded = load_dataset("disco-eth/FMA-rank")
fma_item_list = csv_loaded["train"]
fma_sorted_list = sorted(fma_item_list, key=lambda d: d['CLAP-log-likelihood'])


def parse_fma_audio_folder(fma_root_dir):
    # collect the ids of all .mp3 files one level below fma_root_dir;
    # a set keeps the membership test in the loop below fast
    valid_fma_ids = set()
    for subfolder in os.listdir(fma_root_dir):
        subfolder_path = os.path.join(fma_root_dir, subfolder)
        if os.path.isdir(subfolder_path):
            for music_file in os.listdir(subfolder_path):
                if music_file.endswith(".mp3"):
                    valid_fma_ids.add(music_file.split('.')[0])
    return valid_fma_ids


# select the existing files according to the provided fma folder
valid_fma_ids = parse_fma_audio_folder(fma_root_dir)
df_dict = {"id": [], "score": [], "audio": []}
for fma_item in fma_sorted_list:
    # FMA file names are zero-padded to six digits, e.g. 000002.mp3
    this_id = f"{fma_item['id']:06d}"
    if this_id in valid_fma_ids:
        df_dict["id"].append(this_id)
        df_dict["score"].append(fma_item["CLAP-log-likelihood"])
        # the first three digits of the id name the subfolder
        df_dict["audio"].append(os.path.join(fma_root_dir, this_id[:3], this_id + ".mp3"))

# filter the fma dataset according to the percentage defined above
i_start = int(start_percentage * len(df_dict["id"]) / 100)
i_end = int(end_percentage * len(df_dict["id"]) / 100)
df_dict_filtered = {
    "id": df_dict["id"][i_start:i_end],
    "score": df_dict["score"][i_start:i_end],
    "audio": df_dict["audio"][i_start:i_end],
}

# get final dataset
audio_dataset = Dataset.from_dict(df_dict_filtered).cast_column("audio", Audio())
"""
Dataset({
    features: ['id', 'score', 'audio'],
    num_rows: 1599
})
"""
```
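The scoring idea described under "What is FMA-rank?" can be illustrated with a minimal sketch. It is not the authors' implementation: the description does not say how the Gaussian was estimated, so this assumes a single full-covariance Gaussian fitted to a reference set of embeddings, and it uses random vectors as a hypothetical stand-in for real CLAP embeddings. The helper names `fit_gaussian` and `gaussian_log_likelihood` and the `ridge` parameter are made up for illustration.

```python
import numpy as np


def fit_gaussian(embeddings, ridge=1e-6):
    """Estimate the mean and a regularised covariance from row-wise embeddings."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    # Assumption: a small ridge term keeps the covariance invertible.
    cov = cov + ridge * np.eye(cov.shape[0])
    return mean, cov


def gaussian_log_likelihood(embeddings, mean, cov):
    """Log-density of each row of `embeddings` under N(mean, cov)."""
    dim = mean.shape[0]
    centered = embeddings - mean
    # Solve cov @ x = centered.T rather than forming an explicit inverse.
    solved = np.linalg.solve(cov, centered.T).T
    mahalanobis_sq = np.einsum("ij,ij->i", centered, solved)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + mahalanobis_sq)


# Hypothetical stand-in for CLAP embeddings: 500 random 8-dimensional vectors.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
typical = np.zeros((1, 8))       # close to the centre of the reference set
outlier = np.full((1, 8), 6.0)   # far from the reference set

mean, cov = fit_gaussian(reference)
scores = gaussian_log_likelihood(np.vstack([typical, outlier]), mean, cov)

# Ascending order, matching the quickstart: lowest log-likelihood first.
ranking = np.argsort(scores)
```

In this toy setup the sample far from the reference distribution gets the lower log-likelihood and therefore sorts first, which mirrors how the quickstart places the lowest-scoring tracks at the start of the sorted list.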

Provider: maas

Created: 2025-05-21
