FMA-rank
Source: ModelScope community (魔搭社区). Updated 2025-11-27; indexed 2025-05-24.
Download link: https://modelscope.cn/datasets/disco-eth/FMA-rank
# What is FMA-rank?
FMA is a music dataset from the Free Music Archive, containing over 8000 hours of Creative Commons-licensed music from 107k tracks across 16k artists and 15k albums.
It was created in 2017 by [Defferrard et al.](https://arxiv.org/abs/1612.01840) in collaboration with [Free Music Archive](https://freemusicarchive.org/).
FMA contains a lot of good music, and a lot of bad music, so the question is: can we rank the samples in FMA?
FMA-rank is a CLAP-based statistical ranking of each sample in FMA. We calculate the log-likelihood of each sample under a Gaussian estimated in the CLAP (Contrastive Language-Audio Pretraining) latent space. These values can then be used to rank and filter FMA; a higher log-likelihood is better.
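To make the scoring idea concrete, here is a minimal sketch of ranking vectors by their log-likelihood under a Gaussian fitted to the same vectors. It is not the procedure used to produce FMA-rank: the random vectors stand in for CLAP embeddings, the diagonal covariance is a simplifying assumption, and the helper names `fit_diagonal_gaussian` and `log_likelihood` are hypothetical.

```python
import math
import random


def fit_diagonal_gaussian(vectors, eps=1e-6):
    """Estimate a per-dimension mean and variance from equal-length vectors.

    Assumption: a diagonal covariance is used here for simplicity.
    """
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    variances = [
        sum((v[j] - means[j]) ** 2 for v in vectors) / (n - 1) + eps
        for j in range(dim)
    ]
    return means, variances


def log_likelihood(vector, means, variances):
    """Log-density of one vector under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(vector, means, variances)
    )


# Stand-in data: random vectors instead of real CLAP embeddings.
random.seed(0)
embeddings = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
embeddings[0] = [x + 10.0 for x in embeddings[0]]  # plant one obvious outlier

means, variances = fit_diagonal_gaussian(embeddings)
scores = [log_likelihood(v, means, variances) for v in embeddings]

# Sort indices from lowest to highest score, as the ranking does.
ranking = sorted(range(len(scores)), key=scores.__getitem__)
```

Vectors far from the bulk of the data, such as the planted outlier at index 0, receive the lowest log-likelihood and end up at the bottom of the ranking.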
# Quickstart
Download any FMA split from the official GitHub repository (https://github.com/mdeff/fma). Extract the FMA folder from the downloaded zip and set `fma_root_dir` to the path of that folder.
Run the following code snippet to load the FMA samples and filter them according to the given percentages. The snippet returns a Hugging Face audio dataset.
```Python
from datasets import load_dataset, Dataset, Audio
import os

# provide location of fma folder
fma_root_dir = "/path/to/fma/folder"

# provide percentage of fma dataset to use
# for whole dataset, use start_percentage=0 and end_percentage=100
# for worst 20% of dataset, use start_percentage=0 and end_percentage=20
# for best 20% of dataset, use the following values:
start_percentage = 80
end_percentage = 100

# load fma_rank.csv from huggingface and sort from lowest to highest
csv_loaded = load_dataset("disco-eth/FMA-rank")
fma_item_list = csv_loaded["train"]
fma_sorted_list = sorted(fma_item_list, key=lambda d: d['CLAP-log-likelihood'])


def parse_fma_audio_folder(fma_root_dir):
    """Collect the zero-padded track ids of all .mp3 files under fma_root_dir."""
    valid_fma_ids = set()  # a set keeps the membership checks below fast
    for subfolder in os.listdir(fma_root_dir):
        subfolder_path = os.path.join(fma_root_dir, subfolder)
        if os.path.isdir(subfolder_path):
            for music_file in os.listdir(subfolder_path):
                if music_file.endswith(".mp3"):
                    valid_fma_ids.add(music_file.split('.')[0])
    return valid_fma_ids


# select the existing files according to the provided fma folder
valid_fma_ids = parse_fma_audio_folder(fma_root_dir)
df_dict = {"id": [], "score": [], "audio": []}
for fma_item in fma_sorted_list:
    this_id = f"{fma_item['id']:06d}"
    if this_id in valid_fma_ids:
        df_dict["id"].append(this_id)
        df_dict["score"].append(fma_item["CLAP-log-likelihood"])
        # audio files live in a subfolder named after the first three digits of the id
        df_dict["audio"].append(os.path.join(fma_root_dir, this_id[:3], this_id + ".mp3"))

# filter the fma dataset according to the percentage defined above
i_start = int(start_percentage * len(df_dict["id"]) / 100)
i_end = int(end_percentage * len(df_dict["id"]) / 100)
df_dict_filtered = {
    "id": df_dict["id"][i_start:i_end],
    "score": df_dict["score"][i_start:i_end],
    "audio": df_dict["audio"][i_start:i_end],
}

# get final dataset
audio_dataset = Dataset.from_dict(df_dict_filtered).cast_column("audio", Audio())

# example output of print(audio_dataset); num_rows depends on the split and percentages
"""
Dataset({
    features: ['id', 'score', 'audio'],
    num_rows: 1599
})
"""
```
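The percentage filter and the id-to-path mapping used in the snippet can be tried in isolation, without downloading any audio. The sketch below uses made-up scores, and the helper names `percentile_slice` and `fma_audio_path` are hypothetical, introduced only for illustration.

```python
import os


def percentile_slice(sorted_items, start_percentage, end_percentage):
    """Return the items between two percentage boundaries of a sorted list."""
    if not 0 <= start_percentage <= end_percentage <= 100:
        raise ValueError("expected 0 <= start_percentage <= end_percentage <= 100")
    n = len(sorted_items)
    i_start = int(start_percentage * n / 100)
    i_end = int(end_percentage * n / 100)
    return sorted_items[i_start:i_end]


def fma_audio_path(fma_root_dir, track_id):
    """Build the path <root>/<first three digits>/<six-digit id>.mp3."""
    padded = f"{int(track_id):06d}"
    return os.path.join(fma_root_dir, padded[:3], padded + ".mp3")


# Made-up log-likelihood scores, already sorted from lowest to highest.
scores = [-30.0, -12.5, -8.0, -7.1, -3.3, -2.0, -1.5, -1.2, -0.9, -0.4]

best_20 = percentile_slice(scores, 80, 100)   # the two highest scores
worst_20 = percentile_slice(scores, 0, 20)    # the two lowest scores
example_path = fma_audio_path("fma_small", 2)
```

Because the list is sorted in ascending order, the range 80 to 100 selects the highest-scoring samples and the range 0 to 20 selects the lowest-scoring ones.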
Provider: maas
Created: 2025-05-21



