fama-data
Source: ModelScope community. Updated 2025-11-27; indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/FBK-MT/fama-data
<img src="https://huggingface.co/FBK-MT/fama-small/resolve/main/FAMA.png" align="center" width="100%">
### Dataset Description, Collection, and Source
The FAMA training data is a collection of English and Italian datasets for automatic speech recognition (ASR) and speech translation (ST),
used to train the [FAMA model family](https://huggingface.co/collections/FBK-MT/fama-683425df3fb2b3171e0cdc9e).
The ASR section of FAMA is derived from the [MOSEL data collection](https://github.com/hlt-mt/mosel), including the automatic
transcripts obtained with Whisper and available in the [HuggingFace MOSEL Dataset](https://huggingface.co/datasets/FBK-MT/mosel).
The ASR section is further augmented with automatically transcribed speech from the
[YouTube-Commons dataset](https://huggingface.co/datasets/PleIAs/YouTube-Commons).
The ST section is composed of gold-labeled ST datasets and automatic translations of the ASR datasets produced with
[MADLAD-400 3B-MT](https://huggingface.co/google/madlad400-3b-mt).
The complete list of datasets for both tasks is reported in the [Dataset Statistics](#dataset-statistics) section.
- **Curated by:** Sara Papi, Marco Gaido, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, and Matteo Negri
- **Funded by:** FAIR, Meetween, and CINECA
- **Shared by:** Fondazione Bruno Kessler
### License
- CC-BY-4.0
### Dataset Sources
- **MOSEL Collection:** [MOSEL GitHub](https://github.com/hlt-mt/mosel)
- **MOSEL Pseudolabels:** [MOSEL HuggingFace](https://huggingface.co/datasets/FBK-MT/mosel)
- **YouTube-Commons:** [YouTube-Commons](https://huggingface.co/datasets/PleIAs/YouTube-Commons)
- **Paper:** [FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian](https://huggingface.co/papers/2505.22759)
## Dataset Structure
### Data Config
The dataset is split into multiple TSV files, named after the dataset and the source and target languages,
each either Italian (it) or English (en). Every file contains both the ASR transcript and its translation into the other language.
### Data Field
- `id`: unique id of the segment (text, e.g. "5NTUCHeZuds_0")
- `audio`: filename of the audio (text, e.g. "5NTUCHeZuds.wav")
- `offset`: start of the segment, in seconds (float, e.g. "0.020")
- `duration`: duration of the segment, in seconds (float, e.g. "5.946")
- `speaker`: id of the speaker (text, e.g. "000")
- `src_lang`: id of the source language (ISO 639-1 code, e.g. "it", "en")
- `src_text`: recognized text (text, e.g. "Grazie a tutti.")
- `tgt_lang`: id of the target language (ISO 639-1 code, e.g. "it", "en")
- `tgt_text`: translated text (text, e.g. "Thank you all.")
- `ASR`: True/False, indicates whether the sample has been used for ASR training
- `ST`: True/False, indicates whether the sample has been used for ST training
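As an illustration of how these fields might be consumed, the following minimal sketch parses a TSV with the standard `csv` module. It assumes that each file starts with a header row using the field names above, that text fields are not quoted, and that the `ASR`/`ST` flags are stored as the literal strings `True`/`False`; none of this is confirmed by the card, and the helper name `read_segments` is hypothetical.

```python
import csv
import io


def read_segments(tsv_file):
    """Yield one dict per segment, converting numeric and boolean fields."""
    # Assumption: tab-separated values with a header row and no CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["offset"] = float(row["offset"])
        row["duration"] = float(row["duration"])
        # Assumption: flags are the literal strings "True" / "False".
        row["ASR"] = row["ASR"] == "True"
        row["ST"] = row["ST"] == "True"
        yield row


# In-memory sample built from the example values listed above;
# the ASR/ST flag values are made up for the illustration.
sample = io.StringIO(
    "id\taudio\toffset\tduration\tspeaker\tsrc_lang\tsrc_text\ttgt_lang\ttgt_text\tASR\tST\n"
    "5NTUCHeZuds_0\t5NTUCHeZuds.wav\t0.020\t5.946\t000\tit\t"
    "Grazie a tutti.\ten\tThank you all.\tTrue\tFalse\n"
)

segments = list(read_segments(sample))
# Keep only the samples flagged as used for ST training.
st_segments = [s for s in segments if s["ST"]]
# End of the segment within the audio file, in seconds.
end_time = segments[0]["offset"] + segments[0]["duration"]
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open(path, encoding="utf-8", newline="")`.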
## Dataset Statistics
The full list of FAMA training datasets, together with the number of hours for each language/language pair and
the type of labels (A for automatic and G for gold labels) is reported below for both ASR and ST tasks.
### Automatic Speech Recognition (ASR)
| Dataset | English (h) | Italian (h) | Label |
|--------|--------|--------|-------|
| CommonVoice v18 | 1,746 | 250 | G |
| CoVoST2 | 420 | 28 | G |
| FLEURS | 7 | 9 | G |
| LibriSpeech | 358 | - | G |
| MOSEL | 66,301 | 21,775 | A |
| MLS | 44,600 | 247 | G |
| VoxPopuli-ASR | 519 | 74 | G |
| YouTube-Commons | 14,200 | 1,828 | A |
| **TOTAL** | 128,152 | 24,211 | G+A |
### Speech Translation (ST)
| Dataset | English (h) | Italian (h) | Label |
|--------|--------|--------|-------|
| CommonVoice v18 | 1,746 | 250 | A |
| CoVoST2 | 420 | 28 | A |
| LibriSpeech | 358 | - | A |
| MOSEL | 66,301 | 21,775 | A |
| MLS | 44,600 | 247 | A |
| VoxPopuli-ASR | 519 | 74 | A |
| YouTube-Commons | 14,200 | 1,828 | A |
| *TOTAL (A)* | 128,144 | 24,202 | A |
| *FILTERED (A)* | 123,777 | 23,445 | A |
| CoVoST2 | 420 | 28 | G |
| FLEURS | 7 | 9 | G |
| **TOTAL** | 124,204 | 23,482 | G+A |
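The ST totals can be cross-checked with a few lines of arithmetic. The sketch below copies the hours from the ST table above (treating the missing LibriSpeech Italian entry as 0) and recomputes the automatic subtotal, the final total (filtered automatic plus gold data), and the share of automatic data removed by filtering. The variable and function names are illustrative only.

```python
# Hours copied from the ST table above as (English, Italian); "-" counts as 0.
automatic = {
    "CommonVoice v18": (1746, 250),
    "CoVoST2": (420, 28),
    "LibriSpeech": (358, 0),
    "MOSEL": (66301, 21775),
    "MLS": (44600, 247),
    "VoxPopuli-ASR": (519, 74),
    "YouTube-Commons": (14200, 1828),
}
gold = {"CoVoST2": (420, 28), "FLEURS": (7, 9)}
filtered = (123777, 23445)  # the FILTERED (A) row


def column_sums(rows):
    """Sum the English and Italian columns over all datasets."""
    return tuple(sum(col) for col in zip(*rows.values()))


total_automatic = column_sums(automatic)  # (128144, 24202), the TOTAL (A) row
total_gold = column_sums(gold)            # (427, 37)
# Final training total: filtered automatic data plus gold data.
total = tuple(f + g for f, g in zip(filtered, total_gold))  # (124204, 23482)
# Fraction of the automatic data discarded by the filtering step.
removed_share = tuple(1 - f / t for f, t in zip(filtered, total_automatic))
```

Under these figures, filtering removes roughly 3% of the automatically translated hours in each language.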
## Dataset Creation
To reproduce the MOSEL-derived datasets (all but YouTube-Commons), please refer to the
[MOSEL README in the fbk-llm](https://github.com/hlt-mt/fbk-llm) repository and to the
[MOSEL data card on HuggingFace](https://huggingface.co/datasets/FBK-MT/mosel).
To download and process YouTube-Commons, please refer to the
[dedicated YouTube-Commons README](https://huggingface.co/datasets/FBK-MT/fama-data/blob/main/scripts/YouTube-Commons-README.md).
The code used to produce all translations with [MADLAD-400 3B-MT](https://huggingface.co/google/madlad400-3b-mt) is the following:
```python
import sys

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

modelname = "google/madlad400-3b-mt"
# Replace the two placeholders below before running the script.
batch_size = {$BATCH_SIZE}
tlang = {$LANGUAGE}


class BatchedMT:
    """Buffers input lines and translates them in batches."""

    def __init__(self, tokenizer, model):
        self.buffer_lines = []
        self.model = model
        if torch.cuda.is_available():
            self.model = self.model.cuda()
        self.tokenizer = tokenizer

    def process_line(self, line):
        self.buffer_lines.append(line.strip())
        # Translate as soon as a full batch has been collected.
        if len(self.buffer_lines) >= batch_size:
            self.print_translations()
            self.buffer_lines = []

    def print_translations(self):
        outs = self._do_translate()
        for s in outs:
            print(s)

    def _do_translate(self):
        tokens = self.tokenizer(self.buffer_lines, return_tensors="pt", padding=True)
        if torch.cuda.is_available():
            tokens = {k: v.cuda() for k, v in tokens.items()}
        translated = self.model.generate(**tokens, max_new_tokens=512)
        return [self.tokenizer.decode(t, skip_special_tokens=True) for t in translated]

    def close(self):
        # Flush the last, possibly incomplete, batch.
        if len(self.buffer_lines) > 0:
            self.print_translations()
            self.buffer_lines = []


mt = BatchedMT(
    AutoTokenizer.from_pretrained(modelname),
    AutoModelForSeq2SeqLM.from_pretrained(modelname))
for input_line in sys.stdin:
    # The "<2xx>" prefix selects the target language.
    mt.process_line("<2" + tlang + "> " + input_line)
mt.close()
```
where the input text is passed via stdin, `{$BATCH_SIZE}` is the batch size supported on your machine,
and `{$LANGUAGE}` is either `en` for Italian-to-English translation or `it` for English-to-Italian translation.
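To make the language selection concrete, the short sketch below shows how the target language follows from the source language of a segment and how the `<2xx>` prefix is attached to each input line, mirroring the string concatenation in the script. The helper names `target_language` and `tagged_input` are hypothetical and not part of the original code.

```python
# The dataset covers two languages, so the target is always the other one.
OTHER_LANGUAGE = {"it": "en", "en": "it"}


def target_language(src_lang):
    """Return the translation target for a given source language code."""
    return OTHER_LANGUAGE[src_lang]


def tagged_input(text, tlang):
    """Build the tagged line, as the script does: "<2" + tlang + "> " + text."""
    return "<2" + tlang + "> " + text.strip()


tlang = target_language("it")
line = tagged_input("Grazie a tutti.\n", tlang)
# line is now "<2en> Grazie a tutti."
```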
The script used for filtering the ST datasets is
[`filter_tsv_based_on_ratio`](https://huggingface.co/datasets/FBK-MT/fama-data/blob/main/scripts/filter_tsv_based_on_ratio.py),
available in the `scripts` folder of this repository.
For the English speech datasets, we set `--threshold-min 0.75` and `--threshold-max 1.45`;
for the Italian speech datasets, we set `--threshold-min 0.65` and `--threshold-max 1.35`.
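The idea of threshold-based ratio filtering can be sketched as follows. This is not the repository script: it assumes the ratio is the character length of the translation divided by the character length of the transcript, which may differ from the definition used in `filter_tsv_based_on_ratio`. Only the threshold values are taken from the text above; the function names and sample rows are hypothetical.

```python
# Thresholds from the text above, keyed by the language of the speech (src_lang).
THRESHOLDS = {"en": (0.75, 1.45), "it": (0.65, 1.35)}


def length_ratio(src_text, tgt_text):
    """Character-length ratio of translation to transcript (assumed definition)."""
    return len(tgt_text) / max(len(src_text), 1)


def keep_pair(src_text, tgt_text, threshold_min, threshold_max):
    """Keep the pair only if its ratio lies within the two thresholds."""
    return threshold_min <= length_ratio(src_text, tgt_text) <= threshold_max


def filter_rows(rows):
    """Yield the rows whose translation length is plausible for their language."""
    for row in rows:
        threshold_min, threshold_max = THRESHOLDS[row["src_lang"]]
        if keep_pair(row["src_text"], row["tgt_text"], threshold_min, threshold_max):
            yield row


rows = [
    {"src_lang": "it", "src_text": "Grazie a tutti.", "tgt_text": "Thank you all."},
    # Made-up example of a suspiciously short translation.
    {"src_lang": "en", "src_text": "Hello there, how are you today?", "tgt_text": "Ciao."},
]
kept = list(filter_rows(rows))
```

With these sample rows, the first pair has a ratio of 14/15 and is kept, while the second has a ratio well below 0.75 and is discarded.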
## Citation
```
@misc{papi2025fama,
title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian},
author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
year={2025}
}
```
## Dataset Card Contact
[@spapi](https://huggingface.co/spapi)
Provider: maas
Created: 2025-09-26



