wav2gloss/fieldwork
Source: Hugging Face. Updated 2024-08-01; indexed 2024-06-15.
Download link: https://hf-mirror.com/datasets/wav2gloss/fieldwork
Resource overview:
---
license: cc-by-nc-sa-4.0
dataset_info:
  features:
  - name: id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcription
    dtype: string
  - name: language
    dtype: string
  - name: speaker
    dtype: string
  - name: surface
    dtype: string
  - name: underlying
    dtype: string
  - name: gloss
    dtype: string
  - name: translation
    dtype: string
  - name: translation_language
    dtype: string
  - name: length
    dtype: float32
  - name: discard
    dtype: bool
  splits:
  - name: train
    num_bytes: 4841476668.601
    num_examples: 48987
  - name: validation
    num_bytes: 879881255.295
    num_examples: 7715
  - name: test
    num_bytes: 2556166473.915
    num_examples: 23759
  download_size: 8175211998
  dataset_size: 8277524397.811
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# Wav2Gloss Fieldwork Corpus
## Description
The Wav2Gloss Fieldwork corpus is a collection of linguistic field recordings that have previously been transcribed and glossed. The Wav2Gloss project uses the dataset to develop machine learning models that automatically generate transcriptions, morphological segmentations, glosses, and translations, with the goal of helping linguists annotate field data.
- 10-minute presentation: https://drive.google.com/file/d/1nSelG6a56924JYMnnvKysb805R-lSZIa/view?usp=sharing
- Slides: https://github.com/juice500ml/finetune_owsm/blob/main/slides.pdf
- Poster: https://github.com/juice500ml/finetune_owsm/blob/main/poster.pdf
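The following is a minimal sketch of how the corpus might be read, not an official loader. It assumes the Hugging Face `datasets` library is installed and that the repo id `wav2gloss/fieldwork` (taken from the download link) resolves on the Hub. The helper names `stream_split` and `keep_usable` are hypothetical. The card does not define the `discard` field, so treating `True` as "exclude this row" is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator


def stream_split(split: str = "validation"):
    """Stream one split instead of downloading the full archive (roughly 8 GB).

    Assumption: the repo id "wav2gloss/fieldwork" resolves on the Hugging Face Hub.
    """
    # Imported lazily so keep_usable() works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("wav2gloss/fieldwork", split=split, streaming=True)


def keep_usable(examples: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield examples whose `discard` flag is not set.

    Assumption: `discard == True` marks rows to exclude; the card does not
    define the flag, so check the project repositories before relying on it.
    """
    for example in examples:
        if not example.get("discard", False):
            yield example
```

For instance, `next(iter(keep_usable(stream_split("validation"))))` would fetch the first non-discarded validation example, whose `audio`, `transcription`, `gloss`, and `translation` fields follow the feature list in the card metadata.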
## Statistics
See below for a breakdown of languages by training and dev/test hours.
| Glottocode | Name | CC Type | Train (h) | Dev+Test (h) |
| ---------- | -------------------- | -------- | --------- | ------------ |
| `beja1238` | Beja | BY-NC | 1.55 | 0.29 |
| `ruul1235` | Ruuli | BY | 0.96 | 0.28 |
| `texi1237` | Texistepec Popoluca | BY | 0.84 | 0.26 |
| `komn1238` | Komnzo | BY | 0.73 | 0.42 |
| `arap1274` | Arapaho | BY | 0.56 | 0.88 |
| `goro1270` | Gorwaa | BY | 0.52 | 0.45 |
| `teop1238` | Teop | BY | 0.52 | 0.52 |
| `nngg1234` | Nǁng | BY | 0.52 | 0.33 |
| `sumi1235` | Sümi | BY | 0.40 | 0.40 |
| `jeju1234` | Jejuan | BY | 0.38 | 0.65 |
| `bora1263` | Bora | BY | 0.23 | 1.44 |
| `apah1238` | Yali (Apahapsili) | BY-NC-SA | 0.18 | 0.27 |
| `port1286` | Daakie | BY | 0.14 | 0.75 |
| `savo1255` | Savosavo | BY | 0.10 | 1.20 |
| `trin1278` | Mojeño Trinitario | BY | - | 1.56 |
| `sout2856` | Nafsan (South Efate) | BY-NC-SA | - | 1.55 |
| `pnar1238` | Pnar | BY-NC | - | 0.91 |
| `kaka1265` | Kakabe | BY | - | 0.90 |
| `vera1241` | Vera'a | BY | 1.02 | 0.97 |
| `tond1251` | Tondano | BY | 0.22 | 0.67 |
| `taul1251` | Tulil | BY | - | 1.18 |
| `arta1239` | Arta | BY | - | 0.91 |
| `nort2641` | Northern Kurdish | BY | - | 0.86 |
| `tehr1242` | Persian | BY | - | 0.82 |
| `taba1259` | Tabasaran | BY | - | 0.79 |
| `sanz1248` | Sanzhi Dargwa | BY | - | 0.67 |
| `kach1280` | Jinghpaw | BY | - | 0.66 |
| `mand1415` | Mandarin | BY | - | 0.66 |
| `sumb1241` | Sumbawa | BY | - | 0.63 |
| `kara1499` | Kalamang | BY | - | 0.59 |
| `slav1254` | Slavomolisano | BY-NC | 1.01 | 0.96 |
| `balk1252` | Balkan Romani | BY-NC-SA | - | 0.35 |
| `dolg1241` | Dolgan | BY-NC-SA | 11.64 | 1.23 |
| `kama1378` | Kamas | BY-NC-SA | 9.91 | 1.15 |
| `selk1253` | Selkup | BY-NC-SA | 1.70 | 1.15 |
| `even1259` | Evenki | BY-NC-SA | 1.54 | 1.13 |
| `ainu1240` | Ainu | BY-SA | 7.12 | 1.13 |
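Per-language hour totals like those in the table could be recomputed from the per-example `length` field. The sketch below makes two assumptions the card does not confirm: that `length` is measured in seconds, and that the `language` column holds the language identifier to group by. The function name `hours_by_language` is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def hours_by_language(
    examples: Iterable[Dict[str, Any]],
    length_field: str = "length",
) -> Dict[str, float]:
    """Sum example durations per language and convert them to hours.

    Assumption: `length` is a duration in seconds.
    """
    totals: Dict[str, float] = defaultdict(float)
    for example in examples:
        totals[example["language"]] += float(example[length_field]) / 3600.0
    return dict(totals)
```

The function accepts any iterable of dictionaries, so it can be applied to a loaded split or to a small hand-built sample when checking the numbers.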
## Baseline Systems
- XLS-R, WavLM: https://github.com/juice500ml/espnet/tree/wav2gloss/
- OWSM: https://github.com/juice500ml/finetune_owsm
## Citation
```bibtex
@inproceedings{he-etal-2024-wav2gloss,
title = "Wav2Gloss: Generating Interlinear Glossed Text from Speech",
author = "Taiqi He and Kwanghee Choi and Lindia Tjuatja and Nathaniel R. Robinson and Jiatong Shi and Shinji Watanabe and Graham Neubig and David R. Mortensen and Lori Levin",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2024",
publisher = "Association for Computational Linguistics",
}
```
## Corpora citations
#### Yali (Apahapsili) (apah1238)
```bibtex
@incollection{doreco-apah1238,
address = {Berlin \& Lyon},
author = {Riesberg, Sonja},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Yali (Apahapsili) DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/apah1238},
doi = {10.34847/nkl.9d91nkq2},
urldate = {07/10/2023},
year = {2022}
}
```
#### Arapaho (arap1274)
```bibtex
@incollection{doreco-arap1274,
address = {Berlin \& Lyon},
author = {Cowell, Andrew},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Arapaho DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/arap1274},
doi = {10.34847/nkl.36f5r1b6},
urldate = {07/10/2023},
year = {2022}
}
```
#### Beja (beja1238)
```bibtex
@incollection{doreco-beja1238,
address = {Berlin \& Lyon},
author = {Vanhove, Martine},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Beja DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/beja1238},
doi = {10.34847/nkl.edd011t1},
urldate = {07/10/2023},
year = {2022}
}
```
#### Bora (bora1263)
```bibtex
@incollection{doreco-bora1263,
address = {Berlin \& Lyon},
author = {Seifart, Frank},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Bora DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/bora1263},
doi = {10.34847/nkl.6eaf5laq},
urldate = {07/10/2023},
year = {2022}
}
```
#### Gorwaa (goro1270)
```bibtex
@incollection{doreco-goro1270,
address = {Berlin \& Lyon},
author = {Harvey, Andrew},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Gorwaa DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/goro1270},
doi = {10.34847/nkl.a4b4ijj2},
urldate = {07/10/2023},
year = {2022}
}
```
#### Jejuan (jeju1234)
```bibtex
@incollection{doreco-jeju1234,
address = {Berlin \& Lyon},
author = {Kim, Soung-U},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Jejuan DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/jeju1234},
doi = {10.34847/nkl.06ebrk38},
urldate = {07/10/2023},
year = {2022}
}
```
#### Kakabe (kaka1265)
```bibtex
@incollection{doreco-kaka1265,
address = {Berlin \& Lyon},
author = {Vydrina, Alexandra},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Kakabe DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/kaka1265},
doi = {10.34847/nkl.d5aeu9t6},
urldate = {07/10/2023},
year = {2022}
}
```
#### Komnzo (komn1238)
```bibtex
@incollection{doreco-komn1238,
address = {Berlin \& Lyon},
author = {Döhler, Christian},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Komnzo DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/komn1238},
doi = {10.34847/nkl.c5e6dudv},
urldate = {07/10/2023},
year = {2022}
}
```
#### Nǁng (nngg1234)
```bibtex
@incollection{doreco-nngg1234,
address = {Berlin \& Lyon},
author = {Güldemann, Tom and Ernszt, Martina and Siegmund, Sven and Witzlack-Makarevich, Alena},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Nǁng DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/nngg1234},
doi = {10.34847/nkl.f6c37fi0},
urldate = {07/10/2023},
year = {2022}
}
```
#### Pnar (pnar1238)
```bibtex
@incollection{doreco-pnar1238,
address = {Berlin \& Lyon},
author = {Ring, Hiram},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Pnar DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/pnar1238},
doi = {10.34847/nkl.5ba1062k},
urldate = {07/10/2023},
year = {2022}
}
```
#### Daakie (port1286)
```bibtex
@incollection{doreco-port1286,
address = {Berlin \& Lyon},
author = {Krifka, Manfred},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Daakie DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/port1286},
doi = {10.34847/nkl.efeav5l9},
urldate = {07/10/2023},
year = {2022}
}
```
#### Ruuli (ruul1235)
```bibtex
@incollection{doreco-ruul1235,
address = {Berlin \& Lyon},
author = {Witzlack-Makarevich, Alena and Namyalo, Saudah and Kiriggwajjo, Anatol and Molochieva, Zarina},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Ruuli DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/ruul1235},
doi = {10.34847/nkl.fde4pp1u},
urldate = {07/10/2023},
year = {2022}
}
```
#### Savosavo (savo1255)
```bibtex
@incollection{doreco-savo1255,
address = {Berlin \& Lyon},
author = {Wegener, Claudia},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Savosavo DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/savo1255},
doi = {10.34847/nkl.b74d1b33},
urldate = {07/10/2023},
year = {2022}
}
```
#### Nafsan (South Efate) (sout2856)
```bibtex
@incollection{doreco-sout2856,
address = {Berlin \& Lyon},
author = {Thieberger, Nick},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Nafsan (South Efate) DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/sout2856},
doi = {10.34847/nkl.ba4f760l},
urldate = {07/10/2023},
year = {2022}
}
```
#### Sümi (sumi1235)
```bibtex
@incollection{doreco-sumi1235,
address = {Berlin \& Lyon},
author = {Teo, Amos},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Sümi DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/sumi1235},
doi = {10.34847/nkl.5ad4t01p},
urldate = {07/10/2023},
year = {2022}
}
```
#### Teop (teop1238)
```bibtex
@incollection{doreco-teop1238,
address = {Berlin \& Lyon},
author = {Mosel, Ulrike},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Teop DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/teop1238},
doi = {10.34847/nkl.9322sdf2},
urldate = {07/10/2023},
year = {2022}
}
```
#### Texistepec Popoluca (texi1237)
```bibtex
@incollection{doreco-texi1237,
address = {Berlin \& Lyon},
author = {Wichmann, Søren},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Texistepec Popoluca DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/texi1237},
doi = {10.34847/nkl.c50ck58f},
urldate = {07/10/2023},
year = {2022}
}
```
#### Mojeño Trinitario (trin1278)
```bibtex
@incollection{doreco-trin1278,
address = {Berlin \& Lyon},
author = {Rose, Françoise},
booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
title = {Mojeño Trinitario DoReCo dataset},
url = {https://doreco.huma-num.fr/languages/trin1278},
doi = {10.34847/nkl.cbc3b4xr},
urldate = {07/10/2023},
year = {2022}
}
```
#### Dolgan (dolg1241)
```bibtex
@misc{inel-dolgan,
author = {Däbritz, Chris Lasse and
Kudryakova, Nina and
Stapert, Eugénie},
title = {INEL Dolgan Corpus},
month = nov,
year = 2022,
doi = {10.25592/uhhfdm.11165},
url = {https://doi.org/10.25592/uhhfdm.11165}
}
```
#### Evenki (even1259)
```bibtex
@misc{inel-evenki,
author = {Däbritz, Chris Lasse and
Gusev, Valentin},
title = {INEL Evenki Corpus},
month = dec,
year = 2021,
doi = {10.25592/uhhfdm.9628},
url = {https://doi.org/10.25592/uhhfdm.9628}
}
```
#### Kamas (kama1378)
```bibtex
@misc{inel-kamas,
author = {Gusev, Valentin and
Klooster, Tiina and
Wagner-Nagy, Beáta},
title = {INEL Kamas Corpus},
month = dec,
year = 2019,
doi = {10.25592/uhhfdm.9752},
url = {https://doi.org/10.25592/uhhfdm.9752}
}
```
#### Selkup (selk1253)
```bibtex
@misc{inel-selkup,
author = {Brykina, Maria and
Orlova, Svetlana and
Wagner-Nagy, Beáta},
title = {INEL Selkup Corpus},
month = dec,
year = 2021,
doi = {10.25592/uhhfdm.9754},
url = {https://doi.org/10.25592/uhhfdm.9754}
}
```
#### Arta (arta1239)
```bibtex
@incollection{arta1239,
author = {Kimoto, Yukinori},
title = {{Multi-CAST Arta}},
year = {2019},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#arta}
}
```
#### Jinghpaw (kach1280)
```bibtex
@incollection{kach1280,
author = {Kurabe, Keita},
title = {{Multi-CAST Jinghpaw}},
year = {2021},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#jinghpaw}
}
```
#### Kalamang (kara1499)
```bibtex
@incollection{kara1499,
author = {Visser, Eline},
title = {{Multi-CAST Kalamang}},
year = {2021},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#kalamang}
}
```
#### Mandarin (mand1415)
```bibtex
@incollection{mand1415,
author = {Vollmer, Maria},
title = {{Multi-CAST Mandarin}},
year = {2020},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#mandarin}
}
```
#### Northern Kurdish (nort2641)
```bibtex
@incollection{nort2641,
author = {Haig, Geoffrey and Vollmer, Maria and Thiele, Hanna},
title = {{Multi-CAST Northern Kurdish}},
year = {2015},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#nkurd}
}
```
#### Sanzhi Dargwa (sanz1248)
```bibtex
@incollection{sanz1248,
author = {Forker, Diana and Schiborr, Nils N.},
title = {{Multi-CAST Sanzhi Dargwa}},
year = {2019},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#sanzhi}
}
```
#### Sumbawa (sumb1241)
```bibtex
@incollection{sumb1241,
author = {Shiohara, Asako},
title = {{Multi-CAST Sumbawa}},
year = {2022},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#sumbawa}
}
```
#### Tabasaran (taba1259)
```bibtex
@incollection{taba1259,
author = {Bogomolova, Natalia and Ganenkov, Dmitry and Schiborr, Nils N.},
title = {{Multi-CAST Tabasaran}},
year = {2021},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#tabasaran}
}
```
#### Tulil (taul1251)
```bibtex
@incollection{taul1251,
author = {Meng, Chenxi},
title = {{Multi-CAST Tulil}},
year = {2016},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#tulil}
}
```
#### Persian (tehr1242)
```bibtex
@incollection{tehr1242,
author = {Adibifar, Shirin},
title = {{Multi-CAST Persian}},
year = {2016},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#persian}
}
```
#### Tondano (tond1251)
```bibtex
@incollection{tond1251,
author = {Brickell, Timothy},
title = {{Multi-CAST Tondano}},
year = {2016},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#tondano}
}
```
#### Vera'a (vera1241)
```bibtex
@incollection{vera1241,
author = {Schnell, Stefan},
title = {{Multi-CAST Vera'a}},
year = {2015},
editor = {Haig, Geoffrey and Schnell, Stefan},
booktitle = {{Multi-CAST}},
booksubtitle = {{Multilingual corpus of annotated spoken texts}},
note = {Version 2211},
address = {Bamberg},
publisher = {University of Bamberg},
url = {multicast.aspra.uni-bamberg.de/#veraa}
}
```
#### Balkan Romani (balk1252)
```bibtex
@misc{balk1252,
title={{Le romani (xoraxane, vlax du sud, Grèce)}},
url={https://pangloss.cnrs.fr/corpus/Romani_(Xoraxane,_Southern_Vlax,_Greece)},
journal={La collection Pangloss},
author={Adamou, Evangelia}
}
```
#### Slavomolisano (slav1254)
```bibtex
@misc{slav1254,
title={{Na-našu (slave Molisan) : Le dialecte d’acquaviva collecroce}},
url={https://pangloss.cnrs.fr/corpus/Na-na%C5%A1u_(Acquaviva_Collecroce)},
journal={La collection Pangloss},
author={Breu, Walter}
}
```
#### Ainu (ainu1240)
```bibtex
@misc{ninjal-ainu-folklore,
title={A Glossed Audio Corpus of Ainu Folklore},
url={https://ainu.ninjal.ac.jp/folklore/},
author={Nakagawa, Hiroshi and Bugaeva, Anna and Kobayashi, Miki and Yoshikawa, Yoshimi},
publisher={The National Institute for Japanese Language and Linguistics ({NINJAL})},
date={2016--2021}
}
```
Provider: wav2gloss
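## Background overview
The Wav2Gloss Fieldwork Corpus is a dataset of linguistic field recordings containing more than 80,000 examples across 37 languages (such as Beja and Ainu). It is used to train machine learning models that automatically generate transcriptions, morphological segmentations, glosses, and translations, to help linguists annotate field data. The dataset provides multiple annotation layers, including audio, transcription, and glosses, and is divided into training, validation, and test splits.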



