
wav2gloss/fieldwork

Source: Hugging Face. Updated 2024-08-01; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/wav2gloss/fieldwork
Resource description:
---
license: cc-by-nc-sa-4.0
dataset_info:
  features:
  - name: id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcription
    dtype: string
  - name: language
    dtype: string
  - name: speaker
    dtype: string
  - name: surface
    dtype: string
  - name: underlying
    dtype: string
  - name: gloss
    dtype: string
  - name: translation
    dtype: string
  - name: translation_language
    dtype: string
  - name: length
    dtype: float32
  - name: discard
    dtype: bool
  splits:
  - name: train
    num_bytes: 4841476668.601
    num_examples: 48987
  - name: validation
    num_bytes: 879881255.295
    num_examples: 7715
  - name: test
    num_bytes: 2556166473.915
    num_examples: 23759
  download_size: 8175211998
  dataset_size: 8277524397.811
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Wav2Gloss Fieldwork Corpus

## Description

The Wav2Gloss Fieldwork corpus is a collection of linguistic field recordings that have previously been transcribed and glossed. The dataset is used by the Wav2Gloss project to develop machine learning models that can automatically generate transcriptions, morphological segmentations, glosses, and translations, with the goal of helping linguists annotate field data.

- 10-minute presentation: https://drive.google.com/file/d/1nSelG6a56924JYMnnvKysb805R-lSZIa/view?usp=sharing
- Slides: https://github.com/juice500ml/finetune_owsm/blob/main/slides.pdf
- Poster: https://github.com/juice500ml/finetune_owsm/blob/main/poster.pdf

## Statistics

The table below breaks down the languages by training hours and dev/test hours. A `-` in the Train column means the language has no training data.

| Glottocode | Name                 | CC Type  | Train (h) | Dev+Test (h) |
| ---------- | -------------------- | -------- | --------- | ------------ |
| `beja1238` | Beja                 | BY-NC    | 1.55      | 0.29         |
| `ruul1235` | Ruuli                | BY       | 0.96      | 0.28         |
| `texi1237` | Texistepec Popoluca  | BY       | 0.84      | 0.26         |
| `komn1238` | Komnzo               | BY       | 0.73      | 0.42         |
| `arap1274` | Arapaho              | BY       | 0.56      | 0.88         |
| `goro1270` | Gorwaa               | BY       | 0.52      | 0.45         |
| `teop1238` | Teop                 | BY       | 0.52      | 0.52         |
| `nngg1234` | Nǁng                 | BY       | 0.52      | 0.33         |
| `sumi1235` | Sümi                 | BY       | 0.40      | 0.40         |
| `jeju1234` | Jejuan               | BY       | 0.38      | 0.65         |
| `bora1263` | Bora                 | BY       | 0.23      | 1.44         |
| `apah1238` | Yali (Apahapsili)    | BY-NC-SA | 0.18      | 0.27         |
| `port1286` | Daakie               | BY       | 0.14      | 0.75         |
| `savo1255` | Savosavo             | BY       | 0.10      | 1.20         |
| `trin1278` | Mojeño Trinitario    | BY       | -         | 1.56         |
| `sout2856` | Nafsan (South Efate) | BY-NC-SA | -         | 1.55         |
| `pnar1238` | Pnar                 | BY-NC    | -         | 0.91         |
| `kaka1265` | Kakabe               | BY       | -         | 0.90         |
| `vera1241` | Vera'a               | BY       | 1.02      | 0.97         |
| `tond1251` | Tondano              | BY       | 0.22      | 0.67         |
| `taul1251` | Tulil                | BY       | -         | 1.18         |
| `arta1239` | Arta                 | BY       | -         | 0.91         |
| `nort2641` | Northern Kurdish     | BY       | -         | 0.86         |
| `tehr1242` | Persian              | BY       | -         | 0.82         |
| `taba1259` | Tabasaran            | BY       | -         | 0.79         |
| `sanz1248` | Sanzhi Dargwa        | BY       | -         | 0.67         |
| `kach1280` | Jinghpaw             | BY       | -         | 0.66         |
| `mand1415` | Mandarin             | BY       | -         | 0.66         |
| `sumb1241` | Sumbawa              | BY       | -         | 0.63         |
| `kara1499` | Kalamang             | BY       | -         | 0.59         |
| `slav1254` | Slavomolisano        | BY-NC    | 1.01      | 0.96         |
| `balk1252` | Balkan Romani        | BY-NC-SA | -         | 0.35         |
| `dolg1241` | Dolgan               | BY-NC-SA | 11.64     | 1.23         |
| `kama1378` | Kamas                | BY-NC-SA | 9.91      | 1.15         |
| `selk1253` | Selkup               | BY-NC-SA | 1.70      | 1.15         |
| `even1259` | Evenki               | BY-NC-SA | 1.54      | 1.13         |
| `ainu1240` | Ainu                 | BY-SA    | 7.12      | 1.13         |

## Baseline Systems

- XLS-R, WavLM: https://github.com/juice500ml/espnet/tree/wav2gloss/
- OWSM: https://github.com/juice500ml/finetune_owsm

## Citation

```bibtex
@inproceedings{he-etal-2024-wav2gloss,
    title = "Wav2Gloss: Generating Interlinear Glossed Text from Speech",
    author = "Taiqi He and Kwanghee Choi and Lindia Tjuatja and Nathaniel R. Robinson and Jiatong Shi and Shinji Watanabe and Graham Neubig and David R. Mortensen and Lori Levin",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    publisher = "Association for Computational Linguistics",
}
```

## Corpora citations

#### Yali (Apahapsili) (apah1238)

```bibtex
@incollection{doreco-apah1238,
    address = {Berlin \& Lyon},
    author = {Riesberg, Sonja},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Yali (Apahapsili) DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/apah1238},
    doi = {10.34847/nkl.9d91nkq2},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Arapaho (arap1274)

```bibtex
@incollection{doreco-arap1274,
    address = {Berlin \& Lyon},
    author = {Cowell, Andrew},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Arapaho DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/arap1274},
    doi = {10.34847/nkl.36f5r1b6},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Beja (beja1238)

```bibtex
@incollection{doreco-beja1238,
    address = {Berlin \& Lyon},
    author = {Vanhove, Martine},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Beja DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/beja1238},
    doi = {10.34847/nkl.edd011t1},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Bora (bora1263)

```bibtex
@incollection{doreco-bora1263,
    address = {Berlin \& Lyon},
    author = {Seifart, Frank},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Bora DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/bora1263},
    doi = {10.34847/nkl.6eaf5laq},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Gorwaa (goro1270)

```bibtex
@incollection{doreco-goro1270,
    address = {Berlin \& Lyon},
    author = {Harvey, Andrew},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Gorwaa DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/goro1270},
    doi = {10.34847/nkl.a4b4ijj2},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Jejuan (jeju1234)

```bibtex
@incollection{doreco-jeju1234,
    address = {Berlin \& Lyon},
    author = {Kim, Soung-U},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Jejuan DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/jeju1234},
    doi = {10.34847/nkl.06ebrk38},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Kakabe (kaka1265)

```bibtex
@incollection{doreco-kaka1265,
    address = {Berlin \& Lyon},
    author = {Vydrina, Alexandra},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Kakabe DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/kaka1265},
    doi = {10.34847/nkl.d5aeu9t6},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Komnzo (komn1238)

```bibtex
@incollection{doreco-komn1238,
    address = {Berlin \& Lyon},
    author = {Döhler, Christian},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Komnzo DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/komn1238},
    doi = {10.34847/nkl.c5e6dudv},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Nǁng (nngg1234)

```bibtex
@incollection{doreco-nngg1234,
    address = {Berlin \& Lyon},
    author = {Güldemann, Tom and Ernszt, Martina and Siegmund, Sven and Witzlack-Makarevich, Alena},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Nǁng DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/nngg1234},
    doi = {10.34847/nkl.f6c37fi0},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Pnar (pnar1238)

```bibtex
@incollection{doreco-pnar1238,
    address = {Berlin \& Lyon},
    author = {Ring, Hiram},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Pnar DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/pnar1238},
    doi = {10.34847/nkl.5ba1062k},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Daakie (port1286)

```bibtex
@incollection{doreco-port1286,
    address = {Berlin \& Lyon},
    author = {Krifka, Manfred},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Daakie DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/port1286},
    doi = {10.34847/nkl.efeav5l9},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Ruuli (ruul1235)

```bibtex
@incollection{doreco-ruul1235,
    address = {Berlin \& Lyon},
    author = {Witzlack-Makarevich, Alena and Namyalo, Saudah and Kiriggwajjo, Anatol and Molochieva, Zarina},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Ruuli DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/ruul1235},
    doi = {10.34847/nkl.fde4pp1u},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Savosavo (savo1255)

```bibtex
@incollection{doreco-savo1255,
    address = {Berlin \& Lyon},
    author = {Wegener, Claudia},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Savosavo DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/savo1255},
    doi = {10.34847/nkl.b74d1b33},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Nafsan (South Efate) (sout2856)

```bibtex
@incollection{doreco-sout2856,
    address = {Berlin \& Lyon},
    author = {Thieberger, Nick},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Nafsan (South Efate) DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/sout2856},
    doi = {10.34847/nkl.ba4f760l},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Sümi (sumi1235)

```bibtex
@incollection{doreco-sumi1235,
    address = {Berlin \& Lyon},
    author = {Teo, Amos},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Sümi DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/sumi1235},
    doi = {10.34847/nkl.5ad4t01p},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Teop (teop1238)

```bibtex
@incollection{doreco-teop1238,
    address = {Berlin \& Lyon},
    author = {Mosel, Ulrike},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Teop DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/teop1238},
    doi = {10.34847/nkl.9322sdf2},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Texistepec Popoluca (texi1237)

```bibtex
@incollection{doreco-texi1237,
    address = {Berlin \& Lyon},
    author = {Wichmann, Søren},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Texistepec Popoluca DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/texi1237},
    doi = {10.34847/nkl.c50ck58f},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Mojeño Trinitario (trin1278)

```bibtex
@incollection{doreco-trin1278,
    address = {Berlin \& Lyon},
    author = {Rose, Françoise},
    booktitle = {Language Documentation Reference Corpus (DoReCo) 1.2},
    editor = {Seifart, Frank and Paschen, Ludger and Stave, Matthew},
    publisher = {Leibniz-Zentrum Allgemeine Sprachwissenschaft \& laboratoire Dynamique Du Langage (UMR5596, CNRS \& Université Lyon 2)},
    title = {Mojeño Trinitario DoReCo dataset},
    url = {https://doreco.huma-num.fr/languages/trin1278},
    doi = {10.34847/nkl.cbc3b4xr},
    urldate = {07/10/2023},
    year = {2022}
}
```

#### Dolgan (dolg1241)

```bibtex
@misc{inel-dolgan,
    author = {Däbritz, Chris Lasse and Kudryakova, Nina and Stapert, Eugénie},
    title = {INEL Dolgan Corpus},
    month = nov,
    year = 2022,
    doi = {10.25592/uhhfdm.11165},
    url = {https://doi.org/10.25592/uhhfdm.11165}
}
```

#### Evenki (even1259)

```bibtex
@misc{inel-evenki,
    author = {Däbritz, Chris Lasse and Gusev, Valentin},
    title = {INEL Evenki Corpus},
    month = dec,
    year = 2021,
    doi = {10.25592/uhhfdm.9628},
    url = {https://doi.org/10.25592/uhhfdm.9628}
}
```

#### Kamas (kama1378)

```bibtex
@misc{inel-kamas,
    author = {Gusev, Valentin and Klooster, Tiina and Wagner-Nagy, Beáta},
    title = {INEL Kamas Corpus},
    month = dec,
    year = 2019,
    doi = {10.25592/uhhfdm.9752},
    url = {https://doi.org/10.25592/uhhfdm.9752}
}
```

#### Selkup (selk1253)

```bibtex
@misc{inel-selkup,
    author = {Brykina, Maria and Orlova, Svetlana and Wagner-Nagy, Beáta},
    title = {INEL Selkup Corpus},
    month = dec,
    year = 2021,
    doi = {10.25592/uhhfdm.9754},
    url = {https://doi.org/10.25592/uhhfdm.9754}
}
```

#### Arta (arta1239)

```bibtex
@incollection{arta1239,
    author = {Kimoto, Yukinori},
    title = {{Multi-CAST Arta}},
    year = {2019},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#arta}
}
```

#### Jinghpaw (kach1280)

```bibtex
@incollection{kach1280,
    author = {Kurabe, Keita},
    title = {{Multi-CAST Jinghpaw}},
    year = {2021},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#jinghpaw}
}
```

#### Kalamang (kara1499)

```bibtex
@incollection{kara1499,
    author = {Visser, Eline},
    title = {{Multi-CAST Kalamang}},
    year = {2021},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#kalamang}
}
```

#### Mandarin (mand1415)

```bibtex
@incollection{mand1415,
    author = {Vollmer, Maria},
    title = {{Multi-CAST Mandarin}},
    year = {2020},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#mandarin}
}
```

#### Northern Kurdish (nort2641)

```bibtex
@incollection{nort2641,
    author = {Haig, Geoffrey and Vollmer, Maria and Thiele, Hanna},
    title = {{Multi-CAST Northern Kurdish}},
    year = {2015},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#nkurd}
}
```

#### Sanzhi Dargwa (sanz1248)

```bibtex
@incollection{sanz1248,
    author = {Forker, Diana and Schiborr, Nils N.},
    title = {{Multi-CAST Sanzhi Dargwa}},
    year = {2019},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#sanzhi}
}
```

#### Sumbawa (sumb1241)

```bibtex
@incollection{sumb1241,
    author = {Shiohara, Asako},
    title = {{Multi-CAST Sumbawa}},
    year = {2022},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#sumbawa}
}
```

#### Tabasaran (taba1259)

```bibtex
@incollection{taba1259,
    author = {Bogomolova, Natalia and Ganenkov, Dmitry and Schiborr, Nils N.},
    title = {{Multi-CAST Tabasaran}},
    year = {2021},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#tabasaran}
}
```

#### Tulil (taul1251)

```bibtex
@incollection{taul1251,
    author = {Meng, Chenxi},
    title = {{Multi-CAST Tulil}},
    year = {2016},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#tulil}
}
```

#### Persian (tehr1242)

```bibtex
@incollection{tehr1242,
    author = {Adibifar, Shirin},
    title = {{Multi-CAST Persian}},
    year = {2016},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#persian}
}
```

#### Tondano (tond1251)

```bibtex
@incollection{tond1251,
    author = {Brickell, Timothy},
    title = {{Multi-CAST Tondano}},
    year = {2016},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#tondano}
}
```

#### Vera'a (vera1241)

```bibtex
@incollection{vera1241,
    author = {Schnell, Stefan},
    title = {{Multi-CAST Vera'a}},
    year = {2015},
    editor = {Haig, Geoffrey and Schnell, Stefan},
    booktitle = {{Multi-CAST}},
    booksubtitle = {{Multilingual corpus of annotated spoken texts}},
    note = {Version 2211},
    address = {Bamberg},
    publisher = {University of Bamberg},
    url = {multicast.aspra.uni-bamberg.de/#veraa}
}
```

#### Balkan Romani (balk1252)

```bibtex
@misc{balk1252,
    title={{Le romani (xoraxane, vlax du sud, Grèce)}},
    url={https://pangloss.cnrs.fr/corpus/Romani_(Xoraxane,_Southern_Vlax,_Greece)},
    journal={La collection Pangloss},
    author={Adamou, Evangelia}
}
```

#### Slavomolisano (slav1254)

```bibtex
@misc{slav1254,
    title={{Na-našu (slave Molisan) : Le dialecte d’acquaviva collecroce}},
    url={https://pangloss.cnrs.fr/corpus/Na-na%C5%A1u_(Acquaviva_Collecroce)},
    journal={La collection Pangloss},
    author={Breu, Walter}
}
```

#### Ainu (ainu1240)

```bibtex
@misc{ninjal-ainu-folklore,
    title={A Glossed Audio Corpus of Ainu Folklore},
    url={https://ainu.ninjal.ac.jp/folklore/},
    author={Nakagawa, Hiroshi and Bugaeva, Anna and Kobayashi, Miki and Yoshikawa, Yoshimi},
    publisher={The National Institute for Japanese Language and Linguistics ({NINJAL})},
    date={2016--2021}
}
```
Provider:
wav2gloss
Summary of original information

Wav2Gloss Fieldwork Corpus dataset overview

Dataset information

Features

  • id: string
  • audio: audio, sampled at 16000 Hz
  • transcription: string
  • language: string
  • speaker: string
  • surface: string
  • underlying: string
  • gloss: string
  • translation: string
  • translation_language: string
  • length: float (float32)
  • discard: boolean
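The feature list above maps naturally onto a typed record. The following is a minimal sketch in Python, not part of the dataset's own tooling: the names `FieldworkRow`, `STRING_FIELDS`, and `schema_problems` are hypothetical, the `audio` field is left untyped because its decoded form depends on the loading library, and the example row holds invented placeholder values.

```python
from typing import Any, TypedDict


# Field names and types follow the feature list in the dataset card.
class FieldworkRow(TypedDict):
    id: str
    audio: Any  # decoded form depends on the loading library, so left untyped
    transcription: str
    language: str
    speaker: str
    surface: str
    underlying: str
    gloss: str
    translation: str
    translation_language: str
    length: float
    discard: bool


STRING_FIELDS = (
    "id", "transcription", "language", "speaker", "surface",
    "underlying", "gloss", "translation", "translation_language",
)


def schema_problems(row: dict) -> list:
    """Return readable schema problems; an empty list means the row fits."""
    problems = [f"missing field: {name}"
                for name in FieldworkRow.__annotations__ if name not in row]
    problems += [f"{name} should be a string"
                 for name in STRING_FIELDS
                 if name in row and not isinstance(row[name], str)]
    if "length" in row and not isinstance(row["length"], (int, float)):
        problems.append("length should be a number")
    if "discard" in row and not isinstance(row["discard"], bool):
        problems.append("discard should be a boolean")
    return problems


# Hypothetical placeholder row: the values are invented for illustration only.
example = {
    "id": "row-0", "audio": None, "transcription": "", "language": "",
    "speaker": "", "surface": "", "underlying": "", "gloss": "",
    "translation": "", "translation_language": "", "length": 1.5,
    "discard": False,
}
print(schema_problems(example))
```

A check like this can be useful when exporting rows to another format, since a missing tier (for example `gloss`) is reported by name instead of surfacing later as a key error.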

Splits

  • train: 48987 examples, 4841476668.601 bytes
  • validation: 7715 examples, 879881255.295 bytes
  • test: 23759 examples, 2556166473.915 bytes
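As a quick arithmetic check on the split sizes listed above, the short Python sketch below sums the example counts and prints each split's share of the total. Only the three counts come from the dataset card; the helper name `split_shares` is made up for this illustration.

```python
# Example counts per split, copied from the dataset card.
SPLITS = {"train": 48987, "validation": 7715, "test": 23759}


def split_shares(splits):
    """Return the total example count and each split's fraction of it."""
    total = sum(splits.values())
    return total, {name: count / total for name, count in splits.items()}


total, shares = split_shares(SPLITS)
for name, share in shares.items():
    print(f"{name}: {SPLITS[name]} examples ({share:.1%})")
print(f"total: {total} examples")
```

The counts add up to 80461 examples, and the test split is roughly three times the size of the validation split.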

Size

  • Download size: 8175211998 bytes
  • Dataset size: 8277524397.811 bytes

Configuration

  • config_name: default
    • data_files:
      • train: data/train-*
      • validation: data/validation-*
      • test: data/test-*
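With a single `default` configuration and three named splits, the corpus can presumably be read with the Hugging Face `datasets` library. The sketch below is one way to do that under stated assumptions: it uses streaming so the full download (about 8 GB) is not fetched up front, it assumes `discard` is true for rows that should be skipped, and it assumes the `language` field holds the Glottocode. Neither assumption is confirmed by the dataset card, and the helper names `usable` and `load_fieldwork` are hypothetical.

```python
def usable(example):
    """Keep rows whose `discard` flag is false (assumed to mark rows to skip)."""
    return not example["discard"]


def load_fieldwork(split="validation", language=None):
    """Stream one split of wav2gloss/fieldwork, optionally for one language."""
    # Imported lazily so `usable` can be reused without the library installed.
    from datasets import load_dataset

    ds = load_dataset("wav2gloss/fieldwork", split=split, streaming=True)
    ds = ds.filter(usable)
    if language is not None:
        # Assumption: `language` stores the Glottocode, e.g. "beja1238".
        ds = ds.filter(lambda ex: ex["language"] == language)
    return ds


# Offline illustration on hypothetical rows (values invented for the example).
rows = [
    {"id": "a", "language": "beja1238", "discard": False},
    {"id": "b", "language": "beja1238", "discard": True},
]
kept = [row["id"] for row in rows if usable(row)]
print(kept)
```

Calling `load_fieldwork("test", language="ainu1240")` would then return an iterable of matching rows; it needs network access and the `datasets` package, so it is not executed here.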

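Description

The Wav2Gloss Fieldwork corpus is a collection of linguistic field recordings that have already been transcribed and glossed. The dataset is used by the Wav2Gloss project to develop machine learning models that can automatically generate transcriptions, morphological segmentations, glosses, and translations, to help linguists annotate field data.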

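Collected summary
Background and challenges
Background overview
The Wav2Gloss Fieldwork Corpus is a dataset of linguistic field recordings containing more than 80,000 examples across 37 languages (such as Beja and Ainu). It is used to train machine learning models that automatically generate transcriptions, morphological segmentations, glosses, and translations, to help linguists annotate field data. The dataset provides multi-level annotation (audio, transcription, glosses, and more) and is divided into training, validation, and test sets.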