FormosanBankDemos/formosan-mt

Source: Hugging Face. Updated 2025-12-03; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/FormosanBankDemos/formosan-mt
Description:
---
pretty_name: FormosanBank Machine Translation
license: cc-by-4.0
task_categories:
- translation
language:
# Formosan languages (ISO / Glottocode-style internal IDs)
- ami  # Amis
- bnn  # Bunun
- ckv  # Kavalan
- dru  # Rukai
- pwn  # Paiwan
- pyu  # Puyuma
- ssf  # Thao
- sxr  # Saaroa
- szy  # Sakizaya
- tao  # Yami/Tao
- tay  # Atayal
- trv  # Seediq/Truku
- tsu  # Tsou
- xnb  # Kanakanavu
- xsy  # Saisiyat
# Target languages
- en
- zh
size_categories:
- 100K<n<1M
tags:
- translation
- machine-translation
- low-resource
- endangered-languages
- formosan-languages
- text
library_name: datasets
configs:
- config_name: formosan-en
  data_files: "formosan_en_hf.csv"
- config_name: formosan-zh
  data_files: "formosan_zh_hf.csv"
---

# FormosanBank Machine Translation

Parallel corpora for 15 Indigenous Formosan languages aligned to English and Mandarin Chinese, prepared for use with the Hugging Face `datasets` library.

The dataset aggregates processed sentence- and phrase-level corpora into two CSV files:

- **Formosan → English** (`formosan_en_hf.csv`)
- **Formosan → Chinese** (`formosan_zh_hf.csv`)

Each row is a single bilingual sentence pair with language, dialect, split, and provenance metadata. The dataset is designed for training and evaluating neural machine translation (NMT) and related models for low-resource Formosan languages.

> **IMPORTANT DISCLAIMER:**
> Our Machine Translation models published on HuggingFace and in our papers were trained on this data in addition to private data not available to the public due to content restrictions.

---

## Dataset Summary

- **Total sentence pairs:** 393,634
  - **Formosan → English:** 85,144
  - **Formosan → Chinese:** 308,490
- **Languages (15):** Amis, Bunun, Kavalan, Rukai, Paiwan, Puyuma, Thao, Saaroa, Sakizaya, Yami/Tao, Atayal, Seediq/Truku, Tsou, Kanakanavu, Saisiyat
- **Targets:** English (`en`), Mandarin Chinese (`zh`)
- **Splits (all languages, both targets combined):**
  - Train: 334,772
  - Validate: 29,412
  - Test: 29,450
- **License:** CC BY 4.0
- **Format:** UTF-8 CSV, one sentence pair per row

The dataset is intended to support research on low-resource MT, cross-lingual transfer, and documentation of endangered Formosan languages.

---

## Supported Tasks and Use Cases

**Primary task**

- `translation`
  - Formosan language → English
  - Formosan language → Chinese

**Example use cases**

- Training NMT systems (e.g. NLLB / encoder–decoder models) for individual Formosan languages.
- Cross-lingual pretraining and evaluation for multilingual models.
- Dialect-aware MT experiments using the `dialect` field.
- Lexicon / dictionary-style MT from short phrases and headwords.
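
For the dialect-aware use case listed above, one possible starting point is to bucket rows by `(source_lang, dialect)` before training or evaluation. The sketch below is illustrative only: it runs on a handful of hand-written toy rows whose sentence text is placeholder data (not real corpus content), and the helper name `group_by_dialect` is hypothetical. It assumes each record is a plain `dict` carrying at least `source_lang` and `dialect` keys; rows loaded through `datasets` can be iterated as dicts of the same shape.

```python
from collections import Counter, defaultdict

# Toy records with placeholder sentences; only the field names and the
# dialect labels ("Southern", "Malan", "UNKNOWN") mirror the dataset card.
rows = [
    {"source_lang": "ami", "dialect": "Southern", "source_sentence": "<ami 1>", "target_sentence": "<en 1>"},
    {"source_lang": "ami", "dialect": "Malan", "source_sentence": "<ami 2>", "target_sentence": "<en 2>"},
    {"source_lang": "ami", "dialect": "UNKNOWN", "source_sentence": "<ami 3>", "target_sentence": "<en 3>"},
    {"source_lang": "bnn", "dialect": "UNKNOWN", "source_sentence": "<bnn 1>", "target_sentence": "<en 4>"},
    {"source_lang": "ami", "dialect": "Southern", "source_sentence": "<ami 4>", "target_sentence": "<en 5>"},
]


def group_by_dialect(records):
    """Map (source_lang, dialect) -> list of records carrying that label."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["source_lang"], rec["dialect"])].append(rec)
    return dict(groups)


groups = group_by_dialect(rows)

# Per-dialect row counts, useful for spotting labels too small to evaluate on.
dialect_counts = Counter({key: len(recs) for key, recs in groups.items()})
for (lang, dialect), n in sorted(dialect_counts.items()):
    print(f"{lang}\t{dialect}\t{n}")
```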

---

## Languages and Coverage

High-level sentence counts per language (summing both directions: Formosan→English and Formosan→Chinese):

| Language      | Formosan→English | Formosan→Chinese | Total   |
|---------------|------------------|------------------|---------|
| Amis          | 10,523           | 30,646           | 41,169  |
| Bunun         | 9,006            | 30,878           | 39,884  |
| Kavalan       | 2,098            | 14,682           | 16,780  |
| Rukai         | 11,850           | 39,360           | 51,210  |
| Paiwan        | 9,806            | 24,015           | 33,821  |
| Puyuma        | 7,199            | 26,154           | 33,353  |
| Thao          | 2,086            | 11,633           | 13,719  |
| Saaroa        | 2,130            | 9,819            | 11,949  |
| Sakizaya      | 2,132            | 11,318           | 13,450  |
| Yami/Tao      | 3,009            | 12,792           | 15,801  |
| Atayal        | 11,724           | 35,471           | 47,195  |
| Seediq/Truku  | 7,244            | 29,840           | 37,084  |
| Tsou          | 2,117            | 8,861            | 10,978  |
| Kanakanavu    | 2,105            | 11,904           | 14,009  |
| Saisiyat      | 2,115            | 11,117           | 13,232  |
| **TOTAL**     | **85,144**       | **308,490**      | **393,634** |

Many languages also include **dialect labels**, for example:

- Amis: UNKNOWN, Southern, Malan, Coastal, Xiuguluan, Hengchun
- Bunun: UNKNOWN, Junqun, Luanqun, Kaqun, Tanqun, Zhuoqun
- Paiwan, Puyuma, Rukai, Atayal, Seediq/Truku: multiple dialects
- Others (e.g. Kavalan, Thao, Saaroa, Tsou, Kanakanavu, Saisiyat, Sakizaya, Yami/Tao) currently use `UNKNOWN` dialect

Dialect coverage makes it possible to do dialect-specific MT or robustness studies.

---

## Dataset Structure

### Data Files

- `formosan_en_hf.csv` – all Formosan→English pairs
- `formosan_zh_hf.csv` – all Formosan→Chinese pairs

Each file contains all languages and splits. The **language direction** and **split** are specified per row.

### Data Fields

All CSVs share the same schema:

```text
id,source_lang,target_lang,source_sentence,target_sentence,lang_code,dialect,source,split
```

* `id` *(int)* – unique row identifier within each file.
* `source_lang` *(str)* – language code of the Formosan language (e.g. `"ami"`, `"bnn"`).
* `target_lang` *(str)* – target language code (`"en"` or `"zh"`).
* `source_sentence` *(str)* – sentence or phrase in the Formosan language.
* `target_sentence` *(str)* – translation into the target language.
* `lang_code` *(str)* – canonical code for the Formosan language (usually same as `source_lang`).
* `dialect` *(str)* – dialect label (e.g. `"Southern"`, `"Malan"`, `"UNKNOWN"`).
* `source` *(str)* – provenance string or original file path in the upstream corpora.
* `split` *(str)* – one of `"train"`, `"validate"`, `"test"`.

### Splits

Splits are defined **per row** via the `split` column:

* `train` – training data
* `validate` – development / validation data
* `test` – held-out test data

Global totals across all languages and directions:

* Train: 334,772
* Validate: 29,412
* Test: 29,450

Users can filter to any language pair and then re-group into a `DatasetDict` by `split`.

---

## How to Load the Dataset

### 1. Install dependencies

```bash
pip install datasets
# optional, if you plan to fine-tune models:
pip install transformers
```

### 2. Load the EN and ZH files from the Hub

Assume the dataset identifier is:

```text
FormosanBankDemos/formosan-mt
```

Load both CSVs:

```python
from datasets import load_dataset

HF_ID = "FormosanBankDemos/formosan-mt"

# Formosan → English
ds_en_all = load_dataset(
    HF_ID,
    data_files="formosan_en_hf.csv",
)["train"]  # entire CSV exposed as a 'train' split by default

# Formosan → Chinese
ds_zh_all = load_dataset(
    HF_ID,
    data_files="formosan_zh_hf.csv",
)["train"]
```

Alternatively, if you rely on the YAML `configs` defined above:

```python
# Uses config_name: "formosan-en" from the README metadata
ds_en_all = load_dataset(
    HF_ID,
    name="formosan-en",
    split="train",
)
```

### 3. Filter to a specific language pair (example: Amis → English, `ami → en`)

```python
ami_en = ds_en_all.filter(
    lambda ex: ex["source_lang"] == "ami" and ex["target_lang"] == "en"
)

print(ami_en)
# Dataset({
#     features: ['id', 'source_lang', 'target_lang', 'source_sentence', ...],
#     num_rows: ...
# })
```

### 4. Get train / validation / test splits

```python
from datasets import DatasetDict

def split_by_column(ds):
    # Rebuild the three splits from the per-row `split` column.
    return DatasetDict({
        "train": ds.filter(lambda ex: ex["split"] == "train"),
        "validate": ds.filter(lambda ex: ex["split"] == "validate"),
        "test": ds.filter(lambda ex: ex["split"] == "test"),
    })

ami_en_splits = split_by_column(ami_en)

print(ami_en_splits)
# DatasetDict({
#     train: Dataset({ ... })
#     validate: Dataset({ ... })
#     test: Dataset({ ... })
# })
```

### 5. (Optional) Add a `translation` column

Many translation training scripts expect a `translation` field like `{"ami": "...", "en": "..."}`. You can construct it from existing columns:

```python
def add_translation(batch):
    # Build one {source_lang: source, target_lang: target} dict per row.
    translations = []
    for src, tgt, sl, tl in zip(
        batch["source_sentence"],
        batch["target_sentence"],
        batch["source_lang"],
        batch["target_lang"],
    ):
        translations.append({sl: src, tl: tgt})
    return {"translation": translations}

ami_en_splits = ami_en_splits.map(add_translation, batched=True)

print(ami_en_splits["train"][0]["translation"])
# {'ami': "sa'osi", 'en': 'true'}
```

You can reuse the same pattern for any other language pair:

```python
# Example: Paiwan → English
pwn_en = ds_en_all.filter(
    lambda ex: ex["source_lang"] == "pwn" and ex["target_lang"] == "en"
)
pwn_en_splits = split_by_column(pwn_en)
```

---

## Intended Uses, Limitations, and Risks

### Intended Uses

* Research on **low-resource machine translation** for Formosan languages.
* Studies of **dialect variation** in MT via the `dialect` field.
* Baseline and benchmark datasets for multilingual models focusing on Austronesian languages.

### Limitations

* Domain coverage is heterogeneous (dictionary-style entries, short phrases, and some longer sentences); performance may not generalize to all real-world text genres.
* Dialect labels are not always available; some corpora use `UNKNOWN` for dialect.
* The dataset currently encodes translations only **into** English and Chinese, not between Formosan languages.
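
Before training, it may help to measure how much of a subset is affected by the first two limitations above. The following is a rough sketch, not part of the dataset's tooling: the function names, the whitespace tokenization, and the two-token cutoff for "dictionary-style" entries are all assumptions chosen for illustration, and the rows are toy records (apart from `sa'osi`, which appears in the card's own example, the sentence text is placeholder tokens).

```python
UNKNOWN_DIALECT = "UNKNOWN"  # label the card says is used when no dialect is known

# Toy records; "tokN" strings are placeholders, not real corpus text.
rows = [
    {"source_sentence": "sa'osi", "dialect": "UNKNOWN"},
    {"source_sentence": "tok1 tok2 tok3 tok4 tok5", "dialect": "Southern"},
    {"source_sentence": "tok1 tok2", "dialect": "Malan"},
    {"source_sentence": "tok1 tok2 tok3", "dialect": "UNKNOWN"},
]


def token_count(rec):
    """Whitespace token count of the source side (a crude length proxy)."""
    return len(rec["source_sentence"].split())


def summarize(records, max_entry_tokens=2):
    """Count entry-like vs sentence-like rows and known vs unknown dialects.

    `max_entry_tokens` is an arbitrary, hypothetical cutoff: rows with at most
    this many source tokens are treated as dictionary-style entries.
    """
    summary = {"entry_like": 0, "sentence_like": 0,
               "known_dialect": 0, "unknown_dialect": 0}
    for rec in records:
        length_key = "entry_like" if token_count(rec) <= max_entry_tokens else "sentence_like"
        summary[length_key] += 1
        dialect_key = "unknown_dialect" if rec["dialect"] == UNKNOWN_DIALECT else "known_dialect"
        summary[dialect_key] += 1
    return summary


summary = summarize(rows)
print(summary)
```

A cutoff like this only approximates genre; inspecting a sample of each bucket by hand is advisable before filtering on it.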

### Risks and Biases

* Source corpora may contain historical, religious, or culturally specific content that is not representative of contemporary language use.
* Translations may include inconsistencies or legacy orthography; users should verify quality before high-stakes use.
* As with any MT dataset for endangered languages, there is a risk of misinterpretation or over-reliance on automatically produced translations in sensitive cultural contexts.

Users should avoid deploying models trained on this dataset in critical or high-stakes settings without human expert review.

---

## Citation

If you use this dataset in academic work, please cite the FormosanBank project and this dataset page.

FormosanBank annotations and metadata are CC-BY-4.0. This means you must cite the source in any redistributed or derived products. For code packages, you may refer to the GitHub repository. For academic publications, you should cite:

> Mohamed, W., Le Ferrand, É., Sung, L.-M., Prud'hommeaux, E., & Hartshorne, J. K. (2024). FormosanBank. Electronic Resource.

A generic citation format for this dataset page is:

> FormosanBankDemos. *FormosanBank Machine Translation Dataset*. Hugging Face Datasets.
> Available at: [https://huggingface.co/datasets/FormosanBankDemos/formosan-mt](https://huggingface.co/datasets/FormosanBankDemos/formosan-mt)

Provider: FormosanBankDemos