shiyue/chr_en

Name: shiyue/chr_en
Creator: shiyue
Published: 2024-01-18 14:19:36
License: No description available

Source: Hugging Face. Updated 2024-01-18. Indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/shiyue/chr_en

Resource description:

---
annotations_creators:
- expert-generated
- found
- no-annotation
language_creators:
- found
language:
- chr
- en
license:
- other
multilinguality:
- monolingual
- multilingual
- translation
size_categories:
- 100K<n<1M
- 10K<n<100K
- 1K<n<10K
source_datasets:
- original
task_categories:
- fill-mask
- text-generation
- translation
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: chren
config_names:
- monolingual
- monolingual_raw
- parallel
- parallel_raw
dataset_info:
- config_name: monolingual
  features:
  - name: sentence
    dtype: string
  splits:
  - name: chr
    num_bytes: 882824
    num_examples: 5210
  - name: en5000
    num_bytes: 615275
    num_examples: 5000
  - name: en10000
    num_bytes: 1211605
    num_examples: 10000
  - name: en20000
    num_bytes: 2432298
    num_examples: 20000
  - name: en50000
    num_bytes: 6065580
    num_examples: 49999
  - name: en100000
    num_bytes: 12130164
    num_examples: 100000
  download_size: 16967664
  dataset_size: 23337746
- config_name: monolingual_raw
  features:
  - name: text_sentence
    dtype: string
  - name: text_title
    dtype: string
  - name: speaker
    dtype: string
  - name: date
    dtype: int32
  - name: type
    dtype: string
  - name: dialect
    dtype: string
  splits:
  - name: full
    num_bytes: 1210056
    num_examples: 5210
  download_size: 410646
  dataset_size: 1210056
- config_name: parallel
  features:
  - name: sentence_pair
    dtype:
      translation:
        languages:
        - en
        - chr
  splits:
  - name: train
    num_bytes: 3089562
    num_examples: 11639
  - name: dev
    num_bytes: 260401
    num_examples: 1000
  - name: out_dev
    num_bytes: 78126
    num_examples: 256
  - name: test
    num_bytes: 264595
    num_examples: 1000
  - name: out_test
    num_bytes: 80959
    num_examples: 256
  download_size: 2143266
  dataset_size: 3773643
- config_name: parallel_raw
  features:
  - name: line_number
    dtype: string
  - name: sentence_pair
    dtype:
      translation:
        languages:
        - en
        - chr
  - name: text_title
    dtype: string
  - name: speaker
    dtype: string
  - name: date
    dtype: int32
  - name: type
    dtype: string
  - name: dialect
    dtype: string
  splits:
  - name: full
    num_bytes: 5010734
    num_examples: 14151
  download_size: 2018726
  dataset_size: 5010734
configs:
- config_name: monolingual
  data_files:
  - split: chr
    path: monolingual/chr-*
  - split: en5000
    path: monolingual/en5000-*
  - split: en10000
    path: monolingual/en10000-*
  - split: en20000
    path: monolingual/en20000-*
  - split: en50000
    path: monolingual/en50000-*
  - split: en100000
    path: monolingual/en100000-*
- config_name: monolingual_raw
  data_files:
  - split: full
    path: monolingual_raw/full-*
- config_name: parallel
  data_files:
  - split: train
    path: parallel/train-*
  - split: dev
    path: parallel/dev-*
  - split: out_dev
    path: parallel/out_dev-*
  - split: test
    path: parallel/test-*
  - split: out_test
    path: parallel/out_test-*
  default: true
- config_name: parallel_raw
  data_files:
  - split: full
    path: parallel_raw/full-*
---

# Dataset Card for ChrEn

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Github repository for ChrEn](https://github.com/ZhangShiyue/ChrEn)
- **Paper:** [ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization](https://arxiv.org/abs/2010.04791)
- **Point of Contact:** [benfrey@email.unc.edu](benfrey@email.unc.edu)

### Dataset Summary

ChrEn is a Cherokee-English parallel dataset to facilitate machine translation research between Cherokee and English. ChrEn is extremely low-resource: it contains 14k sentence pairs in total, split in ways that facilitate both in-domain and out-of-domain evaluation. ChrEn also contains 5k Cherokee monolingual sentences to enable semi-supervised learning.

### Supported Tasks and Leaderboards

The dataset is intended for use in `machine-translation` between English (`en`) and Cherokee (`chr`).

### Languages

The dataset contains English (`en`) and Cherokee (`chr`) text. The data encompasses both existing dialects of Cherokee: the Overhill dialect, mostly spoken in Oklahoma (OK), and the Middle dialect, mostly used in North Carolina (NC).

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Many of the source texts were translations of English materials, which means that the Cherokee structures may not be 100% natural in terms of what a speaker might spontaneously produce. Each text was translated by people who speak Cherokee as their first language, which means there is a high probability of grammaticality. These data were originally available as PDFs. We applied optical character recognition (OCR) with the Tesseract OCR engine to extract the Cherokee and English text.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

The sentences were manually aligned by Dr. Benjamin Frey, a proficient second-language speaker of Cherokee, who also fixed the errors introduced by OCR. This process was time-consuming and took several months.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The dataset was gathered and annotated by Shiyue Zhang, Benjamin Frey, and Mohit Bansal at UNC Chapel Hill.

### Licensing Information

The copyright of the data belongs to the original book/article authors or translators (hence, the data is used for research purposes; please contact Dr. Benjamin Frey for other copyright questions).

### Citation Information

```
@inproceedings{zhang2020chren,
  title={ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization},
  author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit},
  booktitle={EMNLP2020},
  year={2020}
}
```

### Contributions

Thanks to [@yjernite](https://github.com/yjernite), [@lhoestq](https://github.com/lhoestq) for adding this dataset.

Provider:

shiyue

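Summary of original information

Dataset overview

Dataset description

Dataset summary

ChrEn is a Cherokee-English parallel dataset intended to facilitate machine translation research between Cherokee and English. It is extremely low-resource, with about 14,000 sentence pairs in total, split in ways that support both in-domain and out-of-domain evaluation. ChrEn also includes about 5,000 Cherokee monolingual sentences to enable semi-supervised learning.

Supported tasks and leaderboards

The dataset is intended for machine translation between English (en) and Cherokee (chr).

Languages

The dataset contains English (en) and Cherokee (chr) text. The data covers both existing dialects of Cherokee: the Overhill dialect, mostly spoken in Oklahoma (OK), and the Middle dialect, mostly used in North Carolina (NC).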

Dataset structure

Data instances

[More Information Needed]

Data fields

[More Information Needed]

Data splits

[More Information Needed]

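Dataset creation

Curation rationale

[More Information Needed]

Source data

Initial data collection and normalization

Many of the source texts are translations of English materials, so the Cherokee structures may not be fully natural compared with what a speaker would produce spontaneously. Each text was translated by people who speak Cherokee as their first language, so the probability of grammaticality is high. The data was originally available as PDFs. Optical character recognition (OCR) with the Tesseract OCR engine was applied to extract the Cherokee and English text.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

The sentences were manually aligned by Dr. Benjamin Frey, a proficient second-language speaker of Cherokee, who also fixed the errors introduced by OCR. This process was time-consuming and took several months.

Personal and sensitive information

[More Information Needed]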

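Considerations for using the data

Social impact of the dataset

[More Information Needed]

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

The dataset was gathered and annotated by Shiyue Zhang, Benjamin Frey, and Mohit Bansal at the University of North Carolina at Chapel Hill.

Licensing information

The copyright of the data belongs to the original book/article authors or translators (hence, the data is for research use only; please contact Dr. Benjamin Frey for other copyright questions).

Citation information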

@inproceedings{zhang2020chren, title={ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization}, author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit}, booktitle={EMNLP2020}, year={2020} }

Contributions

Thanks to @yjernite and @lhoestq for adding this dataset.

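Collected summary

Dataset introduction

Construction

Within endangered-language preservation, ChrEn represents a systematic consolidation of Cherokee language resources. Text was extracted from the original PDF documents with optical character recognition and then manually corrected and aligned. The dataset comprises monolingual and parallel corpora. The Cherokee monolingual part contains 5,210 sentences, and the English monolingual part is provided as subsets of 5,000 to 100,000 sentences. The parallel corpus contains 14,151 sentence pairs, divided into a training set (11,639 pairs), in-domain dev and test sets (1,000 pairs each), and out-of-domain dev and test sets (256 pairs each), to support both in-domain and out-of-domain evaluation.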
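
The split sizes quoted above can be cross-checked against each other. The following is a minimal sketch that copies the example counts from the card's `dataset_info` metadata; the constant and function names (`PARALLEL_SPLITS`, `total_examples`, `split_shares`) are illustrative and are not part of the dataset or any library.

```python
# Example counts copied from the dataset card's `dataset_info` metadata.
PARALLEL_SPLITS = {
    "train": 11639,
    "dev": 1000,
    "out_dev": 256,
    "test": 1000,
    "out_test": 256,
}
PARALLEL_RAW_FULL = 14151  # the single `full` split of `parallel_raw`
MONOLINGUAL_SPLITS = {
    "chr": 5210,
    "en5000": 5000,
    "en10000": 10000,
    "en20000": 20000,
    "en50000": 49999,  # the card lists 49999 examples, not 50000
    "en100000": 100000,
}


def total_examples(splits: dict[str, int]) -> int:
    """Sum the example counts over all splits of one configuration."""
    return sum(splits.values())


def split_shares(splits: dict[str, int]) -> dict[str, float]:
    """Return each split's fraction of the configuration's examples."""
    total = total_examples(splits)
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    # The five `parallel` splits together account for every pair in `parallel_raw`.
    print(total_examples(PARALLEL_SPLITS) == PARALLEL_RAW_FULL)
    for name, share in split_shares(PARALLEL_SPLITS).items():
        print(f"{name:>8}: {share:.1%}")
```

Note that the `en50000` split holds 49,999 examples according to the metadata, so code that assumes exactly 50,000 rows should read the count rather than hard-code it.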

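Features

As a low-resource machine translation resource for Cherokee and English, ChrEn covers the two main dialects of Cherokee: the Overhill dialect of Oklahoma and the Middle dialect of North Carolina, which gives a basis for research on language variation. The dataset offers four configurations: monolingual, monolingual_raw, parallel, and parallel_raw. The two raw configurations carry metadata such as text title, speaker, date, text type, and dialect, which improves traceability and supports finer-grained analysis; the monolingual and parallel configurations contain only the sentences or sentence pairs. Small but manually corrected, the dataset is well suited to developing and validating machine translation models for endangered languages.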
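
According to the card metadata, each `parallel` record has a single `sentence_pair` field holding a translation with the language keys `en` and `chr`. The sketch below shows one way to turn records of that shape into (source, target) tuples. The helper name `iter_pairs` is hypothetical, and the record shown uses placeholder strings rather than real dataset content.

```python
from typing import Iterable, Iterator, Mapping, Tuple

# Assumed record shape, inferred from the `parallel` feature definition:
#   {"sentence_pair": {"en": "...", "chr": "..."}}
Record = Mapping[str, Mapping[str, str]]


def iter_pairs(
    records: Iterable[Record], source: str = "en", target: str = "chr"
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence tuples from `parallel`-style records."""
    for record in records:
        pair = record["sentence_pair"]
        yield pair[source], pair[target]


if __name__ == "__main__":
    # Placeholder record for illustration only; not taken from the dataset.
    placeholder = [
        {"sentence_pair": {"en": "<English sentence>", "chr": "<Cherokee sentence>"}}
    ]
    print(list(iter_pairs(placeholder)))
    # Swap the arguments to translate in the other direction.
    print(list(iter_pairs(placeholder, source="chr", target="en")))
```

The same helper applies to `parallel_raw`, whose records also include a `sentence_pair` field alongside the metadata columns.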

Usage
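
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the repository id `shiyue/chr_en` from the download link above resolves on the Hub you are using. The configuration and split names are copied from the card metadata; the wrapper `load_chren` and the `CONFIG_SPLITS` table are illustrative names, not part of any library.

```python
# Configuration and split names copied from the dataset card metadata.
CONFIG_SPLITS = {
    "monolingual": ["chr", "en5000", "en10000", "en20000", "en50000", "en100000"],
    "monolingual_raw": ["full"],
    "parallel": ["train", "dev", "out_dev", "test", "out_test"],
    "parallel_raw": ["full"],
}

REPO_ID = "shiyue/chr_en"  # assumption: taken from the download link on this page


def load_chren(config: str = "parallel", split: str = "train"):
    """Load one split of one ChrEn configuration, validating the names first."""
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"unknown split {split!r}; expected one of {CONFIG_SPLITS[config]}")
    # Imported lazily so the name checks above work without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_chren("parallel", "out_test")` would request the 256-pair out-of-domain test split, and `load_chren("monolingual", "chr")` the Cherokee monolingual sentences. Both calls download data, so they need network access.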

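Background and challenges

Background overview

In language technology, the digital preservation of endangered languages and machine translation for them are receiving growing attention. ChrEn was created in 2020 by Shiyue Zhang, Benjamin Frey, and Mohit Bansal at the University of North Carolina at Chapel Hill to address machine translation between Cherokee and English. It contains about 14,000 parallel sentence pairs and about 5,000 Cherokee monolingual sentences, covering the two dialect varieties spoken in Oklahoma and North Carolina. Its core goal is to support the revitalization and transmission of Cherokee, an endangered language, through low-resource machine translation, and to provide a data foundation for preserving linguistic diversity.

Current challenges

The challenges facing ChrEn fall into two groups: those of the task domain and those of dataset construction. In the domain, low-resource machine translation suffers from data sparsity, and the complex morphology and dialect differences of Cherokee make generalization and accurate translation harder still. In construction, most source texts are translations of English materials, so the Cherokee may not read fully naturally. In addition, errors introduced by optical character recognition on Cherokee script required several months of manual correction and alignment, which illustrates the high cost and complexity of collecting and annotating endangered-language corpora.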

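Common scenarios

Typical use cases

In endangered-language preservation and machine translation research, ChrEn is a key resource for cross-lingual processing between Cherokee and English. Its most typical use is building low-resource machine translation models: with about 14,000 parallel sentence pairs and about 5,000 Cherokee monolingual sentences, researchers can train translation systems and evaluate them under extremely limited data. The dataset provides separate in-domain and out-of-domain dev and test splits, so a model's generalization across text types can be measured, which lays technical groundwork for digital preservation of and cross-lingual communication in endangered languages.

Derived and related work

A body of research has grown around ChrEn. The low-resource machine translation systems presented in the original paper serve as baselines for later work on endangered-language processing. Subsequent studies have explored transfer learning, multi-task learning, and data augmentation to improve translation quality further. The dataset has also prompted discussion of language equity and digital inclusion, drawing more attention to technical support for less widely used languages. Together, this work advances the intersection of computational linguistics and sociolinguistics in endangered-language preservation.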

Recent research on the dataset