five

Heng666/Traditional_Chinese-aya_dataset

收藏
Hugging Face2024-02-19 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/Heng666/Traditional_Chinese-aya_dataset
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: inputs dtype: string - name: targets dtype: string - name: language dtype: string - name: language_code dtype: string - name: annotation_type dtype: string - name: user_id dtype: string splits: - name: train num_bytes: 1990086 num_examples: 4909 download_size: 981588 dataset_size: 1990086 configs: - config_name: default data_files: - split: train path: data/train-* license: apache-2.0 task_categories: - question-answering - translation - summarization - zero-shot-classification language: - zh pretty_name: Traditional_Chinese-aya_dataset size_categories: - 1M<n<10M --- ![Traditional_Chinese_Aya Header](https://huggingface.co/datasets/Heng666/Traditional_Chinese-aya_dataset/resolve/main/Traditional_Chinese_Aya_header.jpeg) <!-- Provide a quick summary of the dataset. --> ## 資料集描述 **繁體中文 Aya (Traditional Chinese Aya Chinese;TCA):專注於繁體中文處理的 Aya 集合的精選子集** ### 概述 `繁體中文 Aya` 是一個精心策劃的資料集,源自 [CohereForAI](https://huggingface.co/CohereForAI) 的綜合 Aya 集合,特別關注繁體中文文本資料。 此資料集結合了來自 [CohereForAI/aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset),過濾掉除繁體中文、簡體中文內容之外的所有內容。 ### 目標 `繁體中文 Aya` 的目標是為研究人員、技術專家和語言學家提供即用型繁體中文文本資源,顯著減少專注於繁體中文的 NLP 和 AI 專案中數據預處理所需的時間和精力。 ### 資料集來源與資訊 - **資料來源**: 從 [CohereForAI/aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) 2 個子集而來。 - **語言**: 繁體中文、簡體中文('zho') - **應用**: 非常適合語言建模、文本分類、情感分析、和機器翻譯等任務。 - **論文連結:** [2402.06619](https://huggingface.co/papers/2402.06619) - **維護人:** [Heng666](https://huggingface.co/Heng666) - **License:** Apache-2.0 ### 使用方法 此資料集是開始繁體中文語言專案(從學術研究到商業應用)的基礎工具。 透過提供預先過濾的繁體中文文本來源,`繁體中文 Aya` 讓研究人員、技術專家和開發人員能夠直接進行模型訓練、分析和應用程式開發,而無需進行資料清理和語言過濾的初步麻煩。 展示範例 ```python from datasets import load_dataset dataset = load_dataset("Heng666/Traditional_Chinese-aya_dataset", "default") ``` 在上面的程式碼片段中,「aya_dataset」指的是原始 「aya_dataset」中「default」子集的繁體中文版本。 您可以透過在載入資料集時指定其名稱來載入其他子集。 ### 訪問和貢獻 可在 [Heng666/Traditional_Chinese-aya_dataset](https://huggingface.co/datasets/Heng666/Traditional_Chinese-aya_dataset) 下的 Hugging Face Hub 上獲取, `繁體中文 Aya` 邀請社區做出貢獻。鼓勵用戶提供回饋、提出改進建議。 ### 支持與合作 我們致力於圍繞繁體中文人工智慧和 NLP 研究創造一個包容和支持的環境。如需支援、協作或有關資料集的疑問,請透過 Hugging Face Hub 的討論部分進行聯絡。 # Original Dataset Card of Aya by CohereForAI ![Aya Header](https://huggingface.co/datasets/CohereForAI/aya_dataset/resolve/main/aya_header.png) # Dataset Summary The `Aya Dataset` is a multilingual instruction fine-tuning dataset curated by an open-science community via [Aya Annotation Platform](https://aya.for.ai/) from Cohere For AI. The dataset contains a total of 204k human-annotated prompt-completion pairs along with the demographics data of the annotators.<br> This dataset can be used to train, finetune, and evaluate multilingual LLMs. - **Curated by:** Contributors of [Aya Open Science Intiative](https://aya.for.ai/). - **Language(s):** 65 languages (71 including dialects & scripts). - **License:** [Apache 2.0](https://opensource.org/license/apache-2-0) - **Aya Datasets Family:** | Name | Explanation | |------|--------------| | [aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) | Human-annotated multilingual instruction finetuning dataset, comprising over 204K instances across 65 languages. | | [aya_collection](https://huggingface.co/datasets/CohereForAI/aya_collection) | Created by applying instruction-style templates from fluent speakers to 44 datasets, including translations of 19 instruction-style datasets into 101 languages, providing 513M instances for various tasks.| | [aya_evaluation_suite](https://huggingface.co/datasets/CohereForAI/aya_evaluation_suite) | A diverse evaluation set for multilingual open-ended generation, featuring 250 culturally grounded prompts in 7 languages, 200 translated prompts in 24 languages, and human-edited versions selected for cross-cultural relevance from English Dolly in 6 languages.| # Dataset The `Aya Dataset` comprises of two types of data: 1. **Human Annotations:** Original annotations (brand new prompts and completions written by annotators) and re-annotations (human edits of automatically generated prompts and completions). 2. **Demographics Data:** Anonymized information for each annotator. ## Load with Datasets To load this dataset consisting of both prompt-completions and demographics data with `datasets`, you'll just need to install Datasets as `pip install datasets --upgrade` and then use the following code: ```python from datasets import load_dataset # Load the annotations dataset aya_dataset = load_dataset("CohereForAI/aya_dataset") # Load the demographics dataset aya_demographics = load_dataset("CohereForAI/aya_dataset", "demographics") ``` ## Data Fields ### Human Annotations (Default) The data fields are the same among all splits: - `inputs`: Prompt or input to the language model. - `targets`: Completion or output of the language model. - `language`: The language of the `inputs` and `targets`. - `language_code`: The ISO code for the language of the `inputs` and `targets`. - `annotation_type`: The value denoting whether `inputs` and `targets` are 'original_annotations' or 're-annotations'. - `user_id`: Unique identifier of the annotator who submitted the prompt-completion pair. ### Demographics Data The data fields are the same among all splits: - `user_id`: Unique identifier of the annotator who submitted the prompt-completion pair. - `age_range`: Age of the annotator. Ranges from 0 to 121. - `gender`: Gender of the annotator. The values are 'male', 'female', 'prefer not to say', 'non-binary' and 'others'. - `languages`: List of languages spoken by the annotator. - `dialects`: Dialects reported by the annotator. Some empty values may be represented as 'null'. ## Data Splits ### Human Annotations (Default) The following are the splits of the data: | Split | No. of instances | Language Coverage | |-------|------------------|-------------------| | train | 202,364 | All | | test | 1,750 | 7 ('Standard Arabic', 'Yoruba', 'Turkish', 'English', 'Simplified Chinese', 'Portuguese', 'Telugu')| ### Demographics Data The following are the splits of the data: | Split | No. of Instances | |-------|------------------| | train | 1,456 | ## Data Instances ### Human Annotations (Default) An example of `train` looks as follows: ```json { "inputs": "What cultural events or festivals add vibrancy to Colombo's calendar...", "targets": "Colombo's cultural calendar is adorned with diverse events and festivals that celebrate the city's rich tapestry of traditions...", "language": "English", "language_code": "eng", "annotation_type": "original-annotations", "user_id": "f0ff69570af705b75c5a0851883e..." } ``` ### Demographics Data An example of `train` looks as follows: ```json { "user_id": "f0ff69570af705b75c5a0851883e...", "age_range": [ 25, 35 ], "gender": "female", "languages": [ "English", "Hausa" ], "dialects": [ "Hausa" ] } ``` ## Statistics ### Annotation Types The following is the breakdown of original annotations and re-annotations in the final dataset. | Type of Annotation | Instances | |--------------------|-----------| | Original Annotations | 138,844 | | Re-Annotations | 65,270 | | Total | 204,114| ### Languages The dataset covers 65 languages: 28 high-resource, 12 mid-resource, and 31 low-resource languages. The following is details about the languages, dialects & scripts included in the dataset. <details> <summary> Languages Info </summary> | ISO Code | Language | Resources | |----------|----------|-----------| | `amh` | Amharic | Low | | `arb`, `ary`, `ars`, `acq`, `arz` & `apc` | Arabic (Standard, Moroccan, Najdi, Ta'izzi-Adeni, Egyptian & South Levantine) | High | | `ben` | Bengali | Mid | | `ceb` | Cebuano | Mid | | `dan` | Danish | Mid | | `deu` | German | High | | `ell` | Greek | Mid | | `eng` | English | High | | `eus` | Basque | High | | `fil` | Filipino | Mid | | `fin` | Finnish | Mid | | `fra` | French | High | | `gle` | Irish | Low | | `guj` | Gujarati | Low | | `hat` | Haitian Creole | Low | | `hau` | Hausa | Low | | `hin` | Hindi | High | | `hun` | Hungarian | High | | `ibo` | Igbo | Low | | `ind` | Indonesian | Mid | | `ita` | Italian | High | | `jav` | Javanese | Low | | `jpn` | Japanese | High | | `kan` | Kannada | Low | | `kir` | Kyrgyz | Low | | `kor` | Korean | Mid | | `kur` | Kurdish | Low | | `lit` | Lithuanian | Mid | | `mal` | Malayalam | Low | | `mar` | Marathi | Low | | `mlg` | Malagasy | Low | | `msa` | Malay | Mid | | `mya` | Burmese | Low | | `nep` | Nepali | Low | | `nld` | Dutch | High | | `nso` | Northern Sotho | Low | | `nya` | Chichewa | Low | | `pan` | Punjabi | Low | | `pes` | Persian | High | | `pol` | Polish | High | | `por` | Portuguese | High | | `pus` | Pashto | Low | | `rus` | Russian | High | | `sin` | Sinhala | Low | | `sna` | Shona | Low | | `snd` | Sindhi | Low | | `som` | Somali | Low | | `spa` | Spanish | High | | `sqi` | Albanian | Low | | `srp` | Serbian | High | | `sun` | Sundanese | Low | | `swa` | Swahili | Low | | `swe` | Swedish | High | | `tam` | Tamil | Mid | | `tel` | Telugu | Low | | `tha` | Thai | Mid | | `tur` | Turkish | High | | `ukr` | Ukrainian | Mid | | `urd` | Urdu | Mid | | `vie` | Vietnamese | High | | `wol` | Wolof | Low | | `xho` | Xhosa | Low | | `yor` | Yorùbá | Low | | `zho` | Chinese (Traditional & Simplified) | High | | `zul` | Zulu | Low | </details> <br> # Motivations & Intentions - **Curation Rationale:** The curation effort employed an open-science approach to create a diverse instruction-style dataset through annotators across the globe that ensures comprehensive representation across all languages. The success of the curation effort, led by volunteers across diverse backgrounds, was significantly influenced by their hope to meaningfully bring NLP advancements to their languages. # Known Limitations - **Language and dialect coverage:** The dataset covers a limited fraction of the world's linguistic diversity, with 93% of languages not represented, facing challenges in distinguishing between languages and dialects, lacking coverage for many regional dialects, and excluding programming languages. - **Uneven distribution of contributions:** The dataset contains contributions in annotation activities, with a 'long tail' of annotators making only one or two contributions, leading to potential dataset imbalances across languages and a lack of diversity within certain language annotations. - **Cultural and Personal Bias:** In the dataset, certain languages have limited representation due to a few dominant annotators, potentially leading to a narrow viewpoint and skewed distribution of content, particularly towards certain domains like news. - **Gendered Pronouns:** Many of the languages in the Aya Dataset only contain pronouns that are explicitly gendered (e.g., Arabic) or that lack gender-neutral third-person pronouns for gender-neutral reference (e.g. Estonian). - **Formality Distinctions:** The dataset encompasses languages with diverse formality distinctions, involving honorifics and situational choices in pronoun use, reflecting varying levels of standardization influenced by regional, cultural, and identity factors. - **Toxic or Offensive Speech:** The Aya Annotation Platform lacked specific flags for toxic speech, relying on human verification and peer review to mitigate offensive content, but there's no guarantee that all potentially offensive data points were removed during the annotation process. - **Accounting for mislabeled data:** The Aya Annotation Platform lacks re-labeling capabilities, leading to potential mislabeled data in the Aya Dataset, including instances of incorrect language assignments and non-compliance with instruction-style formatting. # Additional Information ## Provenance - **Methods Used:** Crowd-sourced through volunteer annotations, followed by a quality assessment phase in which samples from the dataset were checked. - **Methodology Details:** - *Source:* Original annotations and edits of opensource NLP datasets - *Platform:* [Aya Annotation Platform](https://aya.for.ai/) - *Dates of Collection:* May 2023 - Dec 2023 ## Dataset Version and Maintenance - **Maintenance Status:** Actively Maintained - **Version Details:** - *Current version:* 1.0 - *Last Update:* 02/2024 - *First Release:* 02/2024 - **Maintenance Plan:** Updates will be periodically made available based on volunteer contributions. ## Authorship - **Publishing Organization:** [Cohere For AI](https://cohere.com/research) - **Industry Type:** Not-for-profit - Tech - **Contact Details:** https://aya.for.ai/ ## Licensing Information This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License. ## Citation Information ```bibtex @misc{singh2024aya, title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning}, author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker}, year={2024}, eprint={2402.06619}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```
提供机构:
Heng666
原始信息汇总

数据集概述

繁體中文 Aya (Traditional Chinese Aya Chinese;TCA):專注於繁體中文處理的 Aya 集合的精選子集

概述

繁體中文 Aya 是一個精心策劃的資料集,源自 CohereForAI 的綜合 Aya 集合,特別關注繁體中文文本資料。此資料集結合了來自 CohereForAI/aya_dataset,過濾掉除繁體中文、簡體中文內容之外的所有內容。

目標

繁體中文 Aya 的目標是為研究人員、技術專家和語言學家提供即用型繁體中文文本資源,顯著減少專注於繁體中文的 NLP 和 AI 專案中數據預處理所需的時間和精力。

資料集來源與資訊

  • 資料來源: 從 CohereForAI/aya_dataset 2 個子集而來。
  • 語言: 繁體中文、簡體中文(zho)
  • 應用: 非常適合語言建模、文本分類、情感分析、和機器翻譯等任務。
  • 論文連結: 2402.06619
  • 維護人: Heng666
  • License: Apache-2.0

使用方法

此資料集是開始繁體中文語言專案(從學術研究到商業應用)的基礎工具。透過提供預先過濾的繁體中文文本來源,繁體中文 Aya 讓研究人員、技術專家和開發人員能夠直接進行模型訓練、分析和應用程式開發,而無需進行資料清理和語言過濾的初步麻煩。

展示範例 python from datasets import load_dataset dataset = load_dataset("Heng666/Traditional_Chinese-aya_dataset", "default")

在上面的程式碼片段中,「aya_dataset」指的是原始 「aya_dataset」中「default」子集的繁體中文版本。您可以透過在載入資料集時指定其名稱來載入其他子集。

訪問和貢獻

可在 Heng666/Traditional_Chinese-aya_dataset 下的 Hugging Face Hub 上獲取,繁體中文 Aya 邀請社區做出貢獻。鼓勵用戶提供回饋、提出改進建議。

支持與合作

我們致力於圍繞繁體中文人工智慧和 NLP 研究創造一個包容和支持的環境。如需支援、協作或有關資料集的疑問,請透過 Hugging Face Hub 的討論部分進行聯絡。

搜集汇总
数据集介绍
main_image_url
构建方式
在自然语言处理领域,针对特定语言资源的构建往往需要从大规模多语言数据集中进行精细化筛选。本数据集源自CohereForAI发布的Aya多语言指令微调数据集,通过系统化过滤机制,从原始数据中保留了繁体中文与简体中文的文本内容。其构建过程依托于Aya标注平台的社区协作框架,采用人工标注与重标注相结合的方式,确保了数据质量与语言表达的丰富性。该过程不仅涉及语言代码的识别与提取,还整合了标注者的匿名化元数据,为后续研究提供了可追溯的语料来源。
特点
作为专注于繁体中文处理的精选资源,本数据集在语言覆盖上呈现出鲜明的针对性。它包含了4909条训练实例,每条数据均具备输入、输出、语言类型、标注类别及用户标识等多维度特征。数据集中融合了原始标注与重标注两种类型,既保留了人工创作的多样性,又通过后期修正提升了指令的规范性与适用性。此外,数据集遵循Apache 2.0许可协议,支持学术与商业用途,并附有完整的元数据描述,便于用户在多任务场景下进行跨语言对比与模型评估。
使用方法
在应用层面,本数据集为繁体中文自然语言处理任务提供了即用型基础资源。用户可通过Hugging Face的datasets库直接加载,使用默认配置即可获取全部训练数据。该数据集适用于语言建模、文本分类、机器翻译等多种任务,能够有效减少数据预处理环节的负担。研究人员可依据语言代码或标注类型进行子集筛选,亦可通过用户标识关联匿名化的人口统计信息,以探索语言使用中的社会文化维度。数据集的设计兼顾了易用性与扩展性,为后续模型微调与跨语言泛化研究奠定了数据基础。
背景与挑战
背景概述
在自然语言处理领域,多语言指令微调数据集对于推动语言模型在多样化语境下的应用至关重要。Heng666/Traditional_Chinese-aya_dataset作为Aya数据集家族的一个精选子集,由维护者Heng666于2024年基于CohereForAI发布的原始aya_dataset构建而成。该数据集专注于繁体中文文本处理,旨在为研究人员和技术专家提供高质量的繁体中文语言资源,以支持语言建模、文本分类及机器翻译等任务。其构建依托于开放科学社区的协作标注,体现了对低资源语言支持的学术关怀,并对促进中文自然语言处理技术的均衡发展具有显著影响力。
当前挑战
该数据集致力于解决多语言指令微调中的领域挑战,特别是在繁体中文语境下,模型需克服语言变体间的细微差异、文化特定表达的理解以及低资源语言数据稀缺等问题。在构建过程中,面临的主要挑战包括:从原始多语言数据中精确过滤与标注繁体中文内容,确保语言纯度和语境相关性;处理简繁体中文间的转换与一致性维护,避免语义失真;以及应对数据标注过程中可能存在的文化偏见与领域覆盖不均,需通过严谨的质量控制机制来保障数据集的代表性与可靠性。
常用场景
经典使用场景
在自然语言处理领域,繁体中文文本资源相对稀缺,Heng666/Traditional_Chinese-aya_dataset 的构建为相关研究提供了关键支持。该数据集最经典的使用场景在于作为指令微调的基础语料,专门用于训练和优化面向繁体中文的多语言大语言模型。研究人员能够直接利用其中高质量的人工标注提示-完成对,进行模型的有监督微调,从而提升模型在理解、生成繁体中文指令方面的能力。这种应用显著降低了从原始数据收集到清洗的预处理负担,使学术探索能更专注于模型架构与算法的创新。
衍生相关工作
围绕该数据集及其母集Aya Dataset,已衍生出一系列重要的研究工作。例如,基于Aya Collection进行的大规模多语言指令模板应用研究,探索了如何将任务范式高效迁移至百余种语言。同时,针对其构建的Aya Evaluation Suite为多语言开放生成任务设立了新的评估基准。这些经典工作共同深化了我们对多语言模型指令遵循能力、文化适应性以及评估方法论的理解,形成了从数据构建、模型训练到系统评估的完整研究脉络,持续推动着包容性人工智能的发展。
数据集最近研究
最新研究方向
在自然语言处理领域,繁体中文作为重要的语言变体,其资源稀缺性长期制约着相关模型的发展。Heng666/Traditional_Chinese-aya_dataset 的推出,恰好响应了多语言大模型对高质量繁体中文指令微调数据的迫切需求。当前研究前沿聚焦于利用此类精选数据集,提升模型在繁体中文场景下的指令遵循能力、跨语言迁移效率以及文化适应性。随着Aya倡议等开放科学运动的推进,该数据集促进了繁体中文与大湾区语言技术生态的融合,为学术与工业界提供了标准化评估基准,助力消弭数字语言鸿沟,推动包容性人工智能的发展。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作