KaifengGGG/WenYanWen_English_Parallel

Name: KaifengGGG/WenYanWen_English_Parallel
Creator: KaifengGGG
Published: 2024-05-03 18:54:50
License: No description provided

Hugging Face | Updated 2024-05-03 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/KaifengGGG/WenYanWen_English_Parallel

Resource description:

---
license: mit
dataset_info:
- config_name: default
  features:
  - name: info
    dtype: string
  - name: modern
    dtype: string
  - name: classical
    dtype: string
  - name: english
    dtype: string
  splits:
  - name: train
    num_bytes: 366918005
    num_examples: 972467
  download_size: 256443222
  dataset_size: 366918005
- config_name: gemini-augmented
  features:
  - name: info
    dtype: string
  - name: modern
    dtype: string
  - name: classical
    dtype: string
  - name: english
    dtype: string
  - name: text
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 11142831.6
    num_examples: 9000
  - name: test
    num_bytes: 1238092.4
    num_examples: 1000
  download_size: 7541863
  dataset_size: 12380924
- config_name: instruct
  features:
  - name: info
    dtype: string
  - name: modern
    dtype: string
  - name: classical
    dtype: string
  - name: english
    dtype: string
  - name: text
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 9876880
    num_examples: 9000
  - name: test
    num_bytes: 1104403
    num_examples: 1000
  download_size: 6887847
  dataset_size: 10981283
- config_name: instruct-augmented
  features:
  - name: info
    dtype: string
  - name: modern
    dtype: string
  - name: classical
    dtype: string
  - name: english
    dtype: string
  - name: text
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 11171774
    num_examples: 9000
  - name: test
    num_bytes: 1209150
    num_examples: 1000
  download_size: 7561715
  dataset_size: 12380924
- config_name: instruct-large
  features:
  - name: info
    dtype: string
  - name: modern
    dtype: string
  - name: classical
    dtype: string
  - name: english
    dtype: string
  - name: text
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 1072811901.9168397
    num_examples: 970727
  - name: test
    num_bytes: 1104403
    num_examples: 1000
  download_size: 673287243
  dataset_size: 1073916304.9168396
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: instruct
  data_files:
  - split: train
    path: instruct/train-*
  - split: test
    path: instruct/test-*
- config_name: instruct-augmented
  data_files:
  - split: train
    path: instruct-augmented/train-*
  - split: test
    path: instruct-augmented/test-*
- config_name: instruct-large
  data_files:
  - split: train
    path: instruct-large/train-*
  - split: test
    path: instruct-large/test-*
task_categories:
- translation
- question-answering
language:
- zh
- en
size_categories:
- 100K<n<1M
---

# **Dataset Card for WenYanWen\_English\_Parallel**

## **Dataset Summary**

The WenYanWen\_English\_Parallel dataset is a multilingual parallel corpus in Classical Chinese (Wenyanwen), modern Chinese, and English. The Classical Chinese and modern Chinese parts are sourced from the NiuTrans/Classical-Modern dataset, while the corresponding English translations are generated using Gemini Pro.

## **Data Fields**

- `info`: A string representing the title or source information of the text.
- `classical`: Classical Chinese (Wenyanwen) text corresponding to the modern text.
- `modern`: A string containing the translation of the original Classical Chinese text into modern Chinese.
- `english`: English translation of the Chinese text.
- `text`: Instruction/answer pair in string format.
- `messages`: Instruction/answer pair in conversation format:
  - `content`: String representing the content of a message.
  - `role`: String representing the role associated with the message (e.g., system, assistant, user).

Here is an example of a dataset entry:

| Field | Type | Example value |
|-----------|--------|---------------|
| info | string | 《辽史·列传·卷二十八》 |
| modern | string | 乾统三年，徙封为秦国公。 |
| classical | string | 乾统三年，徙封秦国。 |
| english | string | In the third year of the Qingtong Era, he was re-enfeoffed as Prince of the Qin State. |
| text | string | `<s>`[INST] 将以下现代汉语文本改写为文言文: 乾统三年，徙封为秦国公。 [/INST] 乾统三年，徙封秦国。`</s>` |
| messages | list | [{"content": "将以下现代汉语文本改写为文言文: 乾统三年，徙封为秦国公。", "role": "user"}, {"content": "乾统三年，徙封秦国。", "role": "assistant"}] |

## **Dataset Structure**

The dataset consists of four subsets: `default`, `instruct`, `instruct-augmented`, and `instruct-large`.

- `default` is a parallel translation dataset.
- `instruct` serves as an instruction-tuning dataset and consists of prompt/answer pairs created from a 10,000-sample subset of the `default` dataset.
- `instruct-augmented` is similar to `instruct`, except that the prompt/answer pairs have been augmented by Gemini Pro. (Detailed information can be found in our dataset generation code on [Github](https://github.com/Kaifeng-Gao/WenYanWen_English_Parallel/tree/main))
- `instruct-large` is an expanded version of `instruct` that includes all samples from the `default` dataset.

### **Default**

| `info` | `modern` | `classical` | `english` |
|--------|----------|-------------|-----------|
| string | string | string | string |

| Split | Examples |
|-------|----------|
| Train | 972,467 |

### **Instruct**

| `info` | `modern` | `classical` | `english` | `text` | `messages` |
|--------|----------|-------------|-----------|--------|------------|
| string | string | string | string | string | list of {`content`: string, `role`: string} |

| Split | Examples |
|-------|----------|
| Train | 9,000 |
| Test | 1,000 |

### **Instruct-Augmented**

| `info` | `modern` | `classical` | `english` | `text` | `messages` |
|--------|----------|-------------|-----------|--------|------------|
| string | string | string | string | string | list of {`content`: string, `role`: string} |

| Split | Examples |
|-------|----------|
| Train | 9,000 |
| Test | 1,000 |

### **Instruct-Large**

| `info` | `modern` | `classical` | `english` | `text` | `messages` |
|--------|----------|-------------|-----------|--------|------------|
| string | string | string | string | string | list of {`content`: string, `role`: string} |

| Split | Examples |
|-------|----------|
| Train | 875,214 |
| Test | 97,246 |

## **Supported Tasks and Leaderboard**

This dataset can be used for various multilingual and translation tasks, including but not limited to:

1. Neural Machine Translation (Classical Chinese to Modern Chinese)
2. Neural Machine Translation (Modern Chinese to English)
3. Neural Machine Translation (Classical Chinese to English)
4. Multilingual Text-to-Text Transfer

There is currently no official leaderboard for this dataset.

## **License**

Please refer to the license of the [NiuTrans/Classical-Modern](https://github.com/NiuTrans/Classical-Modern) dataset and the terms of use of Gemini Pro for more information regarding the dataset license.

## **Citation Information**

If you use this dataset in your research, please cite the original sources:

1. [NiuTrans/Classical-Modern](https://github.com/NiuTrans/Classical-Modern)
2. [Gemini Pro](https://arxiv.org/abs/2403.05530)

## **Potential Bias**

Since the English translations are generated using Gemini Pro, there might be inconsistencies or errors in the translations, which may introduce bias into the dataset. Additionally, the choice of Classical Chinese texts and their modern Chinese translations may also introduce bias. Finally, the use of a single translation tool for the English translations may result in limited linguistic diversity.

## **Potential Social Impact**

This dataset can be used for various multilingual and translation tasks, which can have a positive impact on facilitating cross-cultural communication and understanding. However, it is important to be aware of the potential biases in the dataset and to use the dataset responsibly. Additionally, as with any dataset, it is important to consider the ethical implications of using this dataset, including issues related to data privacy, consent, and representation.
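The relationship between the `messages` and `text` fields can be illustrated with a small sketch. It reproduces the one example entry shown in the card; the helper names `build_messages` and `build_text` are hypothetical, and it is an assumption that every row uses this exact instruction wording and `[INST]` template (the authors' generation code on GitHub is the authoritative source).

```python
from typing import Dict, List

# Instruction copied from the example entry in the card. Whether all rows
# share this wording is an assumption made for illustration.
INSTRUCTION = "将以下现代汉语文本改写为文言文: "


def build_messages(modern: str, classical: str) -> List[Dict[str, str]]:
    """Return the conversation-style pair: user prompt, assistant answer."""
    return [
        {"content": INSTRUCTION + modern, "role": "user"},
        {"content": classical, "role": "assistant"},
    ]


def build_text(messages: List[Dict[str, str]]) -> str:
    """Flatten a two-turn conversation into the [INST] string format."""
    user, assistant = messages
    return f"<s>[INST] {user['content']} [/INST] {assistant['content']}</s>"


row = {"modern": "乾统三年，徙封为秦国公。", "classical": "乾统三年，徙封秦国。"}
messages = build_messages(row["modern"], row["classical"])
text = build_text(messages)
print(text)
```

The `messages` form suits chat-template-based fine-tuning, while `text` is already rendered for a model that expects the `[INST]` format.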

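Provided by:

KaifengGGG

Summary of original information

Dataset overview: WenYanWen_English_Parallel

Dataset summary

The WenYanWen_English_Parallel dataset is a multilingual parallel corpus covering Classical Chinese (Wenyanwen), modern Chinese, and English. The Classical Chinese and modern Chinese parts come from the NiuTrans/Classical-Modern dataset, and the corresponding English translations were generated by Gemini Pro.

Data fields

info: String giving the title or source information of the text.
classical: Classical Chinese (Wenyanwen) text corresponding to the modern text.
modern: String containing the modern Chinese translation of the original Classical Chinese text.
english: English translation of the Chinese text.
text: Instruction/answer pair in string format.
messages: Instruction/answer pair in conversation format, containing:
- content: String giving the content of the message.
- role: String giving the role associated with the message (e.g., system, assistant, user).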

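Dataset structure

The dataset contains four subsets: default, instruct, instruct-augmented, and instruct-large.

Default

Fields: info, modern, classical, english
Type: string
Training set: 972,467 samples

Instruct

Fields: info, modern, classical, english, text, messages
Types: string; messages is a list of {content, role}
Training set: 9,000 samples
Test set: 1,000 samples

Instruct-Augmented

Fields: info, modern, classical, english, text, messages
Types: string; messages is a list of {content, role}
Training set: 9,000 samples
Test set: 1,000 samples

Instruct-Large

Fields: info, modern, classical, english, text, messages
Types: string; messages is a list of {content, role}
Training set: 875,214 samples
Test set: 97,246 samples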
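
As a quick arithmetic check, the split sizes listed above can be turned into totals and train/test shares. This is only a sketch over the numbers printed on this page; the `SPLITS` table and the `summarize` helper are illustrative names, not part of the dataset.

```python
from typing import Dict, Tuple

# Split sizes exactly as listed in the per-subset tables above.
SPLITS: Dict[str, Dict[str, int]] = {
    "default": {"train": 972_467},
    "instruct": {"train": 9_000, "test": 1_000},
    "instruct-augmented": {"train": 9_000, "test": 1_000},
    "instruct-large": {"train": 875_214, "test": 97_246},
}


def summarize(splits: Dict[str, int]) -> Tuple[int, Dict[str, float]]:
    """Return the total example count and each split's share of it."""
    total = sum(splits.values())
    return total, {name: count / total for name, count in splits.items()}


for config, splits in SPLITS.items():
    total, shares = summarize(splits)
    pretty = ", ".join(f"{name}={share:.1%}" for name, share in shares.items())
    print(f"{config}: {total:,} examples ({pretty})")
```

With these numbers, the instruct subsets and Instruct-Large all come out as 90/10 train/test splits. Instruct-Large sums to 972,460, seven fewer than the 972,467 examples in Default, and the card's metadata block lists different Instruct-Large sizes (970,727 train, 1,000 test), so it is worth checking the split sizes of the loaded data directly.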

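Supported tasks

The dataset can be used for a range of multilingual and translation tasks, including but not limited to:

Neural machine translation (Classical Chinese to modern Chinese)
Neural machine translation (modern Chinese to English)
Neural machine translation (Classical Chinese to English)
Multilingual text-to-text transfer

There is currently no official leaderboard for this dataset.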
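
The three translation directions listed above map directly onto pairs of fields in a row. The sketch below shows one way to derive source/target pairs; `DIRECTIONS` and `make_pair` are hypothetical names introduced for illustration, and the sample row is the example entry from the dataset card.

```python
from typing import Dict, Tuple

# (source field, target field) for each translation direction listed above.
DIRECTIONS: Dict[str, Tuple[str, str]] = {
    "classical-to-modern": ("classical", "modern"),
    "modern-to-english": ("modern", "english"),
    "classical-to-english": ("classical", "english"),
}


def make_pair(row: Dict[str, str], direction: str) -> Dict[str, str]:
    """Pick the source and target text of one row for a given direction."""
    source_field, target_field = DIRECTIONS[direction]
    return {"source": row[source_field], "target": row[target_field]}


# Example entry from the dataset card.
ROW = {
    "info": "《辽史·列传·卷二十八》",
    "modern": "乾统三年，徙封为秦国公。",
    "classical": "乾统三年，徙封秦国。",
    "english": "In the third year of the Qingtong Era, he was re-enfeoffed "
               "as Prince of the Qin State.",
}

for name in DIRECTIONS:
    print(name, make_pair(ROW, name))
```

Because every row carries all three languages, the same row can feed any of the three directions without additional alignment.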

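License

For the dataset license, refer to the license of the NiuTrans/Classical-Modern dataset and the terms of use of Gemini Pro.

Citation information

If you use this dataset in your research, please cite the original sources:

NiuTrans/Classical-Modern
Gemini Pro

Potential bias

Because the English translations were generated by Gemini Pro, they may contain inconsistencies or errors, which can introduce bias into the dataset. The selection of Classical Chinese texts and their modern Chinese translations may also introduce bias. Finally, relying on a single translation tool for the English side may limit linguistic diversity.

Potential social impact

The dataset can be used for a range of multilingual and translation tasks and can help promote cross-cultural communication and understanding. However, it is important to be aware of the potential biases in the dataset and to use it responsibly. As with any dataset, the ethical implications of its use, including data privacy, consent, and representation, should also be considered.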

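Collected summary

Dataset introduction

Construction

In cross-lingual computational linguistics, high-quality parallel corpora are key to advancing research on machine translation and the understanding of classical texts. This dataset builds on the NiuTrans/Classical-Modern dataset: aligned Classical Chinese and modern Chinese texts are extracted from it, and the Gemini Pro model then generates the corresponding English translations automatically, yielding a trilingual parallel corpus of Classical Chinese, modern Chinese, and English. The construction combines an established existing resource with a large language model for cross-lingual extension, with the aim of reliable sourcing and accurate alignment.

Features

The dataset's core feature is its multi-level, multi-configuration structure. Besides the basic parallel translation data (the default configuration), it provides variants built for instruction tuning: instruct, instruct-augmented, and instruct-large. These variants contain the original trilingual text plus structured instruction/answer pairs, presented either as a text string (text) or as a conversation list (messages), which simplifies training large language models for dialogue or instruction following. The dataset is large: the base configuration contains nearly one million samples, while the instruction variants range from thousands to hundreds of thousands of samples, covering the data needs of different research scenarios.

Usage

The dataset mainly serves research and applications in machine translation and multilingual text generation. Configurations can be loaded directly through the Hugging Face datasets library. The 'default' configuration suits conventional neural machine translation training in any direction among Classical Chinese, modern Chinese, and English. The 'instruct' family of configurations is designed for training or evaluating large language models that understand and carry out specific instructions (such as 'rewrite modern Chinese as Classical Chinese'). The explicit train/test splits give a standard basis for model training and evaluation. When using the dataset, note the field differences between configurations and choose either the text fields (modern, classical, english) or the instruction-pair fields (text, messages) to fit the target task.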
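
A minimal sketch of loading one configuration with the Hugging Face `datasets` library follows. The repository ID and the config and split names are taken from the dataset card's `configs` section; the wrapper `load_subset` and the `CONFIG_SPLITS` table are hypothetical helpers added here so that a mistyped name fails before any download starts.

```python
from typing import Dict, Tuple

REPO_ID = "KaifengGGG/WenYanWen_English_Parallel"

# Configs and their splits as declared in the card's `configs` section.
CONFIG_SPLITS: Dict[str, Tuple[str, ...]] = {
    "default": ("train",),
    "instruct": ("train", "test"),
    "instruct-augmented": ("train", "test"),
    "instruct-large": ("train", "test"),
}


def load_subset(config: str = "default", split: str = "train"):
    """Load one split of one config, validating the names first."""
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the table above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_subset("instruct", "test")` would fetch the 1,000-example instruct test split, after which the `text` or `messages` column can be passed to a fine-tuning pipeline. Network access and an installed `datasets` package are required for the actual download.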

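Background and challenges

Background overview

In natural language processing, converting between Classical Chinese and modern languages is a highly challenging task: it involves substantial differences in grammar and vocabulary and carries deep cultural content. The KaifengGGG/WenYanWen_English_Parallel dataset, built by Kaifeng Gao, aims to provide a resource for multilingual parallel translation and instruction tuning across Classical Chinese, modern Chinese, and English. Its central concern is bridging the gap between ancient and modern language: it combines the Classical and modern Chinese pairs from the NiuTrans/Classical-Modern dataset with English translations generated by Gemini Pro, producing a large corpus of nearly one million samples. The dataset is positioned to support research in machine translation, cross-lingual understanding, and the digitization of cultural heritage, and offers a data foundation for studying language change and promoting cross-cultural communication.

Current challenges

The dataset targets the problem of translating between Classical Chinese and modern languages. The core difficulty lies in Classical Chinese's terse grammar, semantic ambiguity, and cultural specificity, which make automatic translation prone to ambiguity and information loss. Construction faces several technical difficulties. First, relying on a single model, Gemini Pro, for the English translations may make the translation style uniform and introduce the model's inherent biases. Second, the scope and quality of the original Classical Chinese material directly affect the dataset's representativeness and accuracy, and coverage of historical texts may be uneven. Third, converting the parallel corpus into instruction-tuning format requires the prompt and answer to stay semantically consistent; this step involves data augmentation and manual checking, which places high demands on the rigor of the construction process.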

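Common scenarios

Typical use case

At the intersection of classical literature and modern language processing, the dataset provides training material for machine translation models. Its typical use is as a multilingual parallel corpus across Classical Chinese, modern Chinese, and English, particularly for training neural networks that understand and convert Classical Chinese text. With large-scale aligned trilingual text, models can learn the semantic mapping from Classical Chinese to modern Chinese as well as the cross-lingual conversion from Chinese to English, providing a data foundation for modern interpretation and international dissemination of ancient texts.

Derived and related work

Research building on the dataset centers on multilingual neural machine translation and instruction-tuned models. Its parallel corpus has been used to train Transformer models specialized for Classical Chinese translation, improving conversion accuracy on classical texts. Its instruction-format data has also motivated dialogue systems for question answering over ancient texts and the adaptation of pretrained models to this specialized domain. This work extends the dataset's academic value and offers reproducible baselines for later cross-lingual models of historical text.

Recent research on the dataset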