Ancient-Modern Chinese parallel corpus

Name: Ancient-Modern Chinese parallel corpus
Creator: Sichuan University
Published: 2019-05-09 12:46:01
License: No description available

Source: arXiv. Updated: 2019-05-09. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/1808.03738v2

Resource description:

The Ancient-Modern Chinese parallel corpus created in this study is the first large-scale, high-quality parallel corpus of ancient and modern Chinese, developed by a research team at Sichuan University. It contains approximately 1.24 million ancient-modern Chinese sentence pairs, drawn from ancient historical records and articles written by prominent figures. The corpus was built through text crawling, data cleaning, paragraph alignment, and clause alignment. It is intended primarily for research on automatic translation between ancient and modern Chinese, and aims to address the scarcity of parallel resources for this machine translation task.

Provider:

Sichuan University

Created:

2018-08-11

Collection summary

Dataset introduction