A Chinese-Tibetan Anchor-Enhanced Parallel Corpus – CT-AEPC
Source: DataCite Commons | Updated: 2026-04-30 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=c916746d1b0440d4a0f8af7e4a9a2cfe
Resource description:
The Chinese-Tibetan Anchor-Enhanced Parallel Corpus (CT-AEPC) is a data resource for Chinese-Tibetan machine translation, cross-lingual retrieval, retrieval-augmented generation (RAG), and low-resource language representation learning.

The original Chinese text was collected by a custom-built crawler from multiple publicly available web pages. Based on the publication dates of the traceable source pages, the text mainly covers the period from 2011 to 2022. Its content spans news and information, open government information, education and exams, financial technology, public services, and encyclopedic entries. The dataset was constructed through web page text extraction, noise cleaning, Chinese sentence segment filtering, Tibetan candidate translation generation, manual proofreading, anchor extraction and normalization, deduplication, sensitive content filtering, and renumbering.

The current version contains 26,672 Chinese-Tibetan anchor-enhanced parallel text records, with a data volume of approximately 50.81 MB, in JSONL format. Each record consists of four fields: qid, query, positive, and meta. Here qid is the sample ID, query is the Chinese sentence segment, positive is the proofread Tibetan counterpart, and meta holds metadata such as the original anchors, the normalized anchors, the Chinese and Tibetan text lengths, and the length buckets. Anchor types include date, time, number, person, location, and organization.

The dataset can be used for research tasks such as Chinese-Tibetan machine translation, cross-lingual semantic retrieval, RAG positive sample construction, low-resource language embedding model training, reranking model evaluation, and consistency analysis of key factual information.
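As a rough illustration of the record layout described above, the following minimal Python sketch reads JSONL lines and checks that each record carries the four documented top-level fields (qid, query, positive, meta). The function names `iter_records` and `to_pairs` are hypothetical, as is the file name in the usage note. The internal key names of meta are not given in the description, so the sketch deliberately leaves meta untouched.

```python
import json
from typing import Iterable, Iterator, List, Tuple

# The four top-level fields named in the dataset description.
REQUIRED_FIELDS = ("qid", "query", "positive", "meta")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line.

    Raises ValueError if a record lacks any of the documented fields.
    The contents of `meta` are not inspected, because its key names
    are an unknown here.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def to_pairs(records: Iterable[dict]) -> List[Tuple[str, str]]:
    """Collect (Chinese segment, Tibetan text) pairs, e.g. for MT training."""
    return [(rec["query"], rec["positive"]) for rec in records]
```

Assuming the download has been saved locally under a name such as `ct_aepc.jsonl` (a placeholder, not the official file name), the pairs could be loaded with `with open("ct_aepc.jsonl", encoding="utf-8") as f: pairs = to_pairs(iter_records(f))`. Opening the file as UTF-8 matters because both the Chinese and the Tibetan text are non-ASCII.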
Provider:
Science Data Bank
Created:
2026-04-30



