zehaohhhuang/Chinese-VietnameseTextAlignment

Name: zehaohhhuang/Chinese-VietnameseTextAlignment
Creator: zehaohhhuang
Published: 2026-03-19 14:24:36
License: No description available

Source: Hugging Face. Updated: 2026-03-19. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/zehaohhhuang/Chinese-VietnameseTextAlignment

Resource description:

Zh-Vi Nhan Dan Document Dataset

Overview

This dataset is a real-world news document dataset specifically constructed for the Chinese-Vietnamese (Zh-Vi) low-resource language pair. The data is sourced from the bilingual official website of Vietnam's mainstream media, Nhan Dan (People's Newspaper). The dataset aims to provide high-quality raw testing and training data for research in Cross-lingual Document Alignment, Parallel Corpus Mining, and low-resource Neural Machine Translation (NMT).

This dataset is also the official companion data source for the paper "A Chinese-Vietnamese Document-Level Text Alignment Method Based on Anchor Strategy and Deep Iterative Mining". Based on this document collection, we proposed a hierarchical alignment framework integrating Document-Level Semantic Anchors and Deep Iterative Mining, successfully extracting high-purity Zh-Vi parallel sentence pairs.

Dataset Statistics

This dataset contains raw multilingual news documents without manual intervention. It fully preserves the complex challenges present in real-world news compilation, such as length asymmetry and "Semantic Collapse" (where documents share similar topics but are not direct translations). The specific document counts are as follows, which are strictly consistent with the experimental setup in our paper:

🇨🇳 Chinese Documents: 3,036
🇻🇳 Vietnamese Documents: 9,688

Data Structure

The data is provided in standard .jsonl (or .csv) format, making it easy to load and process. Each data record represents a complete news document.
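As an illustration of the record-per-document layout described above, the following is a minimal sketch for reading the .jsonl variant. It assumes one JSON object per line, which is the usual JSONL convention. The file name in the usage comment is hypothetical, and no field names are assumed because the description does not list them.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(path) -> Iterator[dict]:
    """Yield one news document (a dict) per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                # Skip blank lines, which some exports leave at the end.
                continue
            yield json.loads(line)


def count_documents(path) -> int:
    """Count the documents in a JSONL file without loading them all at once."""
    return sum(1 for _ in iter_documents(path))


# Usage (hypothetical file name, not confirmed by the dataset description):
# for doc in iter_documents("zh_documents.jsonl"):
#     print(sorted(doc.keys()))  # inspect which fields each record carries
#     break
```

Streaming line by line keeps memory use flat, and printing the keys of the first record is a quick way to discover the actual field names before writing any alignment code.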

Provider:

zehaohhhuang
