
zehaohhhuang/Chinese-VietnameseTextAlignment

Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/zehaohhhuang/Chinese-VietnameseTextAlignment
Resource description:
Zh-Vi Nhan Dan Document Dataset

Overview

This dataset is a real-world news document dataset specifically constructed for the Chinese-Vietnamese (Zh-Vi) low-resource language pair. The data is sourced from the bilingual official website of Vietnam's mainstream media, Nhan Dan (People's Newspaper). The dataset aims to provide high-quality raw testing and training data for research in Cross-lingual Document Alignment, Parallel Corpus Mining, and low-resource Neural Machine Translation (NMT).

This dataset is also the official companion data source for the paper "A Chinese-Vietnamese Document-Level Text Alignment Method Based on Anchor Strategy and Deep Iterative Mining". Based on this document collection, we proposed a hierarchical alignment framework integrating Document-Level Semantic Anchors and Deep Iterative Mining, successfully extracting high-purity Zh-Vi parallel sentence pairs.

Dataset Statistics

This dataset contains raw multilingual news documents without manual intervention. It fully preserves the complex challenges present in real-world news compilation, such as length asymmetry and "Semantic Collapse" (where documents share similar topics but are not direct translations). The specific document counts are as follows, which are strictly consistent with the experimental setup in our paper:

🇨🇳 Chinese Documents: 3,036
🇻🇳 Vietnamese Documents: 9,688

Data Structure

The data is provided in standard .jsonl (or .csv) format, making it easy to load and process. Each data record represents a complete news document.
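Since each record is described as one complete news document in a .jsonl file, loading can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the authors' loader. The function names `iter_jsonl` and `load_documents` are hypothetical, and the sketch makes no assumption about which fields a record contains, because the description does not list them.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str | Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line instead of failing silently.
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}"
                ) from exc


def load_documents(path: str | Path) -> list[dict]:
    """Read every record into memory; each record is one news document."""
    return list(iter_jsonl(path))
```

Calling `load_documents` on a downloaded .jsonl file (the actual file names in the repository are not stated here) returns a list of dictionaries. Comparing `len(...)` of the result for each language against the document counts given in Dataset Statistics is a quick sanity check that a download is complete.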
Provider:
zehaohhhuang