ai4bharat/Pralekha
Source: Hugging Face. Updated 2026-01-20; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/ai4bharat/Pralekha
Description:
PRALEKHA is a large-scale benchmark for evaluating document-level alignment techniques. It contains more than 2 million documents covering 11 Indic languages and English, split into aligned and unaligned sets in a 1:2 ratio. The dataset spans two broad domains, news bulletins and podcast scripts, offering both written and spoken forms of data. All data is human-written or human-verified, ensuring high quality. Each record carries a unique document identifier, a language code, and the document text.
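The three features named in the description (identifier, language code, text) suggest a simple record shape. The following is a minimal sketch of how such records might be indexed by language before any alignment step. The field names `doc_id`, `lang`, and `text`, the language codes, and the sample records are all assumptions made for illustration; they are not taken from the dataset itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str  # hypothetical field name for the unique identifier
    lang: str    # hypothetical field name for the language code
    text: str    # hypothetical field name for the document body


def index_by_language(docs):
    """Group documents by language code, keyed by document identifier."""
    index = defaultdict(dict)
    for doc in docs:
        index[doc.lang][doc.doc_id] = doc
    return index


# Made-up sample records and language codes, not taken from the dataset.
sample = [
    Document("doc-1", "eng", "Sample English text."),
    Document("doc-1", "hin", "Sample Hindi text."),
    Document("doc-2", "eng", "Another English text."),
]
index = index_by_language(sample)
```

With an index like this, the number of documents per language is `len(index[lang])`, and a lookup such as `index["hin"].get("doc-1")` returns a record or `None`. Whether identifiers are actually shared across languages in PRALEKHA is not stated in the description, so treat that part of the sample as hypothetical.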
Provided by:
ai4bharat



