ai4bharat/Pralekha

Name: ai4bharat/Pralekha
Creator: ai4bharat
Published: 2026-01-20 11:19:49
License: No description available

Source: Hugging Face. Updated: 2026-01-20. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/ai4bharat/Pralekha

Description:

PRALEKHA is a large-scale benchmark for evaluating document-level alignment techniques. It includes 2M+ documents covering 11 Indic languages and English, described as a balanced mix of aligned and unaligned pairs; the listing's summary gives the ratio of the aligned to the unaligned portion as 1:2. The dataset spans two broad domains, news bulletins and podcast scripts, offering both written and spoken forms of data. All the data is human-written or human-verified, ensuring high quality. Each record carries a unique document identifier, a language code, and the text content.
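The record layout described above (identifier, language code, text) can be sketched as a small Python type. This is only an illustration: the field names `doc_id`, `lang`, and `text`, the language codes, and the sample records are assumptions made up for the example and are not taken from the dataset itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str  # hypothetical name for the unique document identifier
    lang: str    # hypothetical name for the language code
    text: str    # hypothetical name for the textual content


def count_by_language(docs):
    """Return a Counter mapping each language code to its document count."""
    return Counter(doc.lang for doc in docs)


# Toy records invented for illustration; not real dataset entries.
sample = [
    Document("d1", "eng", "A short news bulletin."),
    Document("d1", "hin", "एक छोटा समाचार बुलेटिन।"),
    Document("d2", "eng", "A podcast script excerpt."),
]
counts = count_by_language(sample)
```

A per-language count like this is a quick sanity check before running any alignment method, since coverage across the 12 languages may differ.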

Provider:

ai4bharat
