AADIMIND/sona-corpus

Name: AADIMIND/sona-corpus
Creator: AADIMIND
Published: 2025-09-09 18:25:23
License: MIT (per the dataset description; the listing's own license field is empty)

Source: Hugging Face. Updated 2025-09-09; indexed 2025-10-25.

Download link:

https://hf-mirror.com/datasets/AADIMIND/sona-corpus


Description:

The SONA CORPUS is a parallel dataset of Hindi and English text pairs in which each input is noisy text and each target is its cleaned or corrected version. It is intended for tasks such as text cleaning, grammatical error correction (GEC), OCR post-processing, and sequence-to-sequence (Seq2Seq) fine-tuning. The data is derived from Hindi Wikipedia (HiWiki) and processed to include artificial noise and OCR-like errors. It contains 581,312 examples, with each input and each target limited to 256 tokens. The dataset is released under the MIT license and split into 95% training data and 5% validation data.
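As a rough illustration of the figures above, the following Python sketch estimates the split sizes implied by 581,312 examples and a 95%/5% split, and shows one way the corpus might be loaded from the Hugging Face Hub. The helper names `split_sizes` and `load_sona_corpus` are hypothetical. The exact per-split counts and the split and column names are not given in this listing, so the computed sizes are approximations and the loader makes no assumptions beyond the repository ID.

```python
TOTAL_EXAMPLES = 581_312   # figure taken from the description
TRAIN_FRACTION = 0.95      # 95% training, 5% validation


def split_sizes(total: int, train_fraction: float) -> tuple[int, int]:
    """Estimate (train, validation) sizes for a given total and fraction.

    This is plain arithmetic; the publisher's actual split may differ
    by a few examples depending on how rounding was handled.
    """
    train = round(total * train_fraction)
    return train, total - train


def load_sona_corpus():
    """Load the corpus from the Hugging Face Hub (needs network access).

    Requires the `datasets` package. Split and column names are not
    stated in this listing, so inspect the returned object first.
    """
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset("AADIMIND/sona-corpus")


if __name__ == "__main__":
    train, validation = split_sizes(TOTAL_EXAMPLES, TRAIN_FRACTION)
    print(train, validation)  # 552246 29066
```

Calling `load_sona_corpus()` and printing the result should list the available splits and their columns, which is the safest way to confirm the real field names before fine-tuning.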

Provider:

AADIMIND
