IndicHG
Source: arXiv, indexed 2025-09-30
Download link:
https://huggingface.co/datasets/ai4bharat/IndicHeadlineGeneration/tree/main/data
Description:
IndicHG is a multilingual dataset that has been evaluated for headline generation, but it has serious quality problems: data duplication and data contamination. It contains a large number of duplicate data pairs and contaminated instances, which undermine the accuracy of evaluation metrics. The dataset comprises 1.31 million data pairs.
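As a rough illustration of how the duplication and contamination issues described above could be checked, the following is a minimal sketch, not the method used in the original evaluation. It assumes each record is a dictionary with hypothetical `article` and `headline` keys (the real field names in the dataset files may differ) and that the data has already been split into training and test lists. It uses exact matching after whitespace normalization, so near-duplicates would not be caught.

```python
from collections import Counter


def normalize(text):
    # Collapse runs of whitespace and lowercase, so trivially
    # different copies of the same text compare equal.
    return " ".join(text.split()).lower()


def pair_key(record):
    # "article" and "headline" are assumed field names (hypothetical).
    return (normalize(record["article"]), normalize(record["headline"]))


def count_duplicate_pairs(records):
    # Number of redundant copies: every occurrence beyond the first.
    counts = Counter(pair_key(r) for r in records)
    return sum(c - 1 for c in counts.values() if c > 1)


def contaminated_test_pairs(train, test):
    # Test records whose normalized pair also appears in the training set.
    train_keys = {pair_key(r) for r in train}
    return [r for r in test if pair_key(r) in train_keys]
```

Running `count_duplicate_pairs` on each split and `contaminated_test_pairs(train, test)` across splits gives a first estimate of how much of the data would need to be removed before evaluation scores can be trusted.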



