Hindi and Marathi Question-Answering Dataset

arXiv | Updated 2024-02-17 | Indexed 2024-08-06
Download link:
http://arxiv.org/abs/2308.09862v3
Resource description:
This study presents a large-scale question-answering (QA) dataset for two low-resource Indian languages, Hindi and Marathi. The dataset contains 28,000 samples and is intended to ease the data scarcity that hampers building effective QA systems for these languages. It was created by translating the SQuAD 2.0 dataset into Hindi and Marathi, and it is suited to natural language understanding (NLU) and machine learning applications, particularly those serving Hindi- and Marathi-speaking communities. During construction, the research team used a novel method to determine the exact index of each answer within its context, which supports the quality and usability of the dataset.
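
The description does not spell out how answer indices were recovered after translation. As a rough illustration of the problem only, and not the authors' method, the sketch below assumes a simple fallback strategy: try an exact substring match first, then slide a window of whitespace-delimited tokens over the context and keep the most similar span. The function name `find_answer_span` and the `min_ratio` threshold are hypothetical choices made for this example.

```python
import re
from difflib import SequenceMatcher


def find_answer_span(context, answer, min_ratio=0.6):
    """Return (start, end) character offsets of the span in `context`
    that best matches `answer`, or None if nothing is similar enough.

    Hypothetical helper for illustration; not the dataset authors' method.
    """
    if not answer:
        return None

    # Cheap path: the translated answer appears verbatim in the context.
    exact = context.find(answer)
    if exact != -1:
        return exact, exact + len(answer)

    # Character spans of every whitespace-delimited token in the context.
    spans = [m.span() for m in re.finditer(r"\S+", context)]
    n = max(1, len(answer.split()))

    best, best_ratio = None, min_ratio
    # Try windows one token shorter, equal to, and one token longer than the answer.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for i in range(len(spans) - size + 1):
            start, end = spans[i][0], spans[i + size - 1][1]
            ratio = SequenceMatcher(None, context[start:end], answer).ratio()
            if ratio > best_ratio:
                best, best_ratio = (start, end), ratio
    return best


if __name__ == "__main__":
    ctx = "दिल्ली भारत की राजधानी है"
    span = find_answer_span(ctx, "भारत")
    print(span, ctx[span[0]:span[1]])
```

A SQuAD-style record stores the answer text together with its start offset in the context, so a translated answer that no longer appears verbatim needs some realignment step of this kind. Returning `None` for low-similarity cases lets such samples be dropped or reviewed rather than stored with a wrong index.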
Provided by:
University of Southern California
Created:
2023-08-19