
riotu-lab/ArabicQA_2.1M

Source: Hugging Face. Updated: 2024-08-04. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/riotu-lab/ArabicQA_2.1M
Description:
This Arabic question-answering dataset has three features: question, answer, and context, all of string type. It provides a single training split of 2,141,146 samples (about 1.84 GB; the metadata reports 1842788736.123851 bytes). The data is a merge of several filtered source datasets: the originals totaled 4,731,600 rows, and filtering reduced them to 2,141,146 rows. Filtering removed rows with less than 65% Arabic text, normalized text containing diacritics and elongations, removed excessively long texts, and filtered out multiple-choice questions. The dataset is suited to fine-tuning models with a short context window.
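
The filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' actual pipeline: the character ranges used to decide what counts as Arabic, the way the 65% ratio is computed (Arabic letters over non-whitespace characters), the `max_chars` limit, and all function names are assumptions made for illustration. The multiple-choice filter is omitted because the listing does not describe its rule.

```python
import re

# Assumption: "Arabic text" means letters in the basic Arabic block,
# excluding the tatweel character U+0640.
ARABIC_LETTER = re.compile(r"[\u0621-\u063F\u0641-\u064A]")
# Arabic diacritics (tashkeel): U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character (kashida)

FIELDS = ("question", "answer", "context")


def normalize(text: str) -> str:
    """Remove diacritics and tatweel elongation from a string."""
    return DIACRITICS.sub("", text).replace(TATWEEL, "")


def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Arabic letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for c in chars if ARABIC_LETTER.match(c))
    return arabic / len(chars)


def keep_row(row: dict, min_ratio: float = 0.65, max_chars: int = 2000) -> bool:
    """Decide whether a row survives filtering.

    min_ratio mirrors the 65% threshold from the description;
    max_chars is a hypothetical length limit chosen for this sketch.
    """
    combined = " ".join(normalize(row[f]) for f in FIELDS)
    if len(combined) > max_chars:
        return False
    return arabic_ratio(combined) >= min_ratio


def clean_and_filter(rows: list) -> list:
    """Return normalized copies of the rows that pass keep_row."""
    kept = []
    for row in rows:
        if keep_row(row):
            kept.append({f: normalize(row[f]) for f in FIELDS})
    return kept
```

Calling `clean_and_filter` on a list of dictionaries with `question`, `answer`, and `context` keys returns only the rows that pass both checks, with diacritics and elongations stripped. The thresholds are parameters so they can be tuned to the context window of the model being fine-tuned.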
Provider:
riotu-lab