
akhooli/fineweb2_ar_24_sample_tok

Source: Hugging Face | Updated: 2025-01-10 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/akhooli/fineweb2_ar_24_sample_tok
Description:

This dataset is an Arabic sample extracted from the FineWeb2 Arabic subset (arb_Arab), which is supposed to be standard Arabic. The sample contains around 2.3 million rows. First, the whole subset (57.8M rows) was scanned, and rows were kept if over 95% of their words were Arabic. The 2.3M-row sample was then drawn at random from this mostly Arabic data. Note that language_score is not an accurate measure. The sample also does not exclude slang, dialects, or inappropriate content: no row was edited and all columns were kept. The main purpose of this dataset is educational, and I hope it helps researchers design and develop pre-processing for the main FineWeb2 dataset (or any other Arabic corpus).
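The filter-then-sample procedure described above can be sketched roughly as follows. This is only an illustration, not the author's actual script: the definition of an "Arabic word" (every character falls in the basic Arabic Unicode block U+0600 to U+06FF), the `text` column name, and the helper names `arabic_word_ratio` and `filter_and_sample` are all assumptions made for the example.

```python
import random
import re

# Assumption: a word counts as Arabic only if all of its characters lie in
# the basic Arabic Unicode block. Words with attached Latin punctuation or
# digits therefore count as non-Arabic under this simple rule.
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")


def arabic_word_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated words that are Arabic."""
    words = text.split()
    if not words:
        return 0.0
    arabic = sum(1 for word in words if ARABIC_WORD.fullmatch(word))
    return arabic / len(words)


def filter_and_sample(rows, threshold=0.95, sample_size=3, seed=0):
    """Keep rows whose Arabic word ratio exceeds `threshold`, then draw a
    random sample (without replacement) from the rows that were kept."""
    # Assumption: each row is a dict with its document body under "text".
    kept = [row for row in rows if arabic_word_ratio(row["text"]) > threshold]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(kept, min(sample_size, len(kept)))


if __name__ == "__main__":
    demo_rows = [
        {"text": "مرحبا بالعالم"},
        {"text": "hello مرحبا"},
        {"text": "هذا نص عربي"},
    ]
    for row in filter_and_sample(demo_rows, sample_size=2):
        print(row["text"])
```

All other columns of a row pass through untouched, which matches the note that no row was edited. A stricter or looser word test (for example, stripping punctuation first) would change which rows clear the 95% threshold.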
Provider:
akhooli