TokenHaven/FineWeb-Edu-Arabic
Source: Hugging Face | Updated: 2025-07-30 | Indexed: 2025-11-30
Download link:
https://hf-mirror.com/datasets/TokenHaven/FineWeb-Edu-Arabic
Description:
This dataset contains a large collection of high-quality Arabic text data with accompanying metadata. It was created by filtering all `English` Common Crawl data for high-quality text using the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), keeping documents with an educational score of 4 or higher out of 5. The data is sourced from `v1.0.0` of the [HuggingFaceFW/fineweb-edu](https://huggingface.co/HuggingFaceFW/fineweb-edu) dataset, which corresponds to `CC-MAIN-2024-10` from Common Crawl. The data was also fully deduplicated and labeled for Topic and Format using the [WebOrganizer Classifiers](https://huggingface.co/WebOrganizer), after which only documents in specific formats were kept (list below). All documents were then translated from English to Arabic using the [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) LLM. In the same pass, the LLM removed web-scraping artifacts and reformatted the output text as markdown (adding headings, lists, or other formatting elements to improve readability), so that the text is clean and high quality. The LLM was also used to generate a title for any document that did not have one.
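The selection steps described above (score threshold, deduplication, format filter) can be illustrated with a minimal sketch. This is not the authors' pipeline: the field names `edu_score`, `format`, and `text`, the format labels in `KEPT_FORMATS`, and the use of exact SHA-256 hashing for deduplication are all assumptions made for illustration, since the description does not specify them.

```python
import hashlib
from typing import Iterable, Iterator

# Threshold stated in the description: educational score of 4 or higher out of 5.
MIN_EDU_SCORE = 4

# Hypothetical placeholder labels; the actual list of kept formats is not given here.
KEPT_FORMATS = {"Tutorial", "Knowledge Article"}


def filter_documents(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield documents passing the score, exact-duplicate, and format checks."""
    seen_hashes = set()
    for doc in docs:
        # Step 1: keep only documents at or above the score threshold.
        if doc["edu_score"] < MIN_EDU_SCORE:
            continue
        # Step 2: drop exact duplicates (an assumed stand-in for the real dedup method).
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Step 3: keep only documents whose format label is in the allowed set.
        if doc["format"] not in KEPT_FORMATS:
            continue
        yield doc


sample = [
    {"id": "a", "edu_score": 4, "format": "Tutorial", "text": "Intro to algebra"},
    {"id": "b", "edu_score": 3, "format": "Tutorial", "text": "Celebrity news"},
    {"id": "c", "edu_score": 5, "format": "Tutorial", "text": "Intro to algebra"},
    {"id": "d", "edu_score": 5, "format": "Spam / Ads", "text": "Buy now"},
    {"id": "e", "edu_score": 5, "format": "Knowledge Article", "text": "Photosynthesis basics"},
]

kept = list(filter_documents(sample))
print([doc["id"] for doc in kept])  # prints ['a', 'e']
```

In this toy input, `b` fails the score threshold, `c` is an exact duplicate of `a`, and `d` has a format outside the allowed set. The translation and markdown reformatting step is not sketched, because it depends on an LLM call whose prompt and settings are not described in this section.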
Provided by:
TokenHaven



