Omartificial-Intelligence-Space/FineWeb2-MSA

Name: Omartificial-Intelligence-Space/FineWeb2-MSA
Creator: Omartificial-Intelligence-Space
Published: 2024-12-15 11:17:57
License: ODC-By v1.0

Hugging Face · Updated 2024-12-15 · Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/Omartificial-Intelligence-Space/FineWeb2-MSA
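For programmatic access rather than a browser download, the repository can be streamed with the Hugging Face `datasets` library. The following is a minimal sketch, not the author's documented workflow: the repository ID is taken from the link above, while the split name `train`, the record field `text`, and the helper names `stream_records` and `count_words` are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator

# Repository ID as it appears in the download link.
REPO_ID = "Omartificial-Intelligence-Space/FineWeb2-MSA"


def stream_records(repo_id: str = REPO_ID, split: str = "train") -> Iterator[dict]:
    """Lazily yield records from the Hub without downloading the full dataset.

    Assumption: the repository exposes a split named "train".
    """
    # Imported here so the helpers below work without `datasets` installed.
    from datasets import load_dataset  # pip install datasets

    return iter(load_dataset(repo_id, split=split, streaming=True))


def count_words(records: Iterable[dict], text_field: str = "text", limit: int = 1000) -> int:
    """Count whitespace-separated tokens in the first `limit` records.

    Assumption: each record stores its document body under `text_field`.
    """
    return sum(len(rec[text_field].split()) for rec in islice(records, limit))


# Example (requires network access):
#     print(count_words(stream_records(), limit=100))
```

Streaming avoids materializing the whole corpus on disk, which is convenient for a quick look at the data. Whitespace splitting is only a rough word count for Arabic text.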

Resource overview:

---
license: odc-by
language:
- ar
tags:
- arabicf
- fineweb
- MSA
pretty_name: FineWeb2 MSA
size_categories:
- 10M<n<100M
---

# FineWeb2 MSA Arabic

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/7QWU4U2orwaXAZGC3lWy0.png)

This is the Modern Standard Arabic (MSA) portion of the [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2#additional-information) dataset.

This dataset contains a rich collection of text in **MSA Arabic** (ISO 639-3: arb), the standardized variety of Arabic, which belongs to the Afro-Asiatic language family. With over **439 million words** and **1.4 million** documents, it serves as a valuable resource for NLP development and linguistic research focused on Modern Standard Arabic.

## Purpose of This Repository

This repository provides easy access to the **Arabic portion - MSA** of the extensive **FineWeb2** dataset. My primary goal is to make this valuable data more accessible and impactful for researchers, developers, and anyone working on **Arabic** natural language processing (NLP) projects.

By focusing on Arabic, I aim to:

- **Simplify Access**: Provide a direct and streamlined way to download the Arabic portion of the dataset without navigating through the larger collection.
- **Promote Research**: Enable more efficient use of Arabic text data for NLP, LLMs, and linguistic research.
- **Empower the Community**: Support Arabic language processing and contribute to the growth of multilingual NLP capabilities.
- **Encourage Collaboration**: Foster an environment where researchers and developers can build impactful applications using Arabic data.

## Credit to the Original Work

The dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/), with additional usage subject to CommonCrawl's Terms of Use.

### Citation

If you use this dataset, please cite it as follows:

```bibtex
@software{penedo2024fineweb-2,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Sabolčec, Vinko and Messmer, Bettina and Foroutan, Negar and Jaggi, Martin and von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb2: A sparkling update with 1000s of languages},
  month = dec,
  year = 2024,
  doi = {10.57967/hf/3744},
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb-2}
}
```

Provided by:

Omartificial-Intelligence-Space
