
Omartificial-Intelligence-Space/FineWeb2-MSA

Hugging Face | Updated 2024-12-15 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/Omartificial-Intelligence-Space/FineWeb2-MSA
Resource description:
---
license: odc-by
language:
- ar
tags:
- arabicf
- fineweb
- MSA
pretty_name: FineWeb2 MSA
size_categories:
- 10M<n<100M
---

# FineWeb2 MSA Arabic

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/7QWU4U2orwaXAZGC3lWy0.png)

This is the Modern Standard Arabic (MSA) portion of the [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2#additional-information) dataset.

This dataset contains a rich collection of text in **MSA Arabic** (ISO 639-3: arb), the standardized variety of Arabic, which belongs to the Afro-Asiatic language family. With over **439 million words** and **1.4 million** documents, it serves as a valuable resource for NLP development and linguistic research focused on Arabic.

## Purpose of This Repository

This repository provides easy access to the **Arabic portion - MSA** of the extensive **FineWeb2** dataset. My primary goal is to make this valuable data more accessible and impactful for researchers, developers, and anyone working on **Arabic** natural language processing (NLP) projects.

By focusing on Arabic, I aim to:

- **Simplify Access**: Provide a direct and streamlined way to download the Arabic portion of the dataset without navigating through the larger collection.
- **Promote Research**: Enable more efficient use of Arabic text data for NLP, LLMs, and linguistic research.
- **Empower the Community**: Support Arabic language processing and contribute to the growth of multilingual NLP capabilities.
- **Encourage Collaboration**: Foster an environment where researchers and developers can build impactful applications using Arabic data.

## Credit to the Original Work

The dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/), with additional usage subject to CommonCrawl's Terms of Use.
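As an illustration of the "Simplify Access" goal, the following is a minimal sketch of reading this repository with the Hugging Face `datasets` library in streaming mode, so that nothing has to be downloaded in full. The split name `"train"` and the column name `"text"` are assumptions carried over from the upstream FineWeb2 layout and are not confirmed by this card; `stream_texts` and `count_words` are hypothetical helper names introduced only for this example.

```python
from typing import Iterable, Iterator

# Repository ID as given on this page.
REPO_ID = "Omartificial-Intelligence-Space/FineWeb2-MSA"


def stream_texts(repo_id: str = REPO_ID, split: str = "train",
                 text_column: str = "text") -> Iterator[str]:
    """Yield document texts one at a time without a full download.

    Assumptions (not confirmed by this card): the split is named "train"
    and documents are stored in a "text" column, as in upstream FineWeb2.
    """
    # Imported lazily so count_words() works without the library installed.
    from datasets import load_dataset  # pip install datasets

    for row in load_dataset(repo_id, split=split, streaming=True):
        yield row[text_column]


def count_words(texts: Iterable[str]) -> int:
    """Count whitespace-separated tokens, a rough proxy for 'words'."""
    return sum(len(text.split()) for text in texts)
```

With network access, something like `count_words(itertools.islice(stream_texts(), 1000))` would give a rough word count for the first 1,000 documents. Whitespace splitting is only an approximation for Arabic text, so the result need not match the word count quoted above.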
### Citation

If you use this dataset, please cite it as follows:

```bibtex
@software{penedo2024fineweb-2,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Sabolčec, Vinko and Messmer, Bettina and Foroutan, Negar and Jaggi, Martin and von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb2: A sparkling update with 1000s of languages},
  month = dec,
  year = 2024,
  doi = {10.57967/hf/3744},
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb-2}
}
```
Provided by:
Omartificial-Intelligence-Space