G1za/Arabic-Tweets

Name: G1za/Arabic-Tweets
Creator: G1za
Published: 2026-02-09 05:33:39
License: No description available

Source: Hugging Face. Updated 2026-02-09. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/G1za/Arabic-Tweets


Resource description:

---
license: cc-by-4.0
language:
- ar
---

# Dataset Card for Dataset Arabic-Tweets

## Dataset Description

- **Homepage:** https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus
- **Paper:** https://ieeexplore.ieee.org/document/10022652

### Dataset Summary

This dataset was collected from Twitter. It contains more than 41 GB of clean Arabic tweets, with nearly 4 billion Arabic words (12 million unique Arabic words).

### Languages

Arabic

### Source Data

Twitter

### Example of data loading using streaming:

```py
from datasets import load_dataset

# Stream the train split instead of downloading it in full
dataset = load_dataset("pain/Arabic-Tweets", split='train', streaming=True)
print(next(iter(dataset)))
```

### Example of data loading without streaming (the data will be downloaded locally):

```py
from datasets import load_dataset

# split='train' already returns the train split, so index it directly
dataset = load_dataset("pain/Arabic-Tweets", split='train')
print(dataset[0])
```

#### Initial Data Collection and Normalization

The collected data comprises 100 GB of raw Twitter data. Only tweets with Arabic characters were crawled. The new data was found to contain a large number of Persian tweets, as well as many Arabic words with repeated characters. For this reason, and to improve data efficiency, the raw data was processed as follows:

- hashtags, mentions, and links were removed;
- tweets that contain Persian characters, three consecutive identical characters, or a single-character word were dropped;
- Arabic letters were normalized.

This resulted in more than 41 GB of clean data with nearly 4 billion Arabic words (12 million unique Arabic words).

## Considerations for Using the Data

- This data was collected to create a language model. The tweets are published without any review of their content. Therefore, we are not responsible for the content of any tweet.

### Licensing Information

[Creative Commons Attribution](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@INPROCEEDINGS{10022652,
  author={Al-Fetyani, Mohammad and Al-Barham, Muhammad and Abandah, Gheith and Alsharkawi, Adham and Dawas, Maha},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  title={MASC: Massive Arabic Speech Corpus},
  year={2023},
  volume={},
  number={},
  pages={1006-1013},
  doi={10.1109/SLT54892.2023.10022652}}
```
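### Sketch of the cleaning steps (illustrative)

The card lists the cleaning steps under "Initial Data Collection and Normalization" but does not publish the code or the exact rules. The following is a minimal sketch of how such a filter might look, not the authors' implementation. Everything specific in it is an assumption: the function name `clean_tweet`, the regular expressions, the set of Persian-specific letters, the reading of "3 consecutive characters" as the same character repeated three times, and the choice of normalization (folding alef variants, stripping diacritics and tatweel).

```python
import re

# All patterns below are assumptions; the card does not specify them.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Letters common in Persian but not in standard Arabic (assumed set):
# peh, tcheh, jeh, gaf, keheh, Farsi yeh.
PERSIAN_RE = re.compile("[\u067E\u0686\u0698\u06AF\u06A9\u06CC]")

# The same non-space character three times in a row.
REPEAT_RE = re.compile(r"(\S)\1\1")

# Assumed normalization: strip diacritics and tatweel, fold alef variants.
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0640]")
ALEF_RE = re.compile("[\u0622\u0623\u0625]")


def clean_tweet(text):
    """Return the cleaned tweet, or None if the tweet should be dropped."""
    # Step 1: remove links, mentions, and hashtags.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = " ".join(text.split())
    if not text:
        return None

    # Step 2: drop tweets with Persian letters, a character repeated
    # three times in a row, or any single-character word.
    if PERSIAN_RE.search(text) or REPEAT_RE.search(text):
        return None
    if any(len(word) == 1 for word in text.split()):
        return None

    # Step 3: normalize the Arabic letters that remain.
    text = DIACRITICS_RE.sub("", text)
    text = ALEF_RE.sub("\u0627", text)
    return text
```

A function like this could be applied to each record of a streamed split, keeping only the records for which it returns a string. The order of the steps follows the order in which the card lists them; the original pipeline may differ.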


Provider:

G1za
