ImruQays/16-million-raw-arabic-words
Source: Hugging Face. Updated 2025-02-23. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/ImruQays/16-million-raw-arabic-words
Description:
This dataset contains 16,052,878 unique Arabic words extracted from a large text corpus drawn from the Shamela and Hindawi libraries. Each entry is a distinct Arabic word, with duplicates removed. Words that differ only in their diacritical marks are treated as distinct entries. The dataset is not fully cleaned and may include punctuation, symbols, numbers, and other non-alphanumeric characters. It is structured as one word per line.
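Because entries are diacritic-sensitive and not fully cleaned, a consumer may want to normalize them before use. The following is a minimal sketch of one possible approach, not a procedure described by the dataset authors: it strips Unicode combining marks, keeps only entries made up entirely of letters from the basic Arabic block (U+0600 to U+06FF), and removes duplicates that appear after stripping. The function names and the sample input are hypothetical illustrations and are not taken from the dataset.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Drop Unicode nonspacing marks (category Mn), which include Arabic harakat."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def is_arabic_letters_only(word: str) -> bool:
    """True if the word is non-empty and every character is a letter (category Lo)
    in the basic Arabic block U+0600..U+06FF. This is a deliberately strict,
    assumed filter; it rejects digits, punctuation, and tatweel (U+0640)."""
    return bool(word) and all(
        "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch) == "Lo"
        for ch in word
    )


def normalize_lines(lines):
    """Strip diacritics, filter non-letter entries, and de-duplicate, keeping order."""
    seen = set()
    result = []
    for line in lines:
        word = strip_diacritics(line.strip())
        if is_arabic_letters_only(word) and word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Hypothetical sample input (one word per line), written with escapes for clarity:
sample = [
    "\u0643\u0650\u062a\u064e\u0627\u0628",  # "kitab" with kasra and fatha
    "\u0643\u062a\u0627\u0628",              # "kitab" without diacritics
    "\u0643\u062a\u0627\u0628!",             # trailing punctuation, rejected
    "123",                                   # digits only, rejected
    "\u0642\u0644\u0645",                    # "qalam"
]
cleaned = normalize_lines(sample)
```

With a local copy of the word list, the same function could be applied to the lines of the opened file instead of `sample`. Whether stripping diacritics is appropriate depends on the task, since it collapses entries that the dataset intentionally keeps distinct.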
Provided by:
ImruQays



