codersan/Persian-Wikipedia-Corpus
Source: Hugging Face. Updated: 2024-11-21. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/codersan/Persian-Wikipedia-Corpus
Resource description:
This dataset is derived from the Persian Wikipedia Corpus project, which contains articles parsed from the Persian Wikipedia. The original data has been converted into a more accessible format and published through the Hugging Face datasets library for use in modern NLP pipelines. It contains 1,160,676 useful articles. Each record includes the article's unique ID, title, entity type, importance rank, Wikipedia namespace, redirect list, a flag marking disambiguation pages, target link count, infobox, text content, internal links, and parent category links. The dataset is particularly suited to NLP tasks such as text generation, text retrieval, and text classification.
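The description says the corpus is available through the Hugging Face datasets library, so a minimal Python sketch of loading and filtering it may be useful. The repository ID is taken from the download link. The split name "train" and the column names "is_disambiguation" and "text" are assumptions made for illustration and should be checked against the dataset card before use.

```python
from typing import Any, Dict, Iterable, Iterator

# Repository ID as it appears in the download link.
DATASET_ID = "codersan/Persian-Wikipedia-Corpus"


def stream_articles(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Stream records from the Hub instead of downloading the whole corpus.

    The split name "train" is an assumption; check the dataset card.
    """
    # Imported lazily so the filter below works without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def content_articles(
    records: Iterable[Dict[str, Any]],
    flag_key: str = "is_disambiguation",  # hypothetical column name
    text_key: str = "text",               # hypothetical column name
) -> Iterator[Dict[str, Any]]:
    """Yield records that are not disambiguation pages and have non-empty text."""
    for record in records:
        if record.get(flag_key):
            continue  # skip disambiguation pages
        if not (record.get(text_key) or "").strip():
            continue  # skip records with missing or blank text
        yield record
```

A typical use would be `for article in content_articles(stream_articles()): ...`, which keeps memory use low because records are read one at a time. Pass different `flag_key` and `text_key` values if the actual column names differ.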
Provider:
codersan



