chuuhtetnaing/myanmar-wikipedia-dataset

Name: chuuhtetnaing/myanmar-wikipedia-dataset
Creator: chuuhtetnaing
Published: 2025-03-26 07:27:33
License: No description available

Source: Hugging Face. Updated: 2025-03-26. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/chuuhtetnaing/myanmar-wikipedia-dataset

Description:

The Myanmar Wikipedia Dataset is a collection of Myanmar Wikipedia articles scraped by following the site's category structure, starting from the main entry page. Unlike the official Wikimedia dataset, it offers an alternative view of Myanmar Wikipedia content organized by category path. Each record includes the article title, full text content, associated category tags, unique URL, last-modified timestamp, complete hierarchical category path, category path count, a list of individual segmented words and phrases from the title and content, and a list of unique words and phrases containing only Myanmar characters (with some special characters). The dataset is particularly useful for research on Myanmar-language NLP tasks, for building Myanmar-language models that benefit from categorical understanding, and for topic modeling of Myanmar-language content.
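The "unique words and phrases containing only Myanmar characters" column can be illustrated with a minimal sketch. The dataset's actual filtering rule is not documented here, so this example assumes that "Myanmar characters" means the Unicode Myanmar block (U+1000 to U+109F) plus the Myanmar Extended-A and Extended-B blocks. It leaves out the unspecified "special characters" allowance. The function name `unique_myanmar_tokens` and the sample tokens are hypothetical.

```python
import re

# Assumption: a token counts as "Myanmar-only" when every character falls in
# the Myanmar block (U+1000-U+109F), Myanmar Extended-B (U+A9E0-U+A9FF),
# or Myanmar Extended-A (U+AA60-U+AA7F). The dataset's real rule may differ.
MYANMAR_ONLY = re.compile(r"[\u1000-\u109F\uA9E0-\uA9FF\uAA60-\uAA7F]+")


def unique_myanmar_tokens(tokens):
    """Return unique Myanmar-only tokens, preserving first-seen order."""
    seen = set()
    result = []
    for token in tokens:
        # fullmatch rejects empty strings and tokens with any other script.
        if token not in seen and MYANMAR_ONLY.fullmatch(token):
            seen.add(token)
            result.append(token)
    return result


# Hypothetical segmented tokens mixing Myanmar, Latin, and digits.
sample = ["မြန်မာ", "Wikipedia", "စာ", "မြန်မာ", "2025", "ဘာသာစကား"]
filtered = unique_myanmar_tokens(sample)
print(filtered)
```

Applied to the sample, the Latin-script and numeric tokens are dropped and the repeated token is kept once. A similar filter could be run over the segmented-word list of each record to approximate the Myanmar-only column.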

Provided by:

chuuhtetnaing
