
sabilmakbar/indo_wiki

Source: Hugging Face
Updated: 2023-11-03
Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/sabilmakbar/indo_wiki
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
language:
- ace
- ban
- bjn
- bug
- gor
- id
- jv
- mis
- min
- ms
- nia
- su
- tet
license:
- cc-by-sa-3.0
- gfdl
multilinguality:
- multilingual
source_datasets:
- Wikipedia-HF
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
pretty_name: Wikipedia Archive for Indonesian Languages & Local Languages
tags:
- Wikipedia
- Indonesian
- Sundanese
- Javanese
- Malay
- Dialect
- Javanese Dialect (Banyumase/Ngapak)
- Indonesian Language
- Malay Language
- Indonesia-related Languages
- Indonesian Local Languages
dataset_info:
- config_name: indowiki_all
  features:
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: ace
    num_bytes: 4875688
    num_examples: 12932
  - name: ban
    num_bytes: 17561379
    num_examples: 20243
  - name: bjn
    num_bytes: 6669628
    num_examples: 10460
  - name: bug
    num_bytes: 3297641
    num_examples: 15877
  - name: gor
    num_bytes: 6007726
    num_examples: 14572
  - name: id
    num_bytes: 1103106307
    num_examples: 657990
  - name: jv
    num_bytes: 70335030
    num_examples: 73150
  - name: map_bms
    num_bytes: 5215803
    num_examples: 13574
  - name: min
    num_bytes: 116481049
    num_examples: 227024
  - name: ms
    num_bytes: 416001194
    num_examples: 367463
  - name: nia
    num_bytes: 1938378
    num_examples: 1651
  - name: su
    num_bytes: 47489084
    num_examples: 61557
  - name: tet
    num_bytes: 1452716
    num_examples: 1465
  download_size: 1803193334
  dataset_size: 1800431623
- config_name: indowiki_dedup_all
  features:
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: ace
    num_bytes: 4867838
    num_examples: 12904
  - name: ban
    num_bytes: 17366080
    num_examples: 19837
  - name: bjn
    num_bytes: 6655378
    num_examples: 10437
  - name: bug
    num_bytes: 2072609
    num_examples: 9793
  - name: gor
    num_bytes: 5989252
    num_examples: 14514
  - name: id
    num_bytes: 1100932403
    num_examples: 654287
  - name: jv
    num_bytes: 69774853
    num_examples: 72667
  - name: map_bms
    num_bytes: 5060989
    num_examples: 11832
  - name: min
    num_bytes: 116376870
    num_examples: 225858
  - name: ms
    num_bytes: 410443550
    num_examples: 346186
  - name: nia
    num_bytes: 1938121
    num_examples: 1650
  - name: su
    num_bytes: 47410439
    num_examples: 61494
  - name: tet
    num_bytes: 1447926
    num_examples: 1460
  download_size: 1793103024
  dataset_size: 1790336308
- config_name: indowiki_dedup_id_only
  features:
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1100932403
    num_examples: 654287
  download_size: 1103131493
  dataset_size: 1100932403
---

# **Indonesian Wikipedia Data Repository**

Welcome to the Indonesian Wikipedia Data Repository. The datasets are extracted from [Wikipedia HF](https://huggingface.co/datasets/wikipedia) and processed using the scripts available in this repository, for reproducibility.

# **FAQs**
### Which languages are available in the dataset?
Please check the following table.
| Lang Code | Language | Wiki Info | Total Articles | Total Size (bytes) |
| :---: | :----: | :--- | ---: | ---: |
| ace | Acehnese | [Wiki Link](https://en.wikipedia.org/wiki/Acehnese_language) | 12904 | 4867838 |
| ban | Balinese | [Wiki Link](https://en.wikipedia.org/wiki/Balinese_language) | 19837 | 17366080 |
| bjn | Banjarese | [Wiki Link](https://en.wikipedia.org/wiki/Banjarese_language) | 10437 | 6655378 |
| bug | Buginese | [Wiki Link](https://en.wikipedia.org/wiki/Buginese_language) | 9793 | 2072609 |
| gor | Gorontalo | [Wiki Link](https://en.wikipedia.org/wiki/Gorontalo_language) | 14514 | 5989252 |
| id | Indonesian | [Wiki Link](https://en.wikipedia.org/wiki/Indonesian_language) | 654287 | 1100932403 |
| jv | Javanese | [Wiki Link](https://en.wikipedia.org/wiki/Javanese_language) | 72667 | 69774853 |
| map_bms | Banyumasan <br />(Dialect of Javanese) | [Wiki Link](https://en.wikipedia.org/wiki/Banyumasan_dialect) | 11832 | 5060989 |
| min | Minangkabau | [Wiki Link](https://en.wikipedia.org/wiki/Minangkabau_language) | 225858 | 116376870 |
| ms | Malay | [Wiki Link](https://en.wikipedia.org/wiki/Malay_language) | 346186 | 410443550 |
| nia | Nias | [Wiki Link](https://en.wikipedia.org/wiki/Nias_language) | 1650 | 1938121 |
| su | Sundanese | [Wiki Link](https://en.wikipedia.org/wiki/Sundanese_language) | 61494 | 47410439 |
| tet | Tetum | [Wiki Link](https://en.wikipedia.org/wiki/Tetum_language) | 1460 | 1447926 |

### How do I extract a new Wikipedia dataset of Indonesian languages?
See the script [_```extract_raw_wiki_data.py```_](https://huggingface.co/datasets/sabilmakbar/indo_wiki/blob/main/extract_raw_wiki_data.py) to understand its implementation, or adapt the shell script provided in [_```extract_raw_wiki_data_indo.sh```_](https://huggingface.co/datasets/sabilmakbar/indo_wiki/blob/main/extract_raw_wiki_data_indo.sh) to run the extraction on your own. Note that the extraction can be extended to any language of your choice.
### Where can I find the latest available dumps and the languages they cover?
Visit the [Wikipedia Dump Index](https://dumps.wikimedia.org/backup-index.html) to check the latest available data, and the [Wikipedia Language Coverage](https://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles) list to find the code of any language you want to extract.

### How is the data preprocessed? What makes it different from loading it directly from Wikipedia HF?
The data available here is processed with the following flow:
1. Raw data is deduplicated on ```title``` and ```text``` (the text content of a given article) to remove articles containing boilerplate text (template text typically used when no information is available or to ask for contributions to that article), which is usually deemed noisy for NLP data.
2. Furthermore, the ```title``` and ```text``` data are checked for string-matching duplication (duplication of the text after preprocessing, i.e. symbols removed, HTML tags stripped, or ASCII chars validated). See the [```cleanse_wiki_data.py```](https://huggingface.co/datasets/sabilmakbar/indo_wiki/blob/main/cleanse_wiki_data.py) script to understand its implementation.

# Getting Started #
### To read the datasets directly ###
Use one of the following code chunks to load the data from the Hugging Face Hub.

You can pass the config name as the second argument:
```
from datasets import load_dataset

dataset = load_dataset(
    "sabilmakbar/indo_wiki",
    "indowiki_dedup_all"  # a config name; can also be "indowiki_all" or "indowiki_dedup_id_only"; defaults to "indowiki_dedup_all"
)
```
Or you can provide both ```lang``` and ```date_stamp``` (providing only one will throw an error):
```
from datasets import load_dataset

dataset = load_dataset(
    "sabilmakbar/indo_wiki",
    lang="id",  # see the splits for the complete list of lang choices
    date_stamp="20230901"
)
```

### To replicate the whole dataset generation process ###
1. Set up a new Python/Conda environment (recommended Python version: 3.9.6 to 3.9.18 or 3.10.0 to 3.10.13) and install the requirements of this codebase from ```requirements.txt``` via ```pip install -r requirements.txt```.
2. Activate the Python/Conda environment in which the requirements are installed.
3. Run this ```sh``` script to extract the data from the Wikimedia dump: ```sh extract_raw_wiki_data_indo.sh```.
4. Run this ```sh``` script for deduplication: ```sh dedup_raw_wiki_data_indo.sh```.

## Citation Info:
```
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"}
@ONLINE{wikipedia-hf,
    title  = "Huggingface Wikipedia Dataset",
    url    = "https://huggingface.co/datasets/wikipedia"}
```
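
The two deduplication passes described in the preprocessing FAQ can be sketched roughly as follows. This is a minimal illustration only, not the code in ```cleanse_wiki_data.py```: the helper names `normalize`, `drop_duplicates` and `dedup_articles` are hypothetical, the normalization rules are a guess at "symbols removed, HTML tags stripped", and keeping the first occurrence of each duplicate is an assumption (the real script may drop duplicates differently).

```python
import html
import re


def normalize(text: str) -> str:
    """Hypothetical normalization: strip HTML tags and symbols, fold case and whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip().casefold()


def drop_duplicates(rows, key):
    """Keep only the first row for each distinct key value (assumed policy)."""
    seen, kept = set(), []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


def dedup_articles(rows):
    # Pass 1: exact duplicates on title, then on text.
    rows = drop_duplicates(rows, key=lambda r: r["title"])
    rows = drop_duplicates(rows, key=lambda r: r["text"])
    # Pass 2: duplicates that only show up after normalization.
    rows = drop_duplicates(rows, key=lambda r: normalize(r["title"]))
    rows = drop_duplicates(rows, key=lambda r: normalize(r["text"]))
    return rows


sample = [
    {"url": "u1", "title": "Jakarta", "text": "Jakarta is a city."},
    {"url": "u2", "title": "Jakarta", "text": "Other text"},                 # duplicate title
    {"url": "u3", "title": "Stub A", "text": "This article is a stub."},
    {"url": "u4", "title": "Stub B", "text": "This article is a stub."},     # duplicate boilerplate text
    {"url": "u5", "title": "Bandung", "text": "<p>Jakarta is a CITY!</p>"},  # duplicate after normalization
]
print([row["url"] for row in dedup_articles(sample)])  # prints ['u1', 'u3']
```

Each row uses the same three fields as the dataset (```url```, ```title```, ```text```), so the same functions could in principle be applied to rows read from any of the splits.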
Provider:
sabilmakbar
Summary of original information

Dataset overview

Basic information

  • Dataset name: Wikipedia Archive for Indonesian Languages & Local Languages
  • Tags: Wikipedia, Indonesian, Sundanese, Javanese, Malay, Dialect, Javanese Dialect (Banyumase/Ngapak), Indonesian Language, Malay Language, Indonesia-related Languages, Indonesian Local Languages
  • License: cc-by-sa-3.0, gfdl
  • Multilinguality: multilingual
  • Source datasets: Wikipedia-HF
  • Task categories: text-generation, fill-mask
  • Task IDs: language-modeling, masked-language-modeling

Language information

  • Supported languages: ace, ban, bjn, bug, gor, id, jv, mis, min, ms, nia, su, tet

Dataset configurations

Configuration: indowiki_all

  • Features:
    • url: string
    • title: string
    • text: string
  • Splits:
    • ace: 12932 examples, 4875688 bytes
    • ban: 20243 examples, 17561379 bytes
    • bjn: 10460 examples, 6669628 bytes
    • bug: 15877 examples, 3297641 bytes
    • gor: 14572 examples, 6007726 bytes
    • id: 657990 examples, 1103106307 bytes
    • jv: 73150 examples, 70335030 bytes
    • map_bms: 13574 examples, 5215803 bytes
    • min: 227024 examples, 116481049 bytes
    • ms: 367463 examples, 416001194 bytes
    • nia: 1651 examples, 1938378 bytes
    • su: 61557 examples, 47489084 bytes
    • tet: 1465 examples, 1452716 bytes
  • Download size: 1803193334 bytes
  • Dataset size: 1800431623 bytes

Configuration: indowiki_dedup_all

  • Features:
    • url: string
    • title: string
    • text: string
  • Splits:
    • ace: 12904 examples, 4867838 bytes
    • ban: 19837 examples, 17366080 bytes
    • bjn: 10437 examples, 6655378 bytes
    • bug: 9793 examples, 2072609 bytes
    • gor: 14514 examples, 5989252 bytes
    • id: 654287 examples, 1100932403 bytes
    • jv: 72667 examples, 69774853 bytes
    • map_bms: 11832 examples, 5060989 bytes
    • min: 225858 examples, 116376870 bytes
    • ms: 346186 examples, 410443550 bytes
    • nia: 1650 examples, 1938121 bytes
    • su: 61494 examples, 47410439 bytes
    • tet: 1460 examples, 1447926 bytes
  • Download size: 1793103024 bytes
  • Dataset size: 1790336308 bytes

Configuration: indowiki_dedup_id_only

  • Features:
    • url: string
    • title: string
    • text: string
  • Splits:
    • train: 654287 examples, 1100932403 bytes
  • Download size: 1103131493 bytes
  • Dataset size: 1100932403 bytes
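
As a quick consistency check on the figures above, the per-split numbers of indowiki_all and indowiki_dedup_all can be compared directly. The sketch below only re-uses the counts listed in this page; the names `RAW`, `DEDUP` and `removed_fraction` are illustrative and not part of the dataset or its scripts.

```python
# (num_examples, num_bytes) per split, copied from the configuration listings.
RAW = {
    "ace": (12932, 4875688),
    "ban": (20243, 17561379),
    "bjn": (10460, 6669628),
    "bug": (15877, 3297641),
    "gor": (14572, 6007726),
    "id": (657990, 1103106307),
    "jv": (73150, 70335030),
    "map_bms": (13574, 5215803),
    "min": (227024, 116481049),
    "ms": (367463, 416001194),
    "nia": (1651, 1938378),
    "su": (61557, 47489084),
    "tet": (1465, 1452716),
}
DEDUP = {
    "ace": (12904, 4867838),
    "ban": (19837, 17366080),
    "bjn": (10437, 6655378),
    "bug": (9793, 2072609),
    "gor": (14514, 5989252),
    "id": (654287, 1100932403),
    "jv": (72667, 69774853),
    "map_bms": (11832, 5060989),
    "min": (225858, 116376870),
    "ms": (346186, 410443550),
    "nia": (1650, 1938121),
    "su": (61494, 47410439),
    "tet": (1460, 1447926),
}


def removed_fraction(lang: str) -> float:
    """Share of articles in a split that deduplication removed."""
    raw_count = RAW[lang][0]
    return (raw_count - DEDUP[lang][0]) / raw_count


# The per-split byte counts add up to the reported dataset sizes.
assert sum(size for _, size in RAW.values()) == 1800431623
assert sum(size for _, size in DEDUP.values()) == 1790336308

# List the splits from most to least affected by deduplication.
for lang in sorted(RAW, key=removed_fraction, reverse=True):
    removed = RAW[lang][0] - DEDUP[lang][0]
    print(f"{lang:8s} removed {removed:6d} articles ({removed_fraction(lang):.1%})")
```

On these numbers the largest relative reduction is in bug (15877 to 9793 articles, about 38%), followed by map_bms and ms; most other splits lose well under 1% of their articles.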