pythainlp/thai_wikipedia_clean_20230101
Source: Hugging Face. Updated 2023-05-10; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pythainlp/thai_wikipedia_clean_20230101
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 686139541
    num_examples: 1436054
  download_size: 260540997
  dataset_size: 686139541
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- th
---
# Dataset Card for "thai_wikipedia_clean_20230101"
Thai Wikipedia database dumps converted to plain text for NLP work.
This dataset was dumped on 1 January 2023 from [Thai Wikipedia](https://th.wikipedia.org).
- GitHub: [PyThaiNLP / ThaiWiki-clean](https://github.com/PyThaiNLP/ThaiWiki-clean)
- Notebook for upload to HF: [https://github.com/PyThaiNLP/ThaiWiki-clean/blob/main/thai_wikipedia_clean_20230101_hf.ipynb](https://github.com/PyThaiNLP/ThaiWiki-clean/blob/main/thai_wikipedia_clean_20230101_hf.ipynb)
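
As a minimal sketch of how the card's single `text` feature and `train` split might be read, the snippet below uses the Hugging Face `datasets` library. The helper names `load_thai_wikipedia` and `preview` are hypothetical and not part of the dataset or of PyThaiNLP; streaming is assumed here only to avoid downloading the full archive up front.

```python
from itertools import islice

# Dataset identifier as given on this card.
DATASET_ID = "pythainlp/thai_wikipedia_clean_20230101"


def load_thai_wikipedia(streaming: bool = True):
    """Open the 'train' split (hypothetical helper, needs network access).

    The import is kept local so the rest of this module can be used
    without the `datasets` package installed.
    """
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=streaming)


def preview(rows, n: int = 3, width: int = 80):
    """Return the first `n` values of the 'text' field, cut to `width` characters."""
    return [row["text"][:width] for row in islice(rows, n)]


if __name__ == "__main__":
    # Prints the beginning of the first few records.
    for snippet in preview(load_thai_wikipedia()):
        print(snippet)
```

`preview` accepts any iterable of dictionaries with a `text` key, so it can also be tried on a small in-memory list before touching the real data.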
Provided by:
pythainlp
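Summary of original information
Dataset overview
Basic information
- Dataset name: thai_wikipedia_clean_20230101
- License: cc-by-sa-3.0
Data features
- Feature name: text
- Data type: string
Data splits
- Training set
  - Number of examples: 1436054
  - Storage size: 686139541 bytes
Download information
- Download size: 260540997 bytes
- Total dataset size: 686139541 bytes
Task categories
- text-generation
Language
- th (Thai)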
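
The size figures above can be put in perspective with a little arithmetic. The sketch below uses only the numbers listed on this page; the variable names are illustrative. It suggests an average of roughly 478 bytes per example and a download that is about 38% of the in-memory dataset size.

```python
# Figures copied from the dataset metadata on this page.
NUM_EXAMPLES = 1_436_054
DATASET_SIZE_BYTES = 686_139_541
DOWNLOAD_SIZE_BYTES = 260_540_997

# Average stored size of one example, in bytes.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download / dataset size:   {download_ratio:.3f}")
```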



