indonesian-nlp/id_newspapers_2018
Source: Hugging Face. Updated 2022-10-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/indonesian-nlp/id_newspapers_2018
Description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
pretty_name: Indonesian Newspapers 2018
---
# Dataset of Indonesian Online Newspaper
This is a copy of the dataset created by **Feryandi Nurdiantoro** (<https://github.com/feryandi/Dataset-Artikel>). The original dataset, in JSON format, is stored uncompressed on Google Drive as more than 500K files, one file per article. Unfortunately, because of its size, the whole dataset cannot be downloaded as one big compressed file (compressing it online takes forever). I therefore provide here a copy of the dataset and a cleaned version of it, both as compressed files.
The dataset contains around 500K articles (136M words) from 7 Indonesian newspapers: Detik, Kompas, Tempo, CNN Indonesia, Sindo, Republika and Poskota. The articles are dated between 1 January 2018 and 20 August 2018 (with a few exceptions dated earlier). The 500K uncompressed JSON files (newspapers-json.tgz) take up around 2.2GB, and the cleaned version, uncompressed into one big text file (newspapers.txt.gz), is about 1GB. The original source on Google Drive also contains a dataset in HTML format that includes the raw data (pictures, CSS, JavaScript, ...) from the online news websites. I have not copied it here, since it is about 60GB and NLP research mostly needs only the text content.
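As an illustration of working with the one-file-per-article layout described above, the following is a minimal sketch that walks a tar archive and parses each JSON member without unpacking 500K files to disk. It assumes the archive is a gzip-compressed tar whose members end in `.json`; the function name `iter_articles` is hypothetical, and no particular JSON field names are assumed because the schema is not documented here.

```python
import json
import tarfile
from typing import Any, Iterator


def iter_articles(archive_path: str) -> Iterator[Any]:
    """Yield one parsed JSON document per .json member of a tar archive."""
    # "r:*" lets tarfile detect the compression type by itself.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            # Skip directories and anything that is not a JSON file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            with extracted:
                try:
                    document = json.load(extracted)
                except ValueError:
                    # Covers malformed JSON and undecodable bytes.
                    continue
            yield document
```

A call such as `next(iter_articles("newspapers-json.tgz"))` would return the first parsed article, which is a convenient way to inspect the available fields before processing the full archive.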
The compressed files are:
* newspaper-json.gz: the original 500K JSON files, compressed.
* newspaper.txt.gz: a dump of all JSON files into one big cleaned text file, which is normally the only file needed for language model training.
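Since the cleaned text file is about 1GB uncompressed, it can be convenient to stream it rather than load it whole. The sketch below reads a gzip-compressed text file line by line and counts whitespace-separated tokens. It assumes the file is UTF-8 encoded plain text; the helper names `iter_lines` and `count_words` are hypothetical, and the layout of the file (for example, one sentence or one article per line) is not assumed.

```python
import gzip
from typing import Iterator


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty, stripped lines from a gzip-compressed text file."""
    # "rt" opens the gzip stream in text mode, so lines are decompressed
    # on the fly instead of being loaded into memory all at once.
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def count_words(path: str) -> int:
    """Count whitespace-separated tokens across the whole file."""
    return sum(len(line.split()) for line in iter_lines(path))
```

For example, `count_words("newspaper.txt.gz")` gives a rough word count that can be compared with the figure quoted above, keeping in mind that whitespace splitting is only an approximation of word tokenization.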
The license has been copied from the source:
## License
Proyek ini dilisensikan dibawah lisensi **Creative Commons Attribution-ShareAlike 4.0 International License**\*. Kumpulan data yang dibagikan bertujuan untuk ilmu pengetahuan, pembelajaran, dan penelitian Bahasa Indonesia (komputasi maupun lingusitik), dan hanya dapat digunakan untuk hal tersebut. Kepemilikan data untuk setiap artikel dimiliki oleh media yang bersangkutan dimana data tersebut diambil; dan pemilik repository ini tidak melakukan klaim kepemilikan atas konten tersebut. Jika Anda mendapati bahwa data ini telah melanggar suatu hak cipta; mohon kontak pengelola repository ini.
This work is licensed under a **Creative Commons Attribution-ShareAlike 4.0 International License**. The dataset is shared for the sole purpose of aiding open scientific research in Bahasa Indonesia (computing or linguistics), and can only be used for that purpose. The ownership of each article within the dataset belongs to the respective newspaper from which it was extracted; and the maintainer of the repository does not claim ownership of any of the content within it. If you think, by any means, that this dataset breaches any established copyrights; please contact the repository maintainer.
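Provider:
indonesian-nlp
Summary of original information
Dataset overview
Basic information
- Language: Indonesian (id)
- License: CC-BY-4.0
- Multilinguality: monolingual
- Size: 100K<n<1M
- Data source: original
Task type
- Task category: text generation
- Task ID: language modeling
Dataset contents
- Name: Indonesian Newspapers 2018
- Contents: around 500,000 articles from 7 Indonesian newspapers: Detik, Kompas, Tempo, CNN Indonesia, Sindo, Republika, Poskota. The articles are dated between 1 January 2018 and 20 August 2018.
- File formats:
  - newspaper-json.gz: the original 500,000 JSON files, compressed.
  - newspaper.txt.gz: a cleaned text file merging all JSON files, mainly used for language model training.
License details
- License: Creative Commons Attribution-ShareAlike 4.0 International License
- Purpose of use: only to support open scientific research on Indonesian (computing or linguistics).
- Copyright notice: ownership of each article belongs to the respective newspaper; the dataset maintainer does not claim ownership of any of the content.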



