ilhamfp/id_puisi
Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/ilhamfp/id_puisi
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
license:
- mit
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text2text-generation
- text-generation
- fill-mask
task_ids: []
paperswithcode_id: null
pretty_name: Indonesian Puisi
tags:
- poem-generation
dataset_info:
  features:
  - name: title
    dtype: string
  - name: author
    dtype: string
  - name: puisi
    dtype: string
  - name: puisi_with_header
    dtype: string
  splits:
  - name: train
    num_bytes: 10613475
    num_examples: 7223
  download_size: 10558108
  dataset_size: 10613475
---
# Dataset Card for id_puisi
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [puisi-pantun-generator](https://github.com/ilhamfp/puisi-pantun-generator)
- **Repository:** [puisi-pantun-generator](https://github.com/ilhamfp/puisi-pantun-generator)
- **Paper:** [N/A]
- **Leaderboard:** [N/A]
- **Point of Contact:** [Ilham Firdausi Putra](mailto:ilhamfputra31@gmail.com)
### Dataset Summary
Puisi (poem) is an Indonesian poetic form. The dataset contains 7,223 Indonesian puisi, each with its title and author.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Indonesian
## Dataset Structure
### Data Instances
```
{
'puisi_with_header': 'TEPERANGKAP
Oleh Mangku Langit Jingga
Mungkin kau membiarkan aku
Membiarkan perasaan ini larut
Memberi ruang jiwaku hampa
Agar tetap terbiasa nikmati
Perangkap yang kau buat
Perisai yang kau banggakan
Takkan jadi tameng bagimu
Aku mengerti betapa hebatnya
Perangkap mu hei sang dewi
Ku akan terus merasa terbiasa
Dengan pesona indahmu
Ku masih akan nikmati hadirmu
Berjalanlah pada hati yang sama
Satu hati denganku
Walau ku terperangkap
Namunku nikmati dan jalani',
'title': 'TEPERANGKAP',
'author': 'Oleh Mangku Langit Jingga',
'puisi': 'Mungkin kau membiarkan aku
Membiarkan perasaan ini larut
Memberi ruang jiwaku hampa
Agar tetap terbiasa nikmati
Perangkap yang kau buat
Perisai yang kau banggakan
Takkan jadi tameng bagimu
Aku mengerti betapa hebatnya
Perangkap mu hei sang dewi
Ku akan terus merasa terbiasa
Dengan pesona indahmu
Ku masih akan nikmati hadirmu
Berjalanlah pada hati yang sama
Satu hati denganku
Walau ku terperangkap
Namunku nikmati dan jalani',
}
```
### Data Fields
- `puisi_with_header`: the raw text from scraping
- `title`: the title extracted from the raw text using regex
- `author`: the author extracted from the raw text using regex
- `puisi`: the poem body, with the title and author removed using regex
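The card does not include the regular expressions that were actually used. As a rough illustration only, assuming the header layout seen in the sample instance (title on the first line, a byline starting with `Oleh` on the second, poem body after that), the three derived fields could be recovered along these lines. `HEADER_RE` and `split_header` are hypothetical names, not part of the dataset or its repository.

```python
import re

# Assumed layout (taken from the sample instance, not the curator's actual regex):
#   line 1: title, line 2: byline starting with "Oleh", remaining lines: poem body.
HEADER_RE = re.compile(
    r"\A(?P<title>[^\n]*)\n(?P<author>Oleh[^\n]*)\n(?P<puisi>.*)\Z",
    re.DOTALL,
)


def split_header(puisi_with_header: str) -> dict:
    """Split raw text into title, author and puisi; leave header fields empty on failure."""
    text = puisi_with_header.strip()
    match = HEADER_RE.match(text)
    if match is None:
        # No recognizable header: keep the whole text as the poem.
        return {"title": "", "author": "", "puisi": text}
    return {key: value.strip() for key, value in match.groupdict().items()}


raw = (
    "TEPERANGKAP\n"
    "Oleh Mangku Langit Jingga\n"
    "Mungkin kau membiarkan aku\n"
    "Membiarkan perasaan ini larut"
)
record = split_header(raw)
```

Note that, as in the sample instance, the `author` value keeps the leading `Oleh` ("by"); a poem whose header deviates from the assumed layout falls through to the empty-header branch.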
### Data Splits
The dataset contains only a train set.
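Because no validation or test split is provided, users who need one have to carve it out of the train set themselves. The following is a minimal sketch of a deterministic hold-out split in plain Python; the function name `train_validation_split`, the 10% fraction, and the seed are illustrative assumptions, and the stand-in rows are placeholders rather than real dataset content.

```python
import random


def train_validation_split(records, val_fraction=0.1, seed=42):
    """Shuffle records deterministically and split off a validation subset."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one record.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in rows; in practice these would be the rows of the train split.
rows = [{"title": f"poem {i}", "puisi": "..."} for i in range(20)]
train_rows, val_rows = train_validation_split(rows, val_fraction=0.1)
```

When the data is loaded with the Hugging Face `datasets` library, `Dataset.train_test_split` offers the same kind of split without custom code.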
## Dataset Creation
### Curation Rationale
The dataset was initially collected as an experiment in generating Indonesian poems using GPT-2.
### Source Data
#### Initial Data Collection and Normalization
The dataset was scraped using BeautifulSoup from lokerpuisi.web.id (the data no longer exists on the original blog). The `title` and `author` columns were produced by regex matching on the `puisi_with_header` column.
#### Who are the source language producers?
The poems were written by humans. Users of the original blog voluntarily submitted their original poems for publication on the blog.
### Annotations
#### Annotation process
[N/A]
#### Who are the annotators?
[N/A]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
The regex match used to extract the title and author from the raw text is not perfect. In some entries the title and text are not extracted correctly.
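The card does not say how a failed extraction shows up in the data. As a hedged sketch, one could flag suspicious rows with a simple heuristic before training: treat a row as unextracted when `title` or `author` is empty, or when `puisi` is identical to `puisi_with_header`. Both conditions and the helper name `looks_unextracted` are assumptions for illustration, not documented behavior.

```python
def looks_unextracted(record: dict) -> bool:
    """Heuristic: True when header fields look missing or were never stripped from the poem."""
    title = (record.get("title") or "").strip()
    author = (record.get("author") or "").strip()
    puisi = (record.get("puisi") or "").strip()
    raw = (record.get("puisi_with_header") or "").strip()
    if not title or not author:
        return True
    # If the poem body equals the raw text, nothing was removed from it.
    return puisi == raw


good = {
    "puisi_with_header": "TEPERANGKAP\nOleh Mangku Langit Jingga\nMungkin kau membiarkan aku",
    "title": "TEPERANGKAP",
    "author": "Oleh Mangku Langit Jingga",
    "puisi": "Mungkin kau membiarkan aku",
}
bad = {
    "puisi_with_header": "Mungkin kau membiarkan aku",
    "title": "",
    "author": "",
    "puisi": "Mungkin kau membiarkan aku",
}
flags = [looks_unextracted(good), looks_unextracted(bad)]
```

Rows flagged this way can be inspected by hand or dropped, depending on how much noise the downstream task tolerates.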
## Additional Information
### Dataset Curators
Ilham Firdausi Putra
### Licensing Information
MIT License
### Citation Information
[N/A]
### Contributions
Thanks to [@ilhamfp](https://github.com/ilhamfp) for adding this dataset.
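Provided by:
ilhamfp
Original information summary
Dataset overview
Dataset name
- Name: Indonesian Puisi
Basic dataset information
- Language: Indonesian
- License: MIT
- Multilinguality: monolingual
- Size: 1K<n<10K
- Source: original data
- Task categories: text-to-text generation, text generation, fill-mask
Dataset structure
- Data instances: each contains the title, the author, the poem, and the poem with its header
- Data fields:
  - `puisi_with_header`: raw text
  - `title`: title
  - `author`: author
  - `puisi`: poem body
- Data splits: train set only, 7,223 instances in total
Dataset creation
- Curation rationale: an experiment in generating Indonesian poems
- Source data:
  - Initial data collection: scraped with BeautifulSoup from lokerpuisi.web.id (the data no longer exists on the original blog)
  - Language producers: written by humans
- Annotations: none
Considerations for using the data
- Known limitations: extracting the title and author from the raw text with regular expressions may not be fully accurate
Additional information
- Dataset curators: Ilham Firdausi Putra
- Licensing information: MIT License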



