
ilhamfp/id_puisi

Hugging Face | Updated 2024-01-18 | Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/ilhamfp/id_puisi
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
license:
- mit
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text2text-generation
- text-generation
- fill-mask
task_ids: []
paperswithcode_id: null
pretty_name: Indonesian Puisi
tags:
- poem-generation
dataset_info:
  features:
  - name: title
    dtype: string
  - name: author
    dtype: string
  - name: puisi
    dtype: string
  - name: puisi_with_header
    dtype: string
  splits:
  - name: train
    num_bytes: 10613475
    num_examples: 7223
  download_size: 10558108
  dataset_size: 10613475
---

# Dataset Card for id_puisi

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [puisi-pantun-generator](https://github.com/ilhamfp/puisi-pantun-generator)
- **Repository:** [puisi-pantun-generator](https://github.com/ilhamfp/puisi-pantun-generator)
- **Paper:** [N/A]
- **Leaderboard:** [N/A]
- **Point of Contact:** [Ilham Firdausi Putra](ilhamfputra31@gmail.com)

### Dataset Summary

Puisi (poem) is an Indonesian poetic form. The dataset contains 7223 Indonesian puisi, each with its title and author.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

Indonesian

## Dataset Structure

### Data Instances

```
{
  'puisi_with_header': 'TEPERANGKAP Oleh Mangku Langit Jingga Mungkin kau membiarkan aku Membiarkan perasaan ini larut Memberi ruang jiwaku hampa Agar tetap terbiasa nikmati Perangkap yang kau buat Perisai yang kau banggakan Takkan jadi tameng bagimu Aku mengerti betapa hebatnya Perangkap mu hei sang dewi Ku akan terus merasa terbiasa Dengan pesona indahmu Ku masih akan nikmati hadirmu Berjalanlah pada hati yang sama Satu hati denganku Walau ku terperangkap Namunku nikmati dan jalani',
  'title': 'TEPERANGKAP',
  'author': 'Oleh Mangku Langit Jingga',
  'puisi': 'Mungkin kau membiarkan aku Membiarkan perasaan ini larut Memberi ruang jiwaku hampa Agar tetap terbiasa nikmati Perangkap yang kau buat Perisai yang kau banggakan Takkan jadi tameng bagimu Aku mengerti betapa hebatnya Perangkap mu hei sang dewi Ku akan terus merasa terbiasa Dengan pesona indahmu Ku masih akan nikmati hadirmu Berjalanlah pada hati yang sama Satu hati denganku Walau ku terperangkap Namunku nikmati dan jalani',
}
```

### Data Fields

- `puisi_with_header`: the raw scraped text
- `title`: the title, extracted from the raw text using regex
- `author`: the author, extracted from the raw text using regex
- `puisi`: the poem with the title and author removed using regex

### Data Splits

The dataset contains only a train set.

## Dataset Creation

### Curation Rationale

The dataset was initially collected for an experiment in generating Indonesian poems with GPT-2.

### Source Data

#### Initial Data Collection and Normalization

The dataset was scraped using BeautifulSoup from lokerpuisi.web.id (the data no longer exists on the original blog). The `title` and `author` columns were produced by regex matching on the `puisi_with_header` column.
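
The card does not publish the regex that was actually used, so the following is only a minimal sketch of how such an extraction could work. It assumes a layout that the card does not confirm: the title on the first line, an "Oleh <name>" byline on the second, and the poem body after that. `HEADER_PATTERN` and `split_header` are hypothetical names introduced for illustration.

```python
import re

# Assumed layout (NOT confirmed by the dataset card): title on the first line,
# an "Oleh <name>" byline on the next line, then the poem body.
HEADER_PATTERN = re.compile(
    r"\A\s*(?P<title>[^\n]+)\n\s*(?P<author>Oleh[^\n]*)\n(?P<puisi>.*)\Z",
    re.DOTALL,
)


def split_header(puisi_with_header: str) -> dict:
    """Split raw text into title, author, and poem body (hypothetical helper)."""
    match = HEADER_PATTERN.match(puisi_with_header)
    if match is None:
        # No recognizable header: leave title and author empty, keep the text.
        return {"title": "", "author": "", "puisi": puisi_with_header.strip()}
    return {
        "title": match.group("title").strip(),
        # The byline keeps its "Oleh" prefix, as in the card's sample instance.
        "author": match.group("author").strip(),
        "puisi": match.group("puisi").strip(),
    }


sample = (
    "TEPERANGKAP\n"
    "Oleh Mangku Langit Jingga\n"
    "\n"
    "Mungkin kau membiarkan aku\n"
    "Membiarkan perasaan ini larut"
)
print(split_header(sample))
```

Text that does not match the assumed layout falls through to the empty-header branch, which is one plausible way for extraction failures to appear in the data.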

#### Who are the source language producers?

The poems were written by humans. Users of the original blog voluntarily submitted their original poems for publication on the blog.

### Annotations

#### Annotation process

[N/A]

#### Who are the annotators?

[N/A]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The regex match used to extract the title and author from the raw text is not perfect. Some entries still fail to be extracted.

## Additional Information

### Dataset Curators

Ilham Firdausi Putra

### Licensing Information

MIT License

### Citation Information

[N/A]

### Contributions

Thanks to [@ilhamfp](https://github.com/ilhamfp) for adding this dataset.
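
Because the card notes under Other Known Limitations that the regex extraction is imperfect, a quick sanity check before training may be useful. The sketch below assumes that a failed extraction shows up as a blank `title` or `author`, which the card does not state. `load_train_split` and `find_suspect_rows` are hypothetical helper names; the loader relies on the Hugging Face `datasets` library and needs network access, so it is defined but not called here.

```python
from typing import Iterable, List, Mapping


def load_train_split():
    """Load the train split with the Hugging Face `datasets` library."""
    # Imported lazily so the checker below also works without `datasets`.
    from datasets import load_dataset

    return load_dataset("ilhamfp/id_puisi", split="train")


def find_suspect_rows(rows: Iterable[Mapping[str, str]]) -> List[int]:
    """Return indices of rows whose title or author is missing or blank.

    Assumption: a failed regex extraction leaves these fields empty.
    """
    suspects = []
    for index, row in enumerate(rows):
        title = (row.get("title") or "").strip()
        author = (row.get("author") or "").strip()
        if not title or not author:
            suspects.append(index)
    return suspects


# Small in-memory stand-in for the real split; the second row is invented
# to illustrate a failed extraction.
rows = [
    {
        "title": "TEPERANGKAP",
        "author": "Oleh Mangku Langit Jingga",
        "puisi": "Mungkin kau membiarkan aku",
    },
    {"title": "", "author": "", "puisi": "Teks tanpa header"},
]
print(find_suspect_rows(rows))  # prints [1]
```

With the library installed, `find_suspect_rows(load_train_split())` would apply the same check to the full train split.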
Provider:
ilhamfp
Summary of original information

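Dataset overview

Dataset name

  • Name: Indonesian Puisi

Basic dataset information

  • Language: Indonesian
  • License: MIT
  • Multilinguality: monolingual
  • Size: 1K<n<10K
  • Source: original data
  • Task categories: text2text-generation, text-generation, fill-mask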

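Dataset structure

  • Data instances: each contains a title, an author, the poem, and the poem with its header
  • Data fields:
    • puisi_with_header: the raw text
    • title: the title
    • author: the author
    • puisi: the poem text
  • Data splits: train set only, 7223 instances in total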

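Dataset creation

  • Curation rationale: collected for an experiment in generating Indonesian poems
  • Source data:
    • Initial data collection: scraped with BeautifulSoup from lokerpuisi.web.id (the data no longer exists on the original blog)
    • Language producers: written by humans
  • Annotations: none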

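Considerations for using the data

  • Known limitations: the regex extraction of the title and author from the raw text may not be fully accurate

Additional information

  • Dataset curator: Ilham Firdausi Putra
  • Licensing information: MIT License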