
Rootport/nz-traditional-haiku

Hugging Face · Updated 2026-05-26 · Indexed 2026-05-31
Download link:
https://hf-mirror.com/datasets/Rootport/nz-traditional-haiku
Description:
---
language:
- ja
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- haiku
- japanese
- poetry
- classical-japanese
pretty_name: Traditional Japanese Haiku Dataset (Edo-Meiji Era)
---

# Traditional Japanese Haiku Dataset (Edo–Meiji Era)

> ⚠️ **Work in Progress: Pre-Release Draft**
>
> This dataset is **not yet ready for general use**. It is being uploaded primarily as a personal backup snapshot during active development. Schema, annotations, and documentation may change without notice. Approximately **22% of records (~2,800)** are still flagged `annotation_status = "needs_review"` and have **not yet undergone human review**.
>
> If you arrived here unexpectedly, please check back later; this notice will be removed when the dataset is ready for downstream use.
>
> (Translated from the Japanese notice.) This dataset is **still under construction**. For now it is published partly as the author's own backup, and the schema, annotations, and documentation may change without notice. About 22% of records (2,800) carry `annotation_status = "needs_review"` and have **not yet been through human review**.

---

## Dataset Summary

A machine-readable corpus of **12,787 classical Japanese haiku** from seven public-domain poets spanning the Edo period to the Shōwa era (poet lifespans 1644–1959). Each record carries:

- Original source-text spelling (`text`) preserving historical orthography
- Modern hiragana reading (`kana`) and 5-7-5 segmentation (`segments`, `mora_counts`)
- Seasonal classification (`season`) and **dual-form kigo annotation** keeping both the source-text spelling and the standard saijiki headword
- Form metadata (`form`, `is_regular_575`) distinguishing fixed-form haiku from free verse (Santōka, Hōsai)
- Author metadata with birth/death years and per-poet style tags

The corpus is designed primarily as training data for **haiku generation models**, including diffusion-based approaches, where the prompt input may vary from formal saijiki headwords (`時雨` / `しぐれ`) to descriptive natural language (`寒い夜`).
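Because the notice above says roughly 22% of records are still flagged `needs_review`, a consumer may want to exclude them before training. The following is a minimal sketch, assuming records are available as plain Python dicts (for example, parsed from a JSON Lines file). The helper name `usable_for_training`, its `require_575` parameter, and the toy records are hypothetical; only the field names and values come from the dataset card.

```python
def usable_for_training(record: dict, require_575: bool = False) -> bool:
    """Return True if the record is not flagged for review and,
    when require_575 is set, is also a regular 5-7-5 haiku."""
    if record.get("annotation_status") == "needs_review":
        return False
    if require_575 and not record.get("is_regular_575", False):
        return False
    return True


# Hypothetical toy records; only the field names come from the dataset card.
records = [
    {"id": "a", "annotation_status": "ai_generated", "is_regular_575": True},
    {"id": "b", "annotation_status": "needs_review", "is_regular_575": True},
    {"id": "c", "annotation_status": "verified", "is_regular_575": False},
]

kept = [r["id"] for r in records if usable_for_training(r)]
strict = [r["id"] for r in records if usable_for_training(r, require_575=True)]
print(kept)    # ['a', 'c']
print(strict)  # ['a']
```

Keeping the 5-7-5 requirement optional leaves the free-verse poems (Santōka, Hōsai) available when they are wanted.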
## Overview

A machine-readable dataset annotating **12,787 haiku by 7 poets** whose works are out of copyright, spanning the Edo period to the Shōwa era (1644–1959). Each record adds a modern kana reading, mora segmentation, kigo (keeping both the source-text spelling and the saijiki headword), and author metadata. It is designed primarily as training data for **haiku generation models (including diffusion-based ones)**.

---

## Dataset Structure

### Data Fields

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique ID, format `author_slug_NNNNNN` (e.g. `matsuo_basho_000001`) |
| `text` | string | Original haiku text, preserving source orthography (including historical kana usage, 旧仮名遣い) |
| `text_spaced` | string | Text with a half-width space at the 5-7-5 boundaries |
| `kana` | string | Modern hiragana reading, segments joined by half-width spaces |
| `normalized_kana` | string | Normalized hiragana for training (typically identical to `kana`) |
| `segments` | list[string] | Reading split into 5-7-5 segments (1–3 elements for free verse) |
| `mora_counts` | list[int] | Mora count per segment; sums to `total_mora_count` |
| `total_mora_count` | int | Sum of `mora_counts` |
| `form` | string | One of `5-7-5`, `5-7-5-extra`, `free`, `unknown` |
| `is_regular_575` | bool | True iff `mora_counts == [5, 7, 5]` |
| `season` | string | One of `spring`, `summer`, `autumn`, `winter`, `new_year`, `nonseasonal`, `unknown` |
| `kigo` | list[dict] | Seasonal words; each entry has `word`, `kana`, `canonical_word`, `canonical_kana`, `season`, `category` |
| `author` | string | Poet name in kanji |
| `author_kana` | string | Poet name in hiragana |
| `author_birth_year` | int | Poet's birth year |
| `author_death_year` | int | Poet's death year |
| `source` | string | Source publication |
| `mood`, `mood_ja` | list[string] | Mood labels (currently placeholder `["unknown"]` / `["不明"]`) |
| `scene`, `scene_ja` | list[string] | Scene/motif labels; `scene_ja` is populated, `scene` is a placeholder |
| `style`, `style_ja` | list[string] | Per-author default style tags |
| `rights` | dict | `{text_status: "public_domain", annotation_license: "cc-by-4.0"}` |
| `annotation_status` | string | One of `ai_generated`, `needs_review`, `verified` |

### Mora Counting

- Small kana used for glides and contracted sounds (ょ・ゅ・ゃ・ぁ etc.) count as **0 mora**.
- Geminate (っ), syllabic n (ん), and long mark (ー) each count as **1 mora**.
- Example: 「ちょうちん」 = ち + (ょ) + う + ち + ん = **4 mora**.

### Example Instance

```json
{
  "id": "matsuo_basho_001234",
  "text": "古池や蛙飛び込む水の音",
  "text_spaced": "古池や 蛙飛び込む 水の音",
  "kana": "ふるいけや かわずとびこむ みずのおと",
  "normalized_kana": "ふるいけや かわずとびこむ みずのおと",
  "segments": ["ふるいけや", "かわずとびこむ", "みずのおと"],
  "mora_counts": [5, 7, 5],
  "total_mora_count": 17,
  "form": "5-7-5",
  "is_regular_575": true,
  "season": "spring",
  "kigo": [
    {"word": "蛙", "kana": "かわず", "canonical_word": "蛙", "canonical_kana": "かわず", "season": "spring", "category": "animal"}
  ],
  "author": "松尾芭蕉",
  "author_kana": "まつおばしょう",
  "author_birth_year": 1644,
  "author_death_year": 1694,
  "rights": {"text_status": "public_domain", "annotation_license": "cc-by-4.0"},
  "annotation_status": "ai_generated"
}
```

(Illustrative; the actual record corresponding to "古池や" has a different ID.)

### Loading

```python
from datasets import load_dataset

# Download (or reuse the cached copy of) the dataset from the Hugging Face Hub.
ds = load_dataset("Rootport/nz-traditional-haiku")
print(ds)
```

Or, with raw file access:

```python
import json

# Read the JSON Lines file: one record per non-empty line.
with open("haiku_master.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
```

---

## Authors

| Slug | Name (kanji) | Name (kana) | Birth–Death | Count | Style |
|---|---|---|---|---:|---|
| `matsuo_basho` | 松尾芭蕉 | まつおばしょう | 1644–1694 | 4,231 | classical / lyrical |
| `yosa_buson` | 与謝蕪村 | よさぶそん | 1716–1784 | 867 | classical / ornate / lyrical |
| `kobayashi_issa` | 小林一茶 | こばやしいっさ | 1763–1828 | 1,098 | classical / colloquial / humorous |
| `masaoka_shiki` | 正岡子規 | まさおかしき | 1867–1902 | 3,745 | classical / observational / plain |
| `takahama_kyoshi` | 高浜虚子 | たかはまきょし | 1874–1959 | 1,736 | classical / observational |
| `taneda_santoka` | 種田山頭火 | たねださんとうか | 1882–1940 | 700 | free verse / colloquial / lyrical |
| `ozaki_hosai` | 尾崎放哉 | おざきほうさい | 1885–1926 | 410 | free verse / colloquial / philosophical |

## Sources

Source texts are compiled from public-domain editions:

- **Matsuo Bashō**: 芭蕉俳句全集 (Yamanashi Prefectural University)
- **Yosa Buson**: ネット歳時記「きごさい」
- **Kobayashi Issa**: JANIS 一茶発句全集 + user-curated popular haiku
- **Masaoka Shiki**: 青空文庫 (Aozora Bunko)
- **Takahama Kyoshi**: 青空文庫
- **Taneda Santōka**: 青空文庫・草木塔
- **Ozaki Hōsai**: 青空文庫・尾崎放哉選句集

All source texts are out of copyright in Japan. The latest death year is 1959, so the 50-year term then in force expired at the end of 2009, before Japan's 2018 extension to a 70-year term, which did not revive copyrights that had already lapsed.

---

## License and Rights

- **Original haiku text** (the `text` field, in source orthography): **Public Domain**. All seven poets died before 1960; Japanese copyright has lapsed.
- **Annotations** (readings, segments, mora counts, kigo classification, seasonal tags, scene labels, and all other added metadata fields): **Creative Commons Attribution 4.0 International (CC BY 4.0)**.

The schema's `rights` field on each record encodes this dual structure explicitly.

## Annotation Methodology

- Readings (`kana`, `segments`) were produced by **direct LLM annotation (Claude)**, not by general-purpose reading converters such as pykakasi. An early experiment using pykakasi on a 410-record subset produced numerous systematic errors (e.g. 人 → にん, 月 → がつ, 哉 → や) that would have cost more in manual correction than the additional data was worth.
- For fixed-form poets (Bashō, Buson, Issa, Shiki, Kyoshi, early Santōka, early Hōsai), readings that fit the 5-7-5 pattern are preferred when ambiguous (e.g. 庵 = いお rather than いおり when the alternative would overflow the segment).
- For free-verse poets (later Santōka, later Hōsai), natural readings are preferred without forcing the 5-7-5 fit.
- Seasonal words (kigo) carry **both** the source-text spelling and the standard saijiki headword form (`canonical_word` / `canonical_kana`). This dual form lets downstream models handle both descriptive prompts (「寒い夜」) and formal saijiki prompts (「時雨」) robustly.

---

## Limitations and Known Issues

This is a **preliminary release**. Specific caveats:

1. **~22% of records (~2,800) carry `annotation_status = "needs_review"`** and have not yet been human-reviewed. Many are likely deletion candidates: section headers, biographical metadata, editorial notes, quoted poems by other poets, and similar non-haiku content that survived from the source publications. A reviewer-input pass is planned before the v1.0 release.
2. **Bashō hiragana-reading rows**: roughly 1,180 records (mostly with IDs ≥ `matsuo_basho_002500`) are pure-hiragana phonetic versions of adjacent kanji-form haiku, retained from the source publication. They are intentionally preserved as separate records because hiragana-only training data is useful for downstream generation models. Be aware that such pairs of records share the same semantic content.
3. **Source quality varies by author.** The Shiki collection (青空文庫) includes substantial non-haiku material from the poet's late-period literary criticism: quoted classical poems by other authors, prose fragments, and table-of-contents headers. `needs_review` rates in those sections exceed 95%, and most entries are deletion candidates.
4. **Mood labels are placeholders.** `mood = ["unknown"]` / `mood_ja = ["不明"]` for every record. A separate downstream LLM annotation pass is planned.
5. **Scene labels are partial.** `scene_ja` (Japanese) is populated; `scene` (English) is a placeholder pending downstream translation.

---

## Citation

```bibtex
@dataset{nz_traditional_haiku_2026,
  title     = {Traditional Japanese Haiku Dataset (Edo--Meiji Era)},
  author    = {Rootport},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Rootport/nz-traditional-haiku},
  note      = {Pre-release draft. 12,787 records from 7 public-domain poets.}
}
```
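The rules under "Mora Counting" can be sketched as a small function, which may help when validating `mora_counts` against `kana` or when scoring generated haiku. This is only an illustration of the stated rules, not the author's annotation pipeline. The names `count_mora` and `mora_counts_for` are hypothetical, and the exact set of zero-mora small kana is an assumption, since the card lists only "ょ・ゅ・ゃ・ぁ etc.".

```python
import unicodedata

# Small kana that merge with the preceding kana and add no mora.
# Assumption: the card lists only "ょ・ゅ・ゃ・ぁ etc."; this full set is a guess.
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎ")


def count_mora(reading: str) -> int:
    """Count mora in a hiragana reading.

    Small glide kana count 0; every other kana, including
    っ, ん, and the long mark ー, counts 1. Whitespace is ignored.
    """
    # NFC keeps voiced kana such as ず as a single code point.
    text = unicodedata.normalize("NFC", reading)
    return sum(1 for ch in text if not ch.isspace() and ch not in SMALL_KANA)


def mora_counts_for(kana: str) -> list[int]:
    """Per-segment mora counts for a space-separated `kana` string."""
    return [count_mora(segment) for segment in kana.split()]


print(count_mora("ちょうちん"))  # 4
print(mora_counts_for("ふるいけや かわずとびこむ みずのおと"))  # [5, 7, 5]
```

Comparing the result with a record's stored `mora_counts` is one inexpensive consistency check; a mismatch may simply mean the small-kana set assumed here differs from the one used during annotation.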
Provider:
Rootport