Rootport/nz-traditional-haiku
Source: Hugging Face. Updated 2026-05-26; indexed 2026-05-31.
Download link:
https://hf-mirror.com/datasets/Rootport/nz-traditional-haiku
Resource description:
---
language:
- ja
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- haiku
- japanese
- poetry
- classical-japanese
pretty_name: Traditional Japanese Haiku Dataset (Edo-Meiji Era)
---
# Traditional Japanese Haiku Dataset (Edo–Meiji Era)
> ⚠️ **Work in Progress — Pre-Release Draft**
>
> This dataset is **not yet ready for general use**. It is being uploaded primarily as a personal backup snapshot during active development. Schema, annotations, and documentation may change without notice. Approximately **22% of records (~2,800)** are still flagged `annotation_status = "needs_review"` and have **not yet undergone human review**.
>
> If you arrived here unexpectedly, please check back later — this notice will be removed when the dataset is ready for downstream use.
>
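> (Translated from the Japanese notice.) This dataset is **still under construction**. For now it is published partly as the author's own backup, and the schema, annotations, and documentation may change without notice. About 22% of records (2,800) are marked `annotation_status = "needs_review"` and **have not yet had human review applied**.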
---
## Dataset Summary
A machine-readable corpus of **12,787 classical Japanese haiku** by seven public-domain poets whose lifetimes span the Edo through Shōwa eras (1644–1959). Each record carries:
- Original source-text spelling (`text`) preserving historical orthography
- Modern hiragana reading (`kana`) and 5-7-5 segmentation (`segments`, `mora_counts`)
- Seasonal classification (`season`) and **dual-form kigo annotation** keeping both the source-text spelling AND the standard saijiki headword
- Form metadata (`form`, `is_regular_575`) distinguishing fixed-form haiku from free verse (Santōka, Hōsai)
- Author metadata with birth/death years and per-poet style tags
Designed primarily as training data for **haiku generation models** — including diffusion-based approaches — where the prompt input may vary from formal saijiki headwords (`時雨` / `しぐれ`) to descriptive natural language (`寒い夜`).
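## Overview
A machine-readable dataset annotating **12,787 haiku by seven poets** whose works are out of copyright, spanning the Edo through Shōwa eras (1644–1959). Each record carries a modern kana reading, mora segmentation, kigo (kept in both the source-text spelling and the saijiki headword form), and author metadata. It is designed mainly as training data for **haiku generation models (including diffusion-based ones)**.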
---
## Dataset Structure
### Data Fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique ID, format `author_slug_NNNNNN` (e.g. `matsuo_basho_000001`) |
| `text` | string | Original haiku text, preserving source orthography (旧仮名遣い含む) |
| `text_spaced` | string | Text with half-width space at 5-7-5 boundaries |
| `kana` | string | Modern hiragana reading, segments joined by half-width spaces |
| `normalized_kana` | string | Normalized hiragana for training (typically identical to `kana`) |
| `segments` | list[string] | Reading split into 5-7-5 segments (1–3 elements for free verse) |
| `mora_counts` | list[int] | Mora count per segment; sums to `total_mora_count` |
| `total_mora_count` | int | Sum of `mora_counts` |
| `form` | string | One of `5-7-5`, `5-7-5-extra`, `free`, `unknown` |
| `is_regular_575` | bool | True iff `mora_counts == [5, 7, 5]` |
| `season` | string | One of `spring`, `summer`, `autumn`, `winter`, `new_year`, `nonseasonal`, `unknown` |
| `kigo` | list[dict] | Seasonal words; each entry has `word`, `kana`, `canonical_word`, `canonical_kana`, `season`, `category` |
| `author` | string | Poet name in kanji |
| `author_kana` | string | Poet name in hiragana |
| `author_birth_year` | int | Poet's birth year (e.g. `1644`) |
| `author_death_year` | int | Poet's death year (e.g. `1694`) |
| `source` | string | Source publication |
| `mood`, `mood_ja` | list[string] | Mood labels (currently placeholder `["unknown"]` / `["不明"]`) |
| `scene`, `scene_ja` | list[string] | Scene/motif labels; `scene_ja` populated, `scene` placeholder |
| `style`, `style_ja` | list[string] | Per-author default style tags |
| `rights` | dict | `{text_status: "public_domain", annotation_license: "cc-by-4.0"}` |
| `annotation_status` | string | One of `ai_generated`, `needs_review`, `verified` |
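The invariants stated in the table (segment and mora lists of equal length, `mora_counts` summing to `total_mora_count`, the definition of `is_regular_575`, and the ID format) can be checked mechanically. The following is a minimal sketch, assuming records are plain dicts as read from the JSONL file. `check_record` and `ID_PATTERN` are hypothetical names, not part of the dataset's tooling, and the regex assumes slugs contain only lowercase letters and underscores, which matches the slugs listed in this card but is not stated as a rule.

```python
import re

# Assumed ID shape: lowercase slug with underscores, then a 6-digit serial.
ID_PATTERN = re.compile(r"^[a-z]+(?:_[a-z]+)*_\d{6}$")


def check_record(rec):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not ID_PATTERN.match(rec["id"]):
        problems.append("id does not match author_slug_NNNNNN")
    if len(rec["segments"]) != len(rec["mora_counts"]):
        problems.append("segments and mora_counts differ in length")
    if sum(rec["mora_counts"]) != rec["total_mora_count"]:
        problems.append("mora_counts do not sum to total_mora_count")
    # is_regular_575 is documented as true iff mora_counts == [5, 7, 5].
    if rec["is_regular_575"] != (rec["mora_counts"] == [5, 7, 5]):
        problems.append("is_regular_575 disagrees with mora_counts")
    return problems
```

Running this over all records before training is one way to catch schema drift while the dataset is still changing.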
### Mora Counting
- Small kana that combine with the preceding kana (ょ・ゅ・ゃ・ぁ etc., as in yōon such as ちょ) count as **0 mora**.
- Geminate (っ), syllabic n (ん), and long mark (ー) each count as **1 mora**.
- Example: 「ちょうちん」 = ち + (ょ) + う + ち + ん = **4 mora**.
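These rules can be sketched as a short counting function. This is an illustration under stated assumptions, not the annotation pipeline itself: `count_mora` is a hypothetical helper, and the exact set of zero-mora small kana is an assumption, since the rule above only lists "ょ・ゅ・ゃ・ぁ etc.".

```python
# Assumed set of small kana that merge with the preceding kana (0 mora).
# The geminate っ is deliberately absent: it counts as 1 mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎ")


def count_mora(kana):
    """Count morae in a hiragana string, ignoring segment-separating spaces."""
    return sum(
        1 for ch in kana
        if ch not in SMALL_KANA and not ch.isspace()
    )


print(count_mora("ちょうちん"))  # 4, as in the example above
print([count_mora(s) for s in "ふるいけや かわずとびこむ みずのおと".split()])  # [5, 7, 5]
```

Because っ, ん, and ー are ordinary characters as far as this function is concerned, each contributes one mora without special handling.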
### Example Instance
```json
{
  "id": "matsuo_basho_001234",
  "text": "古池や蛙飛び込む水の音",
  "text_spaced": "古池や 蛙飛び込む 水の音",
  "kana": "ふるいけや かわずとびこむ みずのおと",
  "normalized_kana": "ふるいけや かわずとびこむ みずのおと",
  "segments": ["ふるいけや", "かわずとびこむ", "みずのおと"],
  "mora_counts": [5, 7, 5],
  "total_mora_count": 17,
  "form": "5-7-5",
  "is_regular_575": true,
  "season": "spring",
  "kigo": [
    {"word": "蛙", "kana": "かわず",
     "canonical_word": "蛙", "canonical_kana": "かわず",
     "season": "spring", "category": "animal"}
  ],
  "author": "松尾芭蕉",
  "author_kana": "まつおばしょう",
  "author_birth_year": 1644,
  "author_death_year": 1694,
  "rights": {"text_status": "public_domain", "annotation_license": "cc-by-4.0"},
  "annotation_status": "ai_generated"
}
```
(Illustrative; the actual record corresponding to "古池や" has a different ID.)
### Loading
```python
from datasets import load_dataset
ds = load_dataset("Rootport/nz-traditional-haiku")
print(ds)
```
Or, with raw file access:
```python
import json

# Read one JSON object per line from the raw JSONL file, skipping blank lines.
with open("haiku_master.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
```
---
## Authors
| Slug | Name (kanji) | Name (kana) | Birth–Death | Count | Style |
|---|---|---|---|---:|---|
| `matsuo_basho` | 松尾芭蕉 | まつおばしょう | 1644–1694 | 4,231 | classical / lyrical |
| `yosa_buson` | 与謝蕪村 | よさぶそん | 1716–1784 | 867 | classical / ornate / lyrical |
| `kobayashi_issa` | 小林一茶 | こばやしいっさ | 1763–1828 | 1,098 | classical / colloquial / humorous |
| `masaoka_shiki` | 正岡子規 | まさおかしき | 1867–1902 | 3,745 | classical / observational / plain |
| `takahama_kyoshi` | 高浜虚子 | たかはまきょし | 1874–1959 | 1,736 | classical / observational |
| `taneda_santoka` | 種田山頭火 | たねださんとうか | 1882–1940 | 700 | free verse / colloquial / lyrical |
| `ozaki_hosai` | 尾崎放哉 | おざきほうさい | 1885–1926 | 410 | free verse / colloquial / philosophical |
## Sources
Source texts are compiled from public-domain editions:
- **Matsuo Bashō**: 芭蕉俳句全集 (Yamanashi Prefectural University)
- **Yosa Buson**: ネット歳時記「きごさい」
- **Kobayashi Issa**: JANIS 一茶発句全集 + user-curated popular haiku
- **Masaoka Shiki**: 青空文庫 (Aozora Bunko)
- **Takahama Kyoshi**: 青空文庫
- **Taneda Santōka**: 青空文庫・草木塔
- **Ozaki Hōsai**: 青空文庫・尾崎放哉選句集
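All source texts are out of copyright in Japan: the latest death year is 1959, and protection expired under the 50-year term that applied before Japan's 2018 extension to 70 years, which did not revive already-expired copyrights.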
---
## License and Rights
- **Original haiku text** (the `text` field, in source orthography): **Public Domain**. All seven poets died before 1960; Japanese copyright has lapsed.
- **Annotations** (readings, segments, mora counts, kigo classification, seasonal tags, scene labels, and all other added metadata fields): **Creative Commons Attribution 4.0 International (CC BY 4.0)**.
The schema's `rights` field on each record encodes this dual structure explicitly.
## Annotation Methodology
- Readings (`kana`, `segments`) were produced by **direct LLM annotation (Claude)**, not by general-purpose kana-conversion tools such as pykakasi. An early experiment using pykakasi on a 410-record subset produced numerous systematic errors (e.g. 人 → にん, 月 → がつ, 哉 → や), and correcting them by hand would have cost more than the additional data was worth.
- For fixed-form poets (Bashō, Buson, Issa, Shiki, Kyoshi, early Santōka, early Hōsai), readings that fit the 5-7-5 pattern are preferred when ambiguous (e.g. 庵 = いお rather than いおり when the alternative would overflow the segment).
- For free-verse poets (later Santōka, later Hōsai), natural readings are preferred without forcing the 5-7-5 fit.
- Seasonal words (kigo) carry **both** the source-text spelling AND the standard saijiki headword form (`canonical_word` / `canonical_kana`). This dual form lets downstream models handle both descriptive prompts (「寒い夜」) and formal saijiki prompts (「時雨」) robustly.
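One way to use the dual-form kigo annotation is to build a lookup from every surface form to its season, so that a prompt containing either the source-text spelling or the saijiki headword resolves to the same entry. The sketch below assumes records are dicts with the `kigo` structure described under Data Fields; `build_kigo_index` is a hypothetical helper name, and this is not necessarily how the author's training pipeline consumes the field.

```python
from collections import defaultdict


def build_kigo_index(records):
    """Map each kigo form (spelling, headword, and readings) to its seasons."""
    index = defaultdict(set)
    for rec in records:
        for entry in rec.get("kigo", []):
            # Index all four forms so either prompt style finds the entry.
            for key in ("word", "kana", "canonical_word", "canonical_kana"):
                form = entry.get(key)
                if form:
                    index[form].add(entry["season"])
    return index
```

A set is used as the value because the same written form could, in principle, be annotated under more than one season across records.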
---
## Limitations and Known Issues
This is a **preliminary release**. Specific caveats:
1. **~22% of records (~2,800) carry `annotation_status = "needs_review"`** and have not yet been human-reviewed. Many are likely deletion candidates: section headers, biographical metadata, editorial notes, quoted poems by other poets, and similar non-haiku content that survived from the source publications. A reviewer-input pass is planned before the v1.0 release.
2. **Bashō hiragana-reading rows**: roughly 1,180 records (mostly with IDs ≥ `matsuo_basho_002500`) are pure-hiragana phonetic versions of adjacent kanji-form haiku, retained from the source publication. They are intentionally preserved as separate records because hiragana-only training data is useful for downstream generation models. Be aware that pairs of records may share semantic content.
3. **Source quality varies by author.** The Shiki collection (青空文庫) includes substantial non-haiku material from the poet's late-period literary criticism: quoted classical poems by other authors, prose fragments, and table-of-contents headers. In those sections more than 95% of records are flagged `needs_review`, and most entries are deletion candidates.
4. **Mood labels are placeholders.** `mood = ["unknown"]` / `mood_ja = ["不明"]` for every record. A separate downstream LLM annotation pass is planned.
5. **Scene labels are partial.** `scene_ja` (Japanese) is populated; `scene` (English) is placeholder pending downstream translation.
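Given these caveats, a downstream user may want to drop unreviewed records, and optionally the hiragana-only duplicates, before training. The following is a minimal sketch, assuming records are dicts loaded from the JSONL file. `select_training_records` and `HIRAGANA_ONLY` are hypothetical names, and the hiragana-only test is a heuristic on the `text` field rather than a flag provided by the dataset.

```python
import re

# Heuristic: text made only of hiragana, iteration marks, long mark, and spaces.
HIRAGANA_ONLY = re.compile(r"[\u3041-\u3096\u309d\u309e\u30fc\s]+")


def select_training_records(records, drop_hiragana_only=False):
    """Keep records that are not flagged needs_review.

    With drop_hiragana_only=True, also drop records whose text is written
    entirely in hiragana (such as the phonetic Bashō rows).
    """
    kept = []
    for rec in records:
        if rec["annotation_status"] == "needs_review":
            continue
        if drop_hiragana_only and HIRAGANA_ONLY.fullmatch(rec["text"]):
            continue
        kept.append(rec)
    return kept
```

The hiragana filter is off by default because the card describes those rows as intentionally retained; whether to drop them depends on the training setup.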
---
## Citation
```bibtex
@dataset{nz_traditional_haiku_2026,
title = {Traditional Japanese Haiku Dataset (Edo--Meiji Era)},
author = {Rootport},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Rootport/nz-traditional-haiku},
note = {Pre-release draft. 12,787 records from 7 public-domain poets.}
}
```
Provider:
Rootport



