katsukiono/kana-kanji-pairs
Source: Hugging Face. Updated 2026-01-04, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/katsukiono/kana-kanji-pairs
Resource description:
---
language:
- ja
license:
- bsd-3-clause
- cc-by-sa-4.0
- apache-2.0
task_categories:
- text-generation
tags:
- japanese
- kana
- kanji
- ime
- input-method
size_categories:
- 1M<n<10M
---
# kana-kanji-pairs
A dataset of Japanese kana-to-kanji conversion candidates: each entry pairs a kana reading with the written forms it can convert to.
## Overview
| Metric | Value |
|--------|-------|
| Total pairs | 1,124,675 |
| File size | ~112 MB |
| Format | JSONL |
### Candidate Distribution
| Candidates per entry (n) | Entries | % of total |
|------------|---------|---|
| n>=2 | 363,708 | 32.3% |
| n>=5 | 40,929 | 3.6% |
| n>=10 | 9,401 | 0.8% |
| n>=20 | 2,448 | 0.2% |
| n>=100 | 34 | <0.1% |
| max | 259 | - |
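The thresholds in this table can be recomputed from the `count` field of each record. The following is a minimal sketch, assuming the records are already available as dictionaries with the fields described in this card; the function name `candidate_distribution` and the toy records are illustrative and not part of the dataset.

```python
from typing import Iterable, Mapping, Sequence


def candidate_distribution(
    records: Iterable[Mapping],
    thresholds: Sequence[int] = (2, 5, 10, 20, 100),
) -> dict:
    """Count how many records have at least n candidates for each threshold n."""
    at_least = {n: 0 for n in thresholds}
    total = 0
    max_count = 0
    for record in records:
        count = record["count"]
        total += 1
        max_count = max(max_count, count)
        for n in thresholds:
            if count >= n:
                at_least[n] += 1
    return {"total": total, "max": max_count, "at_least": at_least}


# Hypothetical toy records, not taken from the dataset.
sample = [
    {"input": "かがく", "count": 4},
    {"input": "ねこ", "count": 1},
    {"input": "こうしょう", "count": 12},
]
stats = candidate_distribution(sample)
```

Dividing each `at_least` value by `total` gives the percentage column.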
## Data Sources
| Source | Entries | Description |
|--------|---------|-------------|
| `mozc` | 753,628 | Google mozc dictionary |
| `jmdict` | 221,228 | JMdict Japanese-Multilingual Dictionary |
| `wikipedia_mecab` | 94,938 | Wikipedia text analyzed with MeCab+UniDic |
| `sudachi` | 30,937 | SudachiDict core vocabulary |
| `wikipedia_ruby` | 23,944 | Wikipedia ruby annotations |
## Data Format
### Main dataset (data/train.jsonl)
```json
{
  "input": "かがく",
  "output": ["科学", "化学", "下顎", "価額"],
  "source": "wikipedia_mecab",
  "count": 4
}
```
### Wikipedia dataset (wikipedia/*.jsonl)
```json
{
  "input": "かがく",
  "output": ["科学", "化学", "下顎", "価額"],
  "source": "wikipedia_mecab",
  "count": 4,
  "frequencies": {"科学": 303774, "化学": 122431, ...}
}
```
### Fields
| Field | Description |
|-------|-------------|
| `input` | Reading (hiragana, may include ・ゔヽヾゝゞ) |
| `output` | Conversion candidates (ordered by frequency) |
| `source` | Data source identifier |
| `count` | Number of candidates |
| `frequencies` | Occurrence counts (wikipedia/*.jsonl only) |
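For records that carry `frequencies`, the raw counts can be turned into relative shares, for example to weight candidates when ranking. This is a sketch only: the helper name `candidate_shares` is hypothetical, the sample counts are made up for illustration, and records without `frequencies` (such as those in `data/train.jsonl`) simply yield an empty result.

```python
def candidate_shares(record: dict) -> dict:
    """Return each candidate's share of the total occurrence count.

    Returns an empty dict when the record has no usable frequencies.
    """
    freqs = record.get("frequencies") or {}
    total = sum(freqs.get(candidate, 0) for candidate in record["output"])
    if total == 0:
        return {}
    return {
        candidate: freqs.get(candidate, 0) / total
        for candidate in record["output"]
    }


# Hypothetical record with made-up counts, shaped like wikipedia/*.jsonl entries.
record = {
    "input": "かがく",
    "output": ["科学", "化学"],
    "source": "wikipedia_mecab",
    "count": 2,
    "frequencies": {"科学": 3, "化学": 1},
}
shares = candidate_shares(record)
```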
## Files
```
data/
└── train.jsonl # Full dataset (1,124,675 entries)
wikipedia/
├── mecab.jsonl # Wikipedia MeCab with frequencies (94,938 entries)
└── ruby.jsonl # Wikipedia Ruby with frequencies (23,944 entries)
old/
└── mozc_n10_20260102.jsonl # Legacy mozc-only data (753,628 entries, n<=10)
```
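Because every file is JSONL, a local copy can also be read with the standard library alone. The sketch below parses one JSON object per line and checks the documented fields; the name `read_pairs` is hypothetical, and the check that `count` equals the length of `output` is an assumption drawn from the example records in this card, not a guarantee stated by the dataset.

```python
import io
import json

REQUIRED_FIELDS = ("input", "output", "source", "count")


def read_pairs(lines):
    """Yield parsed records from an iterable of JSONL lines, skipping blank lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        # Assumption: count is the number of candidates in output.
        if record["count"] != len(record["output"]):
            raise ValueError(f"line {line_number}: count does not match output length")
        yield record


# In-memory stand-in for a local file such as data/train.jsonl.
sample = io.StringIO(
    '{"input": "かがく", "output": ["科学", "化学", "下顎", "価額"], '
    '"source": "wikipedia_mecab", "count": 4}\n'
)
records = list(read_pairs(sample))
```

With a downloaded copy, the same function accepts an open file handle, for instance one opened with `encoding="utf-8"`.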
## Usage
```python
from datasets import load_dataset

# Load the full dataset; the records are in the "train" split
dataset = load_dataset("katsukiono/kana-kanji-pairs")

# Filter by source
mozc_data = dataset["train"].filter(lambda x: x["source"] == "mozc")
wiki_data = dataset["train"].filter(lambda x: x["source"].startswith("wikipedia"))

# Load the Wikipedia MeCab file, which includes the frequencies field
wiki_mecab = load_dataset("katsukiono/kana-kanji-pairs", data_files="wikipedia/mecab.jsonl")
```
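For IME-style lookup, the records can be folded into a dictionary keyed by reading. The sketch below assumes that the same reading may appear in more than one record (for example, once per source), which this card does not state explicitly; the function name `build_lookup` and the sample records are illustrative. Candidates are merged in order of first appearance, which is not necessarily frequency order across sources.

```python
from collections import defaultdict


def build_lookup(records) -> dict:
    """Map each reading to its de-duplicated candidates, in first-seen order."""
    merged = defaultdict(list)
    seen = defaultdict(set)
    for record in records:
        reading = record["input"]
        for candidate in record["output"]:
            if candidate not in seen[reading]:
                seen[reading].add(candidate)
                merged[reading].append(candidate)
    return dict(merged)


# Hypothetical records; the split across sources is made up for illustration.
sample = [
    {"input": "かがく", "output": ["科学", "化学"], "source": "mozc", "count": 2},
    {"input": "かがく", "output": ["化学", "下顎"], "source": "jmdict", "count": 2},
]
lookup = build_lookup(sample)
```

The same function works on `dataset["train"]`, since iterating over that split yields dictionaries with these fields.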
## Licenses
This dataset combines data from multiple sources with different licenses:
| Source | License |
|--------|---------|
| mozc | BSD 3-Clause |
| jmdict | CC BY-SA 4.0 |
| sudachi | Apache 2.0 |
| wikipedia_mecab | CC BY-SA 4.0 |
| wikipedia_ruby | CC BY-SA 4.0 |
See the [licenses/](licenses/) directory for full license texts.
### Terms of Use
- Attribution required for CC BY-SA sources
- Include copyright notices for BSD/Apache sources
- ShareAlike: derivatives of CC BY-SA content must use the same license
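When working with a subset of the data, it can help to list which licenses that subset falls under. The following sketch encodes the license table above as a mapping; the names `SOURCE_LICENSES` and `licenses_for` are hypothetical helpers, not part of the dataset.

```python
# License per source identifier, copied from the table in this card.
SOURCE_LICENSES = {
    "mozc": "BSD 3-Clause",
    "jmdict": "CC BY-SA 4.0",
    "sudachi": "Apache 2.0",
    "wikipedia_mecab": "CC BY-SA 4.0",
    "wikipedia_ruby": "CC BY-SA 4.0",
}


def licenses_for(records) -> list:
    """Return the sorted set of licenses covering the given records."""
    return sorted({SOURCE_LICENSES[record["source"]] for record in records})


# Hypothetical subset containing three sources.
subset = [
    {"source": "mozc"},
    {"source": "jmdict"},
    {"source": "wikipedia_ruby"},
]
needed = licenses_for(subset)
```

An unknown `source` value raises `KeyError`, which surfaces records whose license is not covered by the table.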
## Source Repositories
- mozc: https://github.com/google/mozc
- JMdict: https://www.edrdg.org/jmdict/j_jmdict.html
- SudachiDict: https://github.com/WorksApplications/SudachiDict
- Wikipedia: https://dumps.wikimedia.org/jawiki/
- MeCab: https://taku910.github.io/mecab/
- UniDic: https://clrd.ninjal.ac.jp/unidic/
Provider:
katsukiono



