
katsukiono/kana-kanji-pairs

Source: Hugging Face. Updated 2026-01-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/katsukiono/kana-kanji-pairs
Description:
---
language:
- ja
license:
- bsd-3-clause
- cc-by-sa-4.0
- apache-2.0
task_categories:
- text-generation
tags:
- japanese
- kana
- kanji
- ime
- input-method
size_categories:
- 1M<n<10M
---

# kana-kanji-pairs

Japanese kana-to-kanji conversion candidate dataset.

## Overview

| Metric | Value |
|--------|-------|
| Total pairs | 1,124,675 |
| File size | ~112MB |
| Format | JSONL |

### Candidate Distribution

| Candidates | Entries | % |
|------------|---------|---|
| n>=2 | 363,708 | 32.3% |
| n>=5 | 40,929 | 3.6% |
| n>=10 | 9,401 | 0.8% |
| n>=20 | 2,448 | 0.2% |
| n>=100 | 34 | <0.1% |
| max | 259 | - |

## Data Sources

| Source | Entries | Description |
|--------|---------|-------------|
| `mozc` | 753,628 | Google mozc dictionary |
| `jmdict` | 221,228 | JMdict Japanese-Multilingual Dictionary |
| `wikipedia_mecab` | 94,938 | Wikipedia text analyzed with MeCab+UniDic |
| `sudachi` | 30,937 | SudachiDict core vocabulary |
| `wikipedia_ruby` | 23,944 | Wikipedia ruby annotations |

## Data Format

### Main dataset (data/train.jsonl)

```json
{
  "input": "かがく",
  "output": ["科学", "化学", "下顎", "価額"],
  "source": "wikipedia_mecab",
  "count": 4
}
```

### Wikipedia dataset (wikipedia/*.jsonl)

```json
{
  "input": "かがく",
  "output": ["科学", "化学", "下顎", "価額"],
  "source": "wikipedia_mecab",
  "count": 4,
  "frequencies": {"科学": 303774, "化学": 122431, ...}
}
```

### Fields

| Field | Description |
|-------|-------------|
| `input` | Reading (hiragana, may include ・ゔヽヾゝゞ) |
| `output` | Conversion candidates (ordered by frequency) |
| `source` | Data source identifier |
| `count` | Number of candidates |
| `frequencies` | Occurrence counts (wikipedia/*.jsonl only) |

## Files

```
data/
└── train.jsonl                # Full dataset (1,124,675 entries)
wikipedia/
├── mecab.jsonl                # Wikipedia MeCab with frequencies (94,938 entries)
└── ruby.jsonl                 # Wikipedia Ruby with frequencies (23,944 entries)
old/
└── mozc_n10_20260102.jsonl    # Legacy mozc-only data (753,628 entries, n<=10)
```

## Usage

```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("katsukiono/kana-kanji-pairs")

# Filter by source
mozc_data = [x for x in dataset["train"] if x["source"] == "mozc"]
wiki_data = [x for x in dataset["train"] if x["source"].startswith("wikipedia")]

# Load Wikipedia with frequencies
wiki_mecab = load_dataset("katsukiono/kana-kanji-pairs", data_files="wikipedia/mecab.jsonl")
```

## Licenses

This dataset combines data from multiple sources with different licenses:

| Source | License |
|--------|---------|
| mozc | BSD 3-Clause |
| jmdict | CC BY-SA 4.0 |
| sudachi | Apache 2.0 |
| wikipedia_mecab | CC BY-SA 4.0 |
| wikipedia_ruby | CC BY-SA 4.0 |

See [licenses/](licenses/) directory for full license texts.

### Terms of Use

- Attribution required for CC BY-SA sources
- Include copyright notices for BSD/Apache sources
- ShareAlike: derivatives of CC BY-SA content must use same license

## Source Repositories

- mozc: https://github.com/google/mozc
- JMdict: https://www.edrdg.org/jmdict/j_jmdict.html
- SudachiDict: https://github.com/WorksApplications/SudachiDict
- Wikipedia: https://dumps.wikimedia.org/jawiki/
- MeCab: https://taku910.github.io/mecab/
- UniDic: https://clrd.ninjal.ac.jp/unidic/
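The record layout described under Data Format can also be consumed with the Python standard library alone, for example after downloading `data/train.jsonl` by hand. The following is a minimal sketch, not part of the dataset's own tooling: the helper names (`parse_records`, `candidate_distribution`, `build_lookup`) are hypothetical, and the second sample record is an illustrative assumption rather than a line copied from the dataset. Whether one reading can appear in more than one record is not stated in the card, so `build_lookup` merges candidates defensively.

```python
import json

# Sample lines in the record shape shown under "Data Format".
# The first record is the card's own example; the second is a
# hypothetical record added only to exercise the code.
SAMPLE_JSONL = """\
{"input": "かがく", "output": ["科学", "化学", "下顎", "価額"], "source": "wikipedia_mecab", "count": 4}
{"input": "ねこ", "output": ["猫"], "source": "jmdict", "count": 1}
"""


def parse_records(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def candidate_distribution(records, thresholds=(2, 5, 10)):
    """For each threshold n, count the records with at least n candidates."""
    return {
        n: sum(1 for r in records if len(r["output"]) >= n)
        for n in thresholds
    }


def build_lookup(records):
    """Map each reading to its candidates, keeping first-seen order.

    If the same reading occurs in several records, their candidate
    lists are merged without duplicates.
    """
    lookup = {}
    for record in records:
        merged = lookup.setdefault(record["input"], [])
        for candidate in record["output"]:
            if candidate not in merged:
                merged.append(candidate)
    return lookup


records = parse_records(SAMPLE_JSONL)
distribution = candidate_distribution(records)
lookup = build_lookup(records)
```

To run the same functions on the real file, one could pass the contents of `data/train.jsonl` (read with `encoding="utf-8"`) to `parse_records`; with the full thresholds `(2, 5, 10, 20, 100)` the result should be comparable to the Candidate Distribution table, assuming that table was computed from the length of `output`.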
Provider:
katsukiono