MarcoDotIO/jpn-bench
Source: Hugging Face. Updated 2026-04-25; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/MarcoDotIO/jpn-bench
Resource description:
---
license: cc-by-4.0
language:
- ja
pretty_name: JPN-Bench
tags:
- japanese
- benchmark
- tokenizer-evaluation
- language-model-evaluation
- kotodamalm
task_categories:
- text-generation
- translation
- question-answering
configs:
- config_name: literacy
  data_files:
  - split: dev
    path: data/literacy_items.jsonl
- config_name: seed
  data_files:
  - split: seed
    path: data/seed_items.jsonl
- config_name: benchmark_sources
  data_files:
  - split: benchmark
    path: data/source_materials/benchmark_sources.jsonl
- config_name: benchmark_text
  data_files:
  - split: benchmark
    path: data/source_materials/benchmark_text.jsonl
---
# JPN-Bench
JPN-Bench is a Japanese literacy benchmark for tokenizer evaluation and future
Japanese LLM evaluation. This public release contains a small curated
tokenizer-literacy dev set plus benchmark-lane source material manifests kept
separate from tokenizer training material.
This dataset is grouped with the KotodamaLM tokenizer work in the Hugging Face
collection "KotodamaLM Japanese Language Infrastructure".
## Files
- `data/literacy_items.jsonl`: 60 tokenizer-literacy items balanced across
`n5`, `n4`, and `above_n4`.
- `data/seed_items.jsonl`: 3 schema smoke-test items.
- `data/source_materials/benchmark_sources.jsonl`: source-file manifest for
benchmark candidate material.
- `data/source_materials/benchmark_text.jsonl`: extracted benchmark-lane text.
- `spec.md`: benchmark schema and scoring intent.
- `scale_plan.md`: roadmap toward Lite, Verified, Full, SWE-sized, and Hidden
tiers.
- `scale_targets.json`: machine-readable target counts.
- `source_separation.md`: rules that keep benchmark and tokenizer-training
material apart.
## Loading
```python
from datasets import load_dataset

literacy = load_dataset(
    "MarcoDotIO/jpn-bench",
    "literacy",
    split="dev",
)
```
To load the files explicitly from a local clone of the repository (paths are relative to its root):
```python
from datasets import load_dataset

# Each key in data_files becomes a split name in the returned DatasetDict.
dataset = load_dataset(
    "json",
    data_files={
        "literacy": "data/literacy_items.jsonl",
        "seed": "data/seed_items.jsonl",
        "benchmark_sources": "data/source_materials/benchmark_sources.jsonl",
        "benchmark_text": "data/source_materials/benchmark_text.jsonl",
    },
)
```
## Separation Policy
Do not train tokenizers or LLMs on the public JPN-Bench files if you intend to
report JPN-Bench scores. The paired KotodamaLM tokenizer was trained from the
training lane and then aggressively filtered to remove any released JPN-Bench
target surfaces.
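If you want to audit your own training text against the released target surfaces, a minimal sketch might look like the following. It is not the filter used for KotodamaLM, which this README does not describe; the function name `find_leaked_surfaces` and the sample strings are hypothetical, and plain substring matching is an assumption.

```python
def find_leaked_surfaces(training_lines, target_surfaces):
    """Return the target surfaces that occur as substrings of any training line."""
    leaked = set()
    for line in training_lines:
        for surface in target_surfaces:
            if surface in line:
                leaked.add(surface)
    return sorted(leaked)


# Hypothetical sample data, for illustration only.
training_lines = ["今日は天気がいいです。", "図書館で本を読みます。"]
target_surfaces = ["天気", "新聞"]

leaked = find_leaked_surfaces(training_lines, target_surfaces)
print(leaked)
```

Any surface reported here would need to be removed from the training text before reporting JPN-Bench scores.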
The source split is file-level and deterministic. Whole source files are
assigned to either the tokenizer-training lane or benchmark lane, preventing
neighboring sentences from the same source file from landing in both places.
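The README does not say how files are assigned to lanes. One common way to get a deterministic file-level split is to hash the file name, sketched below under that assumption. The function `assign_lane`, the `benchmark_fraction` parameter, and the file names are all hypothetical.

```python
import hashlib


def assign_lane(source_file: str, benchmark_fraction: float = 0.2) -> str:
    """Map a whole source file to one lane using a stable hash of its name."""
    digest = hashlib.sha256(source_file.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer in [0, 2**64) and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "benchmark" if bucket < benchmark_fraction else "tokenizer_training"


# Hypothetical file names, for illustration only.
files = ["corpus_a_001.psd", "corpus_a_002.psd", "corpus_b_001.psd"]
lanes = {name: assign_lane(name) for name in files}
for name, lane in lanes.items():
    print(f"{name} -> {lane}")
```

Because the lane depends only on the file name, every sentence from a given file lands in the same lane on every run.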
## Current Scope
The public literacy set is intentionally small:
- 60 dev items
- 209 tokenization target surfaces
- 20 items each for `n5`, `n4`, and `above_n4`
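The balance described above can be checked with a short script. This is a sketch only: the field name `level` is an assumption (see `spec.md` for the actual schema), and the items below are synthetic stand-ins rather than real benchmark records.

```python
from collections import Counter


def level_counts(items, level_key="level"):
    """Count items per level; `level_key` is a hypothetical field name."""
    return Counter(item[level_key] for item in items)


def is_balanced(items, level_key="level"):
    """True when every level that appears has the same number of items."""
    counts = level_counts(items, level_key)
    return len(set(counts.values())) == 1


# Synthetic stand-in for the 60 dev items: 20 per level.
items = [
    {"level": level}
    for level in ("n5", "n4", "above_n4")
    for _ in range(20)
]

counts = level_counts(items)
print(dict(counts), is_balanced(items))
```

With the real data, `items` could be the rows of the `literacy` config once the correct field name is known.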
The scale plan targets:
- 300-item Lite
- 500-item Verified
- 2,500-item Full
- optional 2,294-item SWE-sized Full slice
- private Hidden tier for saturation checks
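To put the targets in proportion, the gap between the current 60 public items and each tier can be computed directly. The dictionary below copies the numbers from the list above; it is not the schema of `scale_targets.json`, which this README does not show, and it assumes the current dev items would count toward each tier.

```python
# Tier sizes copied from the list above; the key names are hypothetical.
TARGETS = {"lite": 300, "verified": 500, "full": 2500, "swe_sized": 2294}
CURRENT_PUBLIC_ITEMS = 60


def items_remaining(targets, current):
    """Return how many more items each tier needs beyond `current`."""
    return {tier: max(size - current, 0) for tier, size in targets.items()}


remaining = items_remaining(TARGETS, CURRENT_PUBLIC_ITEMS)
for tier, gap in remaining.items():
    print(f"{tier}: {gap} more items needed")
```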
## Sources And Attribution
Benchmark-lane source material is derived from CC BY 4.0 Japanese corpora:
- NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
- Kainoki / Open National Corpus of Japanese related source materials
See the JSONL source manifests for per-source URL, license, and citation fields.
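As a rough sketch of how the manifest could be summarized for attribution, the snippet below groups source URLs by license. The key names `license` and `url` are assumptions based on the description above, not confirmed field names, and the sample records are made up.

```python
import json
from collections import defaultdict


def group_sources_by_license(jsonl_lines, license_key="license", url_key="url"):
    """Group source URLs by license; both key names are hypothetical."""
    grouped = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        grouped[record.get(license_key, "unknown")].append(record.get(url_key))
    return dict(grouped)


# Made-up manifest lines, for illustration only.
sample_lines = [
    '{"url": "https://example.org/source-a", "license": "CC BY 4.0"}',
    '{"url": "https://example.org/source-b", "license": "CC BY 4.0"}',
    "",
]

by_license = group_sources_by_license(sample_lines)
print(by_license)
```

To run it on the real manifest, pass an open file handle for `data/source_materials/benchmark_sources.jsonl` in place of `sample_lines`, after adjusting the key names to match the file.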
## Citation
```bibtex
@dataset{jpn_bench_2026,
  title = {JPN-Bench},
  author = {MarcoDotIO},
  year = {2026},
  publisher = {Hugging Face},
  license = {CC-BY-4.0}
}
```
Provider:
MarcoDotIO



