
MarcoDotIO/jpn-bench

Source: Hugging Face. Updated 2026-04-25; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/MarcoDotIO/jpn-bench
Resource overview:
---
license: cc-by-4.0
language:
- ja
pretty_name: JPN-Bench
tags:
- japanese
- benchmark
- tokenizer-evaluation
- language-model-evaluation
- kotodamalm
task_categories:
- text-generation
- translation
- question-answering
configs:
- config_name: literacy
  data_files:
  - split: dev
    path: data/literacy_items.jsonl
- config_name: seed
  data_files:
  - split: seed
    path: data/seed_items.jsonl
- config_name: benchmark_sources
  data_files:
  - split: benchmark
    path: data/source_materials/benchmark_sources.jsonl
- config_name: benchmark_text
  data_files:
  - split: benchmark
    path: data/source_materials/benchmark_text.jsonl
---

# JPN-Bench

JPN-Bench is a Japanese literacy benchmark for tokenizer evaluation and future Japanese LLM evaluation. This public release contains a small curated tokenizer-literacy dev set plus benchmark-lane source material manifests kept separate from tokenizer training material.

This dataset is grouped with the KotodamaLM tokenizer work in the Hugging Face collection "KotodamaLM Japanese Language Infrastructure".

## Files

- `data/literacy_items.jsonl`: 60 tokenizer-literacy items balanced across `n5`, `n4`, and `above_n4`.
- `data/seed_items.jsonl`: 3 schema smoke-test items.
- `data/source_materials/benchmark_sources.jsonl`: source-file manifest for benchmark candidate material.
- `data/source_materials/benchmark_text.jsonl`: extracted benchmark-lane text.
- `spec.md`: benchmark schema and scoring intent.
- `scale_plan.md`: roadmap toward Lite, Verified, Full, SWE-sized, and Hidden tiers.
- `scale_targets.json`: machine-readable target counts.
- `source_separation.md`: rules that keep benchmark and tokenizer-training material apart.
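The JSONL files listed under Files can also be inspected with the Python standard library alone, for example to confirm that the literacy set is balanced across levels. The following is a minimal sketch, not part of the dataset tooling: the helper `count_by_field` is hypothetical, and the field name `level` in the usage comment is an assumption, since this card does not list the item schema (see `spec.md` for the actual field names).

```python
import json
from collections import Counter
from pathlib import Path


def count_by_field(jsonl_path, field):
    """Tally the values of one field across a JSON Lines file.

    Records that lack the field are counted under "<missing>".
    """
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            counts[record.get(field, "<missing>")] += 1
    return counts


# Hypothetical usage; "level" is an assumed field name, check spec.md:
# print(count_by_field("data/literacy_items.jsonl", "level"))
```

If every record lands under `<missing>`, the assumed field name does not match the schema and should be replaced with the one documented in `spec.md`.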
## Loading

```python
from datasets import load_dataset

literacy = load_dataset(
    "MarcoDotIO/jpn-bench",
    "literacy",
    split="dev",
)
```

For local or explicit loading:

```python
from datasets import load_dataset

files = {
    "literacy": "data/literacy_items.jsonl",
    "seed": "data/seed_items.jsonl",
    "benchmark_sources": "data/source_materials/benchmark_sources.jsonl",
    "benchmark_text": "data/source_materials/benchmark_text.jsonl",
}

# Load each file on its own. Passing all four as splits of a single
# "json" load raises a column-mismatch error if their schemas differ.
datasets_by_name = {
    name: load_dataset("json", data_files=path, split="train")
    for name, path in files.items()
}
```

## Separation Policy

Do not train tokenizers or LLMs on the public JPN-Bench files if you intend to report JPN-Bench scores. The paired KotodamaLM tokenizer was trained from the training lane and then aggressively filtered to remove any released JPN-Bench target surfaces.

The source split is file-level and deterministic. Each source file is assigned as a whole to either the tokenizer-training lane or the benchmark lane, which prevents neighboring sentences from the same source file from landing in both places.

## Current Scope

The public literacy set is intentionally small:

- 60 dev items
- 209 tokenization target surfaces
- 20 items each for `n5`, `n4`, and `above_n4`

The scale plan targets:

- 300-item Lite
- 500-item Verified
- 2,500-item Full
- optional 2,294-item SWE-sized Full slice
- private Hidden tier for saturation checks

## Sources And Attribution

Benchmark-lane source material is derived from CC BY 4.0 Japanese corpora:

- NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
- Kainoki / Open National Corpus of Japanese related source materials

See the JSONL source manifests for per-source URL, license, and citation fields.

## Citation

```bibtex
@dataset{jpn_bench_2026,
  title = {JPN-Bench},
  author = {MarcoDotIO},
  year = {2026},
  publisher = {Hugging Face},
  license = {CC-BY-4.0}
}
```
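The file-level, deterministic split described under Separation Policy can be illustrated with a hash-based assignment. This is a sketch under stated assumptions, not the procedure used to build JPN-Bench: the hashing scheme, the `benchmark_fraction` parameter, the lane names, and the `source_file` record key are all hypothetical, and `source_separation.md` remains the authoritative description of the rules.

```python
import hashlib


def assign_lane(source_file, benchmark_fraction=0.1):
    """Map a whole source file to exactly one lane, deterministically.

    Assumption: the lane is derived from a SHA-256 hash of the file name,
    so the same name always yields the same lane on every run and machine.
    """
    digest = hashlib.sha256(source_file.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # integer in [0, 2**64)
    threshold = int(benchmark_fraction * 2**64)     # cut-off for the benchmark lane
    return "benchmark" if bucket < threshold else "tokenizer_training"


def split_records(records, benchmark_fraction=0.1):
    """Group records by the lane of their source file.

    Each record is a dict with a hypothetical "source_file" key. Because the
    lane depends only on the file name, sentences from one file never end up
    in both lanes.
    """
    lanes = {"benchmark": [], "tokenizer_training": []}
    for record in records:
        lane = assign_lane(record["source_file"], benchmark_fraction)
        lanes[lane].append(record)
    return lanes
```

Keying the decision on the file rather than on the individual sentence is what keeps neighboring sentences together; a sentence-level random split would not give that guarantee.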
Provider:
MarcoDotIO