omneity-labs/spellbench

Name: omneity-labs/spellbench
Creator: omneity-labs
Published: 2026-04-16 22:11:29
License: no description available

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/omneity-labs/spellbench


Resource description:

---
language:
- multilingual
- en
- fr
- de
- es
- it
- pt
- tr
- pl
- cs
- ro
- sv
- no
- da
- nl
- fi
- ru
- ar
- hi
- ko
- ja
- sw
- id
- yo
- bg
- mr
- el
- hy
- ka
- th
- he
license: apache-2.0
task_categories:
- text-generation
tags:
- linguistic
- character-level
- spelling
- evaluation
- benchmark
pretty_name: "SpellBench"
size_categories:
- 10K<n<100K
---

# SpellBench: Linguistic Character-Level Evaluation Benchmark

**SpellBench** is a benchmark for evaluating how well language models handle character-level and word-level linguistic operations. It includes **29,700** items across diverse tasks, with granular tracking of **language** and **script** for each sample.

Most LLMs operate at the token level and struggle with tasks that require reasoning about individual characters, such as spelling, reversing, and counting letters. SpellBench provides a standardized way to measure this across diverse scripts, including cases with diacritics (tashkeel), tonal marks, and transliterations.

## Quick Start

```python
from datasets import load_dataset

# Load the test split for evaluation
test_ds = load_dataset("omneity-labs/spellbench", split="test")

# Load the train split for few-shot prompting or fine-tuning
train_ds = load_dataset("omneity-labs/spellbench", split="train")
```

## Dataset Splits

| Split | Items | Words/task/language |
|-------|-------|---------------------|
| **train** | 14,850 | 50 |
| **test** | 14,850 | 50 |
| **total** | **29,700** | n/a |

The splits are generated by partitioning the word pool for each language 50/50 **before** generating task items, so a word that appears in a test example never appears in any training example for that language.

## Tasks

SpellBench currently contains **14 implemented tasks** across two categories.
### Word-Level (9 tasks)

| Task | Description | Example Input | Example Expected | Script Restriction |
|------|-------------|---------------|------------------|-----------|
| `spell` | Spell letter by letter with dashes | `hello` | `h-e-l-l-o` | all scripts |
| `reverse` | Reverse the characters | `hello` | `olleh` | all scripts |
| `word_length` | Count characters | `hello` | `5` | all scripts |
| `first_letter` | First character | `hello` | `h` | all scripts |
| `last_letter` | Last character | `hello` | `o` | all scripts |
| `is_palindrome` | Check if palindrome | `racecar` | `true` | all scripts |
| `vowel_count` | Count Latin vowels (a,e,i,o,u) | `hello` | `2` | Latin only |
| `consonant_count` | Count Latin consonants | `hello` | `3` | Latin only |
| `remove_vowels` | Strip Latin vowels | `hello` | `hll` | Latin only |

### Sentence-Level (5 tasks)

| Task | Description | Example Input | Example Expected |
|------|-------------|---------------|------------------|
| `word_count` | Count words | `the quick brown fox` | `4` |
| `sentence_reverse` | Reverse word order | `the quick brown fox` | `fox brown quick the` |
| `longest_word` | Find longest word | `the quick brown fox` | `quick` |
| `shortest_word` | Find shortest word | `the quick brown fox` | `the` |
| `alphabetical_order` | Sort words A→Z | `the quick brown fox` | `brown fox quick the` |

## Item Format

Each item in `test.jsonl` is a JSON object:

```json
{
  "id": "test_spell_00042",
  "task": "spell",
  "input": "strawberry",
  "expected": "s-t-r-a-w-b-e-r-r-y",
  "metadata": {
    "language": "English",
    "script": "Latin",
    "word": "strawberry"
  }
}
```

- **`id`**: Unique identifier
- **`task`**: Task name
- **`input`**: The input to present to the model
- **`expected`**: The ground-truth answer
- **`metadata`**: Includes `language` and `script` for per-lang analysis.
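
As a rough illustration of how the `metadata` fields can drive a per-script breakdown, the sketch below tallies accuracy by `script` for items in the format shown above. The `normalize` and `accuracy_by_script` helpers and the `predictions` mapping are hypothetical names introduced here, and the comparison is deliberately simplified (whitespace stripping and case-folding only); it is not the pipeline used by the provided evaluation scripts.

```python
from collections import defaultdict
import json


def normalize(text: str) -> str:
    """Simplified comparison key (assumption): trim whitespace and case-fold."""
    return text.strip().casefold()


def accuracy_by_script(items, predictions):
    """Return {script: accuracy} for items and a {item id: model answer} mapping."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        script = item["metadata"]["script"]
        total[script] += 1
        # A missing prediction counts as a wrong answer.
        answer = predictions.get(item["id"], "")
        if normalize(answer) == normalize(item["expected"]):
            correct[script] += 1
    return {script: correct[script] / total[script] for script in total}


# One JSONL line in the item format documented above.
line = (
    '{"id": "test_spell_00042", "task": "spell", "input": "strawberry", '
    '"expected": "s-t-r-a-w-b-e-r-r-y", '
    '"metadata": {"language": "English", "script": "Latin", "word": "strawberry"}}'
)
items = [json.loads(line)]

# Hypothetical model output keyed by item id.
predictions = {"test_spell_00042": " S-T-R-A-W-B-E-R-R-Y\n"}
print(accuracy_by_script(items, predictions))  # {'Latin': 1.0}
```

Swapping `item["metadata"]["script"]` for `item["metadata"]["language"]` gives the same breakdown per language.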
## Evaluation Heuristics

To handle the conversational nature of LLMs, the provided evaluation scripts (`run_eval_transformers.py`, `run_eval_openai.py`) use a multi-stage **Extraction & Normalization** pipeline rather than simple string matching:

### 1. Answer Extraction

Before comparison, the scripts attempt to isolate the model's intended answer:

- **JSON Parsing**: If the model outputs a JSON block, it extracts the value from the `"answer"` or `"result"` keys.
- **Numeric Selection**: For counting tasks (e.g., `word_length`), it extracts the **last** standalone number in the response to ignore conversational filler.
- **Boolean Mapping**: Detects "true/yes" or "false/no" and maps them to canonical `true` or `false`.
- **Character Isolation**: For letter-based tasks, it looks for characters inside single/double quotes or standalone characters.
- **Order Mapping**: For `compare_lengths`, it maps linguistic descriptions like "the second word" to canonical markers (`second`).
- **Fallback**: Defaults to the **last non-empty line** of the response.

### 2. Normalization

Once an answer is extracted, it is normalized to ensure fairness:

- **Case Insensitivity**: All comparisons are case-normalized.
- **Collection Normalization**: For tasks returning lists (e.g., `unique_letters`), the script sorts the characters and ignores separator differences (commas vs spaces) to compare set content rather than formatting.
- **Whitespace Stripping**: Strict stripping of leading/trailing whitespace.

## Prompts

Reference prompt templates for each task are in `data/prompts.json`. These are suggestions; you can use your own prompts.
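
A minimal sketch of filling one of these templates, assuming each task entry carries a `prompts` list whose strings contain an `{input}` placeholder. `PROMPTS` is an inline stand-in for the loaded file, and `build_prompt` is a hypothetical helper, not part of the provided scripts.

```python
# Inline stand-in for the parsed contents of data/prompts.json (only `spell` shown).
PROMPTS = {
    "spell": {
        "prompts": [
            "Spell the word '{input}' letter by letter, separating each letter with a dash.",
            "Break the word '{input}' into individual letters separated by hyphens.",
        ],
    }
}


def build_prompt(templates, task, text, variant=0):
    """Fill the chosen template variation for `task` with the item's input."""
    template = templates[task]["prompts"][variant]
    # str.replace avoids str.format errors when the input itself contains braces.
    return template.replace("{input}", text)


print(build_prompt(PROMPTS, "spell", "hello"))
```

Varying `variant` across runs is one way to check whether a model's score depends on prompt wording.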
Each task has 3 template variations:

```json
{
  "spell": {
    "description": "Spell a word letter by letter, separated by dashes",
    "prompts": [
      "Spell the word '{input}' letter by letter, separating each letter with a dash.",
      "Break the word '{input}' into individual letters separated by hyphens.",
      "List each character in '{input}' one by one, using dashes between them."
    ],
    "expected_format": "h-e-l-l-o"
  }
}
```

## Language & Script Coverage

SpellBench is designed to be a truly multilingual benchmark, moving beyond English-centric character evaluation. It currently covers **28,600** items across **24+ language-script combinations**, including:

### Supported Languages & Scripts

| Script | Languages |
| :--- | :--- |
| **Latin** | English, French, German, Spanish, Indonesian, Swahili, Yoruba |
| **Arabic** | Arabic (Modern Standard), Arabic (MSA with Tashkeel/Diacritics), Persian (Farsi) |
| **Arabizi** | Moroccan Arabizi, Egyptian Arabizi (incorporating numbers like 2, 3, 5, 7, 9) |
| **Romanized** | Romanized Russian, Romanized Japanese (Hepburn) |
| **Cyrillic** | Russian, Bulgarian |
| **Devanagari** | Hindi, Marathi |
| **Other Scripts** | Greek, Armenian, Georgian, Korean (Hangul), Thai, Hebrew |

### Special Features

- **Tashkeel (Diacritics)**: A dedicated Arabic split where every word includes full vocalization. This tests whether models can "see" through diacritics to identify core letters or accurately count the diacritics themselves.
- **Arabizi (transliteration with numbers)**: Modern Arabic dialects written in Latin script using numerals to represent sounds not found in English (e.g., `3` for 'ayn, `7` for ha). This is a unique challenge for tokenizers.
- **Tonal Marks & Diacritics**: Extensive coverage of Yoruba (tonal marks) and European accents (French, German, Spanish, etc.).
- **Romanization**: Tests how models handle transliterated concepts (Russian/Japanese) which often have ambiguous tokenization boundaries.
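
To make the diacritics point concrete, the sketch below uses Python's standard `unicodedata` module to separate base letters from combining marks in a fully vocalized Arabic word. `strip_marks` and `count_marks` are hypothetical helpers for illustration; how SpellBench's generator itself counts diacritics is not stated here, so this shows the ambiguity rather than the benchmark's rule.

```python
import unicodedata


def strip_marks(word: str) -> str:
    """Drop combining marks (e.g. tashkeel, tone marks) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_marks(word: str) -> int:
    """Count combining marks in the NFD-decomposed word."""
    decomposed = unicodedata.normalize("NFD", word)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


# "kataba" with full vocalization: kaf, fatha, ta, fatha, ba, fatha.
kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"

print(len(kataba))               # 6 code points, marks included
print(len(strip_marks(kataba)))  # 3 base letters
print(count_marks(kataba))       # 3 diacritics
print(strip_marks("caf\u00e9"))  # cafe
```

NFD also splits precomposed letters such as `é` into a base letter plus a mark, so stripping marks changes the spelling in languages where the accent is part of the letter; treat this as a rough diagnostic only.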
### Per-Language Analysis

Every sample in the dataset includes `language` and `script` metadata. We recommend reporting accuracy scores broken down by script to identify where models may have "blind spots" due to their training data or vocabulary constraints.

## Running the Evaluation

### With a HuggingFace Transformers model

```bash
python run_eval_transformers.py \
    --model "Qwen/Qwen3.5-0.8B" \
    --tasks spell,reverse,word_length \
    --output results.json
```

### With an OpenAI-compatible API

```bash
python run_eval_openai.py \
    --base-url "https://api.openai.com/v1" \
    --model "gpt-5" \
    --tasks spell,reverse \
    --output results.json
```

See the runner scripts for full options.

## Regenerating the Dataset

The dataset is deterministic (seed=42). To regenerate:

```bash
python generate.py          # generates data/
python generate.py --check  # verify only
```

## Citation

```bibtex
@misc{spellbench2026,
  title={SpellBench: Can LLMs spell? A Character-Level Linguistic Evaluation Benchmark},
  author={Omar Kamali},
  year={2026},
  url={https://github.com/omneity-labs/spellbench}
}
```

## License

Code is MIT licensed. Data is CC BY-SA licensed.

Provider:

omneity-labs
