omneity-labs/spellbench
Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/omneity-labs/spellbench
Description:
---
language:
- multilingual
- en
- fr
- de
- es
- it
- pt
- tr
- pl
- cs
- ro
- sv
- no
- da
- nl
- fi
- ru
- ar
- hi
- ko
- ja
- sw
- id
- yo
- bg
- mr
- el
- hy
- ka
- th
- he
license: apache-2.0
task_categories:
- text-generation
tags:
- linguistic
- character-level
- spelling
- evaluation
- benchmark
pretty_name: "SpellBench"
size_categories:
- 10K<n<100K
---
# SpellBench — Linguistic Character-Level Evaluation Benchmark
**SpellBench** is a benchmark for evaluating how well language models handle character-level and word-level linguistic operations. It includes **29,700** items across diverse tasks, with granular tracking of **language** and **script** for each sample.
Most LLMs operate at the token level and struggle with tasks that require reasoning about individual characters — spelling, reversing, counting letters, etc. SpellBench provides a standardized way to measure this across diverse scripts, including cases with diacritics (tashkeel), tonal marks, and transliterations.
## Quick Start
```python
from datasets import load_dataset
# Load the test split for evaluation
test_ds = load_dataset("omneity-labs/spellbench", split="test")
# Load the train split for few-shot prompting or fine-tuning
train_ds = load_dataset("omneity-labs/spellbench", split="train")
```
## Dataset Splits
| Split | Items | Words/task/language |
|-------|-------|---------------------|
| **train** | 14,850 | 50 |
| **test** | 14,850 | 50 |
| **total** | **29,700** | — |
The splits are generated by partitioning the word pool for each language 50/50 **before** generating task items, so a word that appears in a test example is never seen in any training example for that language.
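The split-before-generate idea can be sketched as follows. This is a minimal illustration, not the code in `generate.py`: the helper names `split_word_pool` and `make_items` are hypothetical, and only the seed value and the ID pattern are taken from this card.

```python
import random

def split_word_pool(words, seed=42):
    """Shuffle a de-duplicated word pool and cut it in half (hypothetical helper)."""
    pool = sorted(set(words))            # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]

def make_items(words, split, task, solve):
    """Build task items from one half of the pool only (hypothetical helper)."""
    return [
        {"id": f"{split}_{task}_{i:05d}", "task": task,
         "input": word, "expected": solve(word)}
        for i, word in enumerate(words)
    ]

words = ["apple", "banana", "cherry", "date", "elder", "fig"]
train_words, test_words = split_word_pool(words)
train_items = make_items(train_words, "train", "reverse", lambda w: w[::-1])
test_items = make_items(test_words, "test", "reverse", lambda w: w[::-1])
```

Because the pool is partitioned first, every task generated from `test_words` stays disjoint from the training words, whichever task is applied afterwards.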
## Tasks
SpellBench currently contains **14 implemented tasks** across two categories.
### Word-Level (9 tasks)
| Task | Description | Example Input | Example Expected | Script Restriction |
|------|-------------|---------------|------------------|-----------|
| `spell` | Spell letter by letter with dashes | `hello` | `h-e-l-l-o` | all scripts |
| `reverse` | Reverse the characters | `hello` | `olleh` | all scripts |
| `word_length` | Count characters | `hello` | `5` | all scripts |
| `first_letter` | First character | `hello` | `h` | all scripts |
| `last_letter` | Last character | `hello` | `o` | all scripts |
| `is_palindrome` | Check if palindrome | `racecar` | `true` | all scripts |
| `vowel_count` | Count Latin vowels (a,e,i,o,u) | `hello` | `2` | Latin only |
| `consonant_count` | Count Latin consonants | `hello` | `3` | Latin only |
| `remove_vowels` | Strip Latin vowels | `hello` | `hll` | Latin only |
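For reference, the word-level tasks can be approximated with a few one-liners. This is a sketch under stated assumptions rather than the generator's actual logic: it counts Python code points (which may differ from the dataset's notion of a character when combining marks are present), and it treats only unaccented ASCII letters as Latin vowels or consonants.

```python
import string

VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS  # assumption: ASCII letters only

WORD_TASKS = {
    "spell": lambda w: "-".join(w),
    "reverse": lambda w: w[::-1],
    "word_length": lambda w: str(len(w)),
    "first_letter": lambda w: w[0],
    "last_letter": lambda w: w[-1],
    "is_palindrome": lambda w: "true" if w == w[::-1] else "false",
    "vowel_count": lambda w: str(sum(c in VOWELS for c in w.lower())),
    "consonant_count": lambda w: str(sum(c in CONSONANTS for c in w.lower())),
    "remove_vowels": lambda w: "".join(c for c in w if c.lower() not in VOWELS),
}

def solve_word_task(task, word):
    """Return the expected answer as a string, mirroring the table above."""
    return WORD_TASKS[task](word)
```

Calling `solve_word_task("spell", "hello")` reproduces the first row of the table.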
### Sentence-Level (5 tasks)
| Task | Description | Example Input | Example Expected |
|------|-------------|---------------|------------------|
| `word_count` | Count words | `the quick brown fox` | `4` |
| `sentence_reverse` | Reverse word order | `the quick brown fox` | `fox brown quick the` |
| `longest_word` | Find longest word | `the quick brown fox` | `quick` |
| `shortest_word` | Find shortest word | `the quick brown fox` | `the` |
| `alphabetical_order` | Sort words A→Z | `the quick brown fox` | `brown fox quick the` |
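The sentence-level tasks can be sketched the same way. Two details here are assumptions, not documented behavior: words are split on whitespace, and ties are broken by first occurrence. The tie-break is inferred from the examples, where `quick` and `brown` both have five letters and `the` and `fox` both have three.

```python
def word_count(sentence):
    return str(len(sentence.split()))

def sentence_reverse(sentence):
    return " ".join(reversed(sentence.split()))

def longest_word(sentence):
    # max() returns the first maximal element, so ties go to the earliest word
    return max(sentence.split(), key=len)

def shortest_word(sentence):
    # min() likewise returns the first minimal element
    return min(sentence.split(), key=len)

def alphabetical_order(sentence):
    # sorted() compares by code point; case handling is an assumption
    return " ".join(sorted(sentence.split()))
```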
## Item Format
Each item in `test.jsonl` is a JSON object:
```json
{
"id": "test_spell_00042",
"task": "spell",
"input": "strawberry",
"expected": "s-t-r-a-w-b-e-r-r-y",
"metadata": {
"language": "English",
"script": "Latin",
"word": "strawberry"
}
}
```
- **`id`**: Unique identifier
- **`task`**: Task name
- **`input`**: The input to present to the model
- **`expected`**: The ground-truth answer
- **`metadata`**: Includes `language` and `script` for per-language analysis; the example above also carries the source `word`.
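When reading the JSONL files directly, a small validation step can catch malformed lines early. The sketch below assumes one JSON object per line with the five top-level keys listed above; `parse_item` is a hypothetical helper, not part of the repository.

```python
import json

REQUIRED_KEYS = {"id", "task", "input", "expected", "metadata"}

def parse_item(line):
    """Parse one JSONL line and check that the documented keys are present."""
    item = json.loads(line)
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return item

line = (
    '{"id": "test_spell_00042", "task": "spell", "input": "strawberry", '
    '"expected": "s-t-r-a-w-b-e-r-r-y", '
    '"metadata": {"language": "English", "script": "Latin", "word": "strawberry"}}'
)
item = parse_item(line)
```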
## Evaluation Heuristics
To handle the conversational nature of LLMs, the provided evaluation scripts (`run_eval_transformers.py`, `run_eval_openai.py`) use a multi-stage **Extraction & Normalization** pipeline rather than simple string matching:
### 1. Answer Extraction
Before comparison, the scripts attempt to isolate the model's intent:
- **JSON Parsing**: If the model outputs a JSON block, it extracts the value from the `"answer"` or `"result"` keys.
- **Numeric Selection**: For counting tasks (e.g., `word_length`), it extracts the **last** standalone number in the response to ignore conversational filler.
- **Boolean Mapping**: Detects "true/yes" or "false/no" and maps them to canonical `true` or `false`.
- **Character Isolation**: For letter-based tasks, it looks for characters inside single/double quotes or standalone characters.
- **Order Mapping**: For `compare_lengths` (a task not among the 14 listed above), it maps linguistic descriptions like "the second word" to canonical markers (`second`).
- **Fallback**: Defaults to the **last non-empty line** of the response.
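A simplified version of these stages might look like the sketch below. It is not the code in the runner scripts: the `kind` hint, the regular expressions, and the order of the stages are assumptions, and the order-mapping stage is omitted.

```python
import json
import re

def extract_answer(response, kind="text"):
    """Hypothetical extractor; kind is 'number', 'boolean', 'letter', or 'text'."""
    text = response.strip()

    # JSON parsing: use the "answer" or "result" key if a JSON object is present.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            for key in ("answer", "result"):
                if key in obj:
                    return str(obj[key])

    # Numeric selection: take the last standalone number.
    if kind == "number":
        numbers = re.findall(r"\b\d+\b", text)
        if numbers:
            return numbers[-1]

    # Boolean mapping: take the last yes/no style word.
    if kind == "boolean":
        for word in reversed(re.findall(r"[a-z]+", text.lower())):
            if word in ("true", "yes"):
                return "true"
            if word in ("false", "no"):
                return "false"

    # Character isolation: take the last single quoted character.
    if kind == "letter":
        quoted = re.findall(r"""['"](\w)['"]""", text)
        if quoted:
            return quoted[-1]

    # Fallback: last non-empty line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```

Taking the last number or the last yes/no word reflects the idea that models tend to state their final answer after any intermediate reasoning.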
### 2. Normalization
Once an answer is extracted, it is normalized to ensure fairness:
- **Case Insensitivity**: All comparisons are case-normalized.
- **Collection Normalization**: For tasks returning lists (e.g., `unique_letters`, which is not among the 14 tasks listed above), the script sorts the characters and ignores separator differences (commas vs spaces) to compare set content rather than formatting.
- **Whitespace Stripping**: Strict stripping of leading/trailing whitespace.
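The normalization rules could be sketched as below. The function names and the `is_collection` flag are hypothetical, and the separators handled (whitespace, commas, semicolons) are an assumption about what the scripts accept.

```python
import re

def normalize(answer, is_collection=False):
    """Strip, lowercase, and optionally canonicalize a list-style answer."""
    text = answer.strip().lower()
    if is_collection:
        # Split on commas, semicolons, or whitespace, then sort the parts
        parts = [p for p in re.split(r"[\s,;]+", text) if p]
        return " ".join(sorted(parts))
    return text

def is_correct(predicted, expected, is_collection=False):
    return normalize(predicted, is_collection) == normalize(expected, is_collection)
```

With these rules, `"  Olleh "` matches `olleh`, and `"l, e, h, o"` matches `e h l o` when treated as a collection.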
## Prompts
Reference prompt templates for each task are in `data/prompts.json`. These are suggestions — you can use your own prompts. Each task has 3 template variations:
```json
{
"spell": {
"description": "Spell a word letter by letter, separated by dashes",
"prompts": [
"Spell the word '{input}' letter by letter, separating each letter with a dash.",
"Break the word '{input}' into individual letters separated by hyphens.",
"List each character in '{input}' one by one, using dashes between them."
],
"expected_format": "h-e-l-l-o"
}
}
```
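Filling a template is a plain substitution of the `{input}` placeholder. The sketch below inlines an excerpt of the structure shown above instead of reading `data/prompts.json`, and `build_prompt` is a hypothetical helper. It uses `str.replace` rather than `str.format` so that inputs containing braces are left untouched.

```python
PROMPTS = {
    "spell": {
        "description": "Spell a word letter by letter, separated by dashes",
        "prompts": [
            "Spell the word '{input}' letter by letter, separating each letter with a dash.",
            "Break the word '{input}' into individual letters separated by hyphens.",
            "List each character in '{input}' one by one, using dashes between them.",
        ],
        "expected_format": "h-e-l-l-o",
    }
}

def build_prompt(prompts, task, item_input, variant=0):
    """Pick one of the template variations and substitute the item input."""
    template = prompts[task]["prompts"][variant]
    return template.replace("{input}", item_input)

prompt = build_prompt(PROMPTS, "spell", "strawberry", variant=1)
```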
## Language & Script Coverage
SpellBench is designed to be a truly multilingual benchmark, moving beyond English-centric character evaluation. It currently covers **29,700** items across **24+ language-script combinations**, including:
### Supported Languages & Scripts
| Script | Languages |
| :--- | :--- |
| **Latin** | English, French, German, Spanish, Indonesian, Swahili, Yoruba |
| **Arabic** | Arabic (Modern Standard), Arabic (MSA with Tashkeel/Diacritics), Persian (Farsi) |
| **Arabizi** | Moroccan Arabizi, Egyptian Arabizi (incorporating numbers like 2, 3, 5, 7, 9) |
| **Romanized** | Romanized Russian, Romanized Japanese (Hepburn) |
| **Cyrillic** | Russian, Bulgarian |
| **Devanagari** | Hindi, Marathi |
| **Other Scripts** | Greek, Armenian, Georgian, Korean (Hangul), Thai, Hebrew |
### Special Features
- **Tashkeel (Diacritics)**: A dedicated Arabic split where every word includes full vocalization. This tests whether models can "see" through diacritics to identify core letters or accurately count the diacritics themselves.
- **Arabizi (transliteration with numbers)**: Modern Arabic dialects written in Latin script using numerals to represent sounds not found in English (e.g., `3` for 'ayn, `7` for ha). This is a unique challenge for tokenizers.
- **Tonal Marks & Diacritics**: Extensive coverage of Yoruba (tonal marks) and European accents (French, German, Spanish, etc.).
- **Romanization**: Tests how models handle transliterated text (Russian, Japanese), which often has ambiguous tokenization boundaries.
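To see why diacritics complicate character-level tasks, the standard `unicodedata` module can separate base characters from combining marks. This is only an illustration of the underlying Unicode behavior: `split_marks` is a hypothetical helper, and whether SpellBench counts a diacritic as a character of its own is not stated here.

```python
import unicodedata

def split_marks(word):
    """Return (base characters, combining marks) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = [c for c in decomposed if unicodedata.combining(c)]
    return base, marks

# Arabic k-t-b with a fatha on every letter (fully vocalized)
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
base, marks = split_marks(vocalized)
```

Here `len(vocalized)` is 6 while `len(base)` is 3, so a naive length count and a count of core letters disagree. Note that for languages such as Yoruba, stripping marks also removes distinctions that are part of the alphabet itself.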
### Per-Language Analysis
Every sample in the dataset includes `language` and `script` metadata. We recommend reporting accuracy scores broken down by script to identify where models may have "blind spots" due to their training data or vocabulary constraints.
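A per-script breakdown takes only a few lines. In this sketch, `accuracy_by` is a hypothetical helper, and `predictions` is assumed to be a dict mapping item `id` to an already extracted answer. The comparison is a simple case-insensitive match, not the full pipeline described above.

```python
from collections import defaultdict

def accuracy_by(items, predictions, field="script"):
    """Group accuracy by a metadata field such as 'script' or 'language'."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        key = item["metadata"][field]
        totals[key] += 1
        predicted = predictions.get(item["id"], "")
        if predicted.strip().lower() == item["expected"].strip().lower():
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

items = [
    {"id": "a", "expected": "olleh", "metadata": {"language": "English", "script": "Latin"}},
    {"id": "b", "expected": "5", "metadata": {"language": "French", "script": "Latin"}},
    {"id": "c", "expected": "3", "metadata": {"language": "Arabic", "script": "Arabic"}},
]
predictions = {"a": "Olleh", "b": "6", "c": "3"}
by_script = accuracy_by(items, predictions)
by_language = accuracy_by(items, predictions, field="language")
```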
## Running the Evaluation
### With a HuggingFace Transformers model
```bash
python run_eval_transformers.py \
--model "Qwen/Qwen3.5-0.8B" \
--tasks spell,reverse,word_length \
--output results.json
```
### With an OpenAI-compatible API
```bash
python run_eval_openai.py \
--base-url "https://api.openai.com/v1" \
--model "gpt-5" \
--tasks spell,reverse \
--output results.json
```
See the runner scripts for full options.
## Regenerating the Dataset
Dataset generation is deterministic (seed=42). To regenerate:
```bash
python generate.py # generates data/
python generate.py --check # verify only
```
## Citation
```bibtex
@misc{spellbench2026,
title={SpellBench: Can LLMs spell? A Character-Level Linguistic Evaluation Benchmark},
author={Omar Kamali},
year={2026},
url={https://github.com/omneity-labs/spellbench}
}
```
## License
Code is MIT licensed. Data is CC BY-SA licensed.
Provided by:
omneity-labs



