floatai/TKEval
Source: Hugging Face | Updated: 2024-12-11 | Indexed: 2025-04-26
Download link:
https://hf-mirror.com/datasets/floatai/TKEval
Resource description:
---
license: apache-2.0
task_categories:
- text2text-generation
language:
- en
---
# Dataset Card for TKEval
## Contents
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Splits](#data-splits)
- [Citation](#citation)
## Dataset Description
**_The curse of tokenization_**: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations, and largely oblivious to the internal structure of tokens.
TKEval is an evaluation benchmark for systematically assessing the impact of _"The curse of tokenization"_ on language model performance.
- **Repository**: https://github.com/FloatAI/TKEval
- **Paper**: https://arxiv.org/pdf/2406.11687
## Dataset Structure
```
.
├── complex_problem_solving
│   ├── cycled_letters_all_data_0123_shots.json
│   ├── identify_math_theorems_all_data_0123_shots.json
│   └── word_unscrambling_all_data_0123_shots.json
├── token_structure_probing
│   ├── test
│   │   ├── multi_token_prob.common_substrings.all_data_0123_shots.json
│   │   ├── multi_token_prob.longest_common_subsequences.all_data_0123_shots.json
│   │   ├── multi_token_prob.longest_common_substrings.all_data_0123_shots.json
│   │   ├── token_struct_prob.char_case_conversion.all_data_0123_shots.json
│   │   ├── token_struct_prob.character_count.all_data_0123_shots.json
│   │   ├── token_struct_prob.nth_character.all_data_0123_shots.json
│   │   └── token_struct_prob.nth_character_from_end.all_data_0123_shots.json
│   └── train
│       ├── multi_token_prob.common_substrings.jsonl
│       ├── multi_token_prob.longest_common_subsequences.jsonl
│       ├── multi_token_prob.longest_common_substrings.jsonl
│       ├── token_struct_prob.char_case_conversion.jsonl
│       ├── token_struct_prob.character_count.jsonl
│       ├── token_struct_prob.nth_character.jsonl
│       └── token_struct_prob.nth_character_from_end.jsonl
└── typographical_variation
    ├── data.typo.char.noise
    │   ├── ngram_2
    │   ├── ngram_3
    │   └── ngram_5
    ├── data.typo.char.permute
    │   ├── ngram_2
    │   ├── ngram_3
    │   └── ngram_5
    ├── data.typo.token.noise
    │   ├── llama3
    │   └── mistral
    └── data.typo.token.permute
        ├── llama3
        └── mistral
```
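The layout above mixes `.json` files (test sets) and `.jsonl` files (training sets). The following is a minimal Python sketch for reading both from a local copy of the dataset. It assumes the files have already been downloaded into a directory (the `TKEval/` path in the usage note is hypothetical), and the helper names `read_records` and `list_task_files` are illustrative rather than part of the TKEval repository. The card does not describe the record schema, so records are returned as parsed without any field access. Entries under `typographical_variation` are shown without extensions in the tree, so this sketch does not attempt to read them.

```python
import json
from pathlib import Path


def read_records(path):
    """Parse one data file: `.jsonl` as one JSON value per non-empty line,
    anything else as a single JSON document."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


def list_task_files(root):
    """Map each top-level task directory to the relative paths of the
    `.json` / `.jsonl` files found beneath it."""
    root = Path(root)
    tasks = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in {".json", ".jsonl"}:
            continue
        rel = path.relative_to(root)
        if len(rel.parts) < 2:
            continue  # ignore files that sit directly in the root
        tasks.setdefault(rel.parts[0], []).append(rel.as_posix())
    return tasks
```

With a local copy in place, `list_task_files("TKEval")` would give an overview of the available files per task, and `read_records` could then be applied to any of the listed paths.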
## Data Splits
<table>
<tr>
<th>Main Task</th>
<th>Sub Task</th>
<th>Train</th>
<th>Test</th>
</tr>
<tr>
<td rowspan="3">Complex Problem Solving</td>
<td>Cycled Letters in Word</td>
<td>-</td>
<td>20,975</td>
</tr>
<tr>
<td>Word Unscrambling</td>
<td>-</td>
<td>8,917</td>
</tr>
<tr>
<td>Identify Math Theorems</td>
<td>-</td>
<td>53</td>
</tr>
<tr>
<td rowspan="7">Token Structure Probe</td>
<td>Character Count</td>
<td>20,775</td>
<td>200</td>
</tr>
<tr>
<td>N-th Character</td>
<td>31,241</td>
<td>200</td>
</tr>
<tr>
<td>N-th Character Reverse</td>
<td>31,316</td>
<td>200</td>
</tr>
<tr>
<td>Case Conversion</td>
<td>27,738</td>
<td>200</td>
</tr>
<tr>
<td>Common Substrings</td>
<td>4,800</td>
<td>200</td>
</tr>
<tr>
<td>Longest Common Substrings</td>
<td>4,800</td>
<td>200</td>
</tr>
<tr>
<td>Longest Common Subsequences</td>
<td>4,800</td>
<td>200</td>
</tr>
<tr>
<td rowspan="4">Typographical Variation</td>
<td>GSM8K</td>
<td>-</td>
<td>1,319</td>
</tr>
<tr>
<td>MMLU</td>
<td>-</td>
<td>14,042</td>
</tr>
<tr>
<td>TruthfulQA</td>
<td>-</td>
<td>817</td>
</tr>
<tr>
<td>HumanEval</td>
<td>-</td>
<td>164</td>
</tr>
</table>
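For a quick sense of scale, the per-task totals can be tallied from the table. The sketch below copies the counts from the table as they appear there; the `SPLITS` dictionary and `split_totals` helper are illustrative names, and a `-` cell is represented as `None` and counted as zero training examples.

```python
# Counts copied from the Data Splits table as (train, test);
# None marks a "-" cell, i.e. no training split is listed.
SPLITS = {
    "Complex Problem Solving": {
        "Cycled Letters in Word": (None, 20975),
        "Word Unscrambling": (None, 8917),
        "Identify Math Theorems": (None, 53),
    },
    "Token Structure Probe": {
        "Character Count": (20775, 200),
        "N-th Character": (31241, 200),
        "N-th Character Reverse": (31316, 200),
        "Case Conversion": (27738, 200),
        "Common Substrings": (4800, 200),
        "Longest Common Substrings": (4800, 200),
        "Longest Common Subsequences": (4800, 200),
    },
    "Typographical Variation": {
        "GSM8K": (None, 1319),
        "MMLU": (None, 14042),
        "TruthfulQA": (None, 817),
        "HumanEval": (None, 164),
    },
}


def split_totals(splits):
    """Return {main_task: (total_train, total_test)}."""
    totals = {}
    for task, subtasks in splits.items():
        train = sum(tr or 0 for tr, _ in subtasks.values())
        test = sum(te for _, te in subtasks.values())
        totals[task] = (train, test)
    return totals


for task, (train, test) in split_totals(SPLITS).items():
    print(f"{task}: train={train:,} test={test:,}")
```

Only the Token Structure Probe tasks list training data; the other two task families are test-only according to the table.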
## Citation
```bibtex
@inproceedings{chai2024tokenization,
title={Tokenization Falling Short: On Subword Robustness in Large Language Models},
author={Chai, Yekun and Fang, Yewei and Peng, Qiwei and Li, Xuhong},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
pages={1582--1599},
year={2024}
}
```
Provided by:
floatai



