floatai/TKEval

Source: Hugging Face · Updated 2024-12-11 · Indexed 2025-04-26
Download link:
https://hf-mirror.com/datasets/floatai/TKEval
Resource description:
---
license: apache-2.0
task_categories:
- text2text-generation
language:
- en
---

# Dataset Card for TKEval

## Contents

- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Splits](#data-splits)
- [Citation](#citation)

## Dataset Description

**_The curse of tokenization_**: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations, and largely oblivious to the internal structure of tokens.

TKEval is an evaluation benchmark for systematically assessing the impact of _"The curse of tokenization"_ on language model performance.

- **Repository**: https://github.com/FloatAI/TKEval
- **Paper**: https://arxiv.org/pdf/2406.11687

## Dataset Structure

```
.
├── complex_problem_solving
│   ├── cycled_letters_all_data_0123_shots.json
│   ├── identify_math_theorems_all_data_0123_shots.json
│   └── word_unscrambling_all_data_0123_shots.json
├── token_structure_probing
│   ├── test
│   │   ├── multi_token_prob.common_substrings.all_data_0123_shots.json
│   │   ├── multi_token_prob.longest_common_subsequences.all_data_0123_shots.json
│   │   ├── multi_token_prob.longest_common_substrings.all_data_0123_shots.json
│   │   ├── token_struct_prob.char_case_conversion.all_data_0123_shots.json
│   │   ├── token_struct_prob.character_count.all_data_0123_shots.json
│   │   ├── token_struct_prob.nth_character.all_data_0123_shots.json
│   │   └── token_struct_prob.nth_character_from_end.all_data_0123_shots.json
│   └── train
│       ├── multi_token_prob.common_substrings.jsonl
│       ├── multi_token_prob.longest_common_subsequences.jsonl
│       ├── multi_token_prob.longest_common_substrings.jsonl
│       ├── token_struct_prob.char_case_conversion.jsonl
│       ├── token_struct_prob.character_count.jsonl
│       ├── token_struct_prob.nth_character.jsonl
│       └── token_struct_prob.nth_character_from_end.jsonl
└── typographical_variation
    ├── data.typo.char.noise
    │   ├── ngram_2
    │   ├── ngram_3
    │   └── ngram_5
    ├── data.typo.char.permute
    │   ├── ngram_2
    │   ├── ngram_3
    │   └── ngram_5
    ├── data.typo.token.noise
    │   ├── llama3
    │   └── mistral
    └── data.typo.token.permute
        ├── llama3
        └── mistral
```

## Data Splits

<table>
  <tr>
    <th>Main Task</th>
    <th>Sub Task</th>
    <th>Train</th>
    <th>Test</th>
  </tr>
  <tr>
    <td rowspan="3">Complex Problem Solving</td>
    <td>Cycled Letters in Word</td>
    <td>-</td>
    <td>20,975</td>
  </tr>
  <tr>
    <td>Word Unscrambling</td>
    <td>-</td>
    <td>8,917</td>
  </tr>
  <tr>
    <td>Identify Math Theorems</td>
    <td>-</td>
    <td>53</td>
  </tr>
  <tr>
    <td rowspan="7">Token Structure Probe</td>
    <td>Character Count</td>
    <td>20,775</td>
    <td>200</td>
  </tr>
  <tr>
    <td>N-th Character</td>
    <td>31,241</td>
    <td>200</td>
  </tr>
  <tr>
    <td>N-th Character Reverse</td>
    <td>31,316</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Case Conversion</td>
    <td>27,738</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Common Substrings</td>
    <td>4,800</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Longest Common Substrings</td>
    <td>4,800</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Longest Common Subsequences</td>
    <td>4,800</td>
    <td>200</td>
  </tr>
  <tr>
    <td rowspan="4">Typographical Variation</td>
    <td>GSM8K</td>
    <td>-</td>
    <td>1,319</td>
  </tr>
  <tr>
    <td>MMLU</td>
    <td>-</td>
    <td>14,042</td>
  </tr>
  <tr>
    <td>TruthfulQA</td>
    <td>-</td>
    <td>817</td>
  </tr>
  <tr>
    <td>HumanEval</td>
    <td>-</td>
    <td>164</td>
  </tr>
</table>

## Citation

```bibtex
@inproceedings{chai2024tokenization,
  title={Tokenization Falling Short: On Subword Robustness in Large Language Models},
  author={Chai, Yekun and Fang, Yewei and Peng, Qiwei and Li, Xuhong},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
  pages={1582--1599},
  year={2024}
}
```
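
To make the Token Structure Probe sub-task names in the Data Splits table concrete, the sketch below gives plain-Python reference functions for the character-level operations those names suggest. It is an illustration only: the function names are hypothetical, 1-based indexing is an assumption, and the dataset's actual prompt format and answer conventions are defined by its files, not by this code.

```python
"""Illustrative reference operations for the character-level probing tasks.

All function names are hypothetical, and 1-based positions are an assumption;
check the dataset files for the conventions TKEval actually uses.
"""


def character_count(word: str, char: str) -> int:
    """Count how many times `char` occurs in `word`."""
    return word.count(char)


def nth_character(word: str, n: int) -> str:
    """Return the n-th character of `word`, counting from 1 (assumed)."""
    return word[n - 1]


def nth_character_from_end(word: str, n: int) -> str:
    """Return the n-th character counting from the end, from 1 (assumed)."""
    return word[-n]


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by `a` and `b`."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def longest_common_subsequence_length(a: str, b: str) -> int:
    """Return the length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

These operations are trivial on raw characters. The benchmark's premise is that they become harder for a model that only sees subword token identifiers.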
Provided by:
floatai