Name: floatai/TKEval
Creator: floatai
Published: 2024-12-11 12:53:59
License: apache-2.0

Download link:

https://hf-mirror.com/datasets/floatai/TKEval

Resource description:

---
license: apache-2.0
task_categories:
- text2text-generation
language:
- en
---

# Dataset Card for TKEval

## Contents

- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Splits](#data-splits)
- [Citation](#citation)

## Dataset Description

**_The curse of tokenization_**: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations, and largely oblivious to the internal structure of tokens.

TKEval is an evaluation benchmark for systematically assessing the impact of _"The curse of tokenization"_ on language model performance.

- **Repository**: https://github.com/FloatAI/TKEval
- **Paper**: https://arxiv.org/pdf/2406.11687

## Dataset Structure

```
.
├── complex_problem_solving
│   ├── cycled_letters_all_data_0123_shots.json
│   ├── identify_math_theorems_all_data_0123_shots.json
│   └── word_unscrambling_all_data_0123_shots.json
├── token_structure_probing
│   ├── test
│   │   ├── multi_token_prob.common_substrings.all_data_0123_shots.json
│   │   ├── multi_token_prob.longest_common_subsequences.all_data_0123_shots.json
│   │   ├── multi_token_prob.longest_common_substrings.all_data_0123_shots.json
│   │   ├── token_struct_prob.char_case_conversion.all_data_0123_shots.json
│   │   ├── token_struct_prob.character_count.all_data_0123_shots.json
│   │   ├── token_struct_prob.nth_character.all_data_0123_shots.json
│   │   └── token_struct_prob.nth_character_from_end.all_data_0123_shots.json
│   └── train
│       ├── multi_token_prob.common_substrings.jsonl
│       ├── multi_token_prob.longest_common_subsequences.jsonl
│       ├── multi_token_prob.longest_common_substrings.jsonl
│       ├── token_struct_prob.char_case_conversion.jsonl
│       ├── token_struct_prob.character_count.jsonl
│       ├── token_struct_prob.nth_character.jsonl
│       └── token_struct_prob.nth_character_from_end.jsonl
└── typographical_variation
    ├── data.typo.char.noise
    │   ├── ngram_2
    │   ├── ngram_3
    │   └── ngram_5
    ├── data.typo.char.permute
    │   ├── ngram_2
    │   ├── ngram_3
    │   └── ngram_5
    ├── data.typo.token.noise
    │   ├── llama3
    │   └── mistral
    └── data.typo.token.permute
        ├── llama3
        └── mistral
```

## Data Splits

<table>
  <tr>
    <th>Main Task</th>
    <th>Sub Task</th>
    <th>Train</th>
    <th>Test</th>
  </tr>
  <tr>
    <td rowspan="3">Complex Problem Solving</td>
    <td>Cycled Letters in Word</td>
    <td>-</td>
    <td>20,975</td>
  </tr>
  <tr>
    <td>Word Unscrambling</td>
    <td>-</td>
    <td>8,917</td>
  </tr>
  <tr>
    <td>Identify Math Theorems</td>
    <td>-</td>
    <td>53</td>
  </tr>
  <tr>
    <td rowspan="7">Token Structure Probe</td>
    <td>Character Count</td>
    <td>20,775</td>
    <td>200</td>
  </tr>
  <tr>
    <td>N-th Character</td>
    <td>31,241</td>
    <td>200</td>
  </tr>
  <tr>
    <td>N-th Character Reverse</td>
    <td>31,316</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Case Conversion</td>
    <td>27,738</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Common Substrings</td>
    <td>4,800</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Longest Common Substrings</td>
    <td>4,800</td>
    <td>200</td>
  </tr>
  <tr>
    <td>Longest Common Subsequences</td>
    <td>4,800</td>
    <td>200</td>
  </tr>
  <tr>
    <td rowspan="4">Typographical Variation</td>
    <td>GSM8K</td>
    <td>-</td>
    <td>1,319</td>
  </tr>
  <tr>
    <td>MMLU</td>
    <td>-</td>
    <td>14,042</td>
  </tr>
  <tr>
    <td>TruthfulQA</td>
    <td>-</td>
    <td>817</td>
  </tr>
  <tr>
    <td>HumanEval</td>
    <td>-</td>
    <td>164</td>
  </tr>
</table>

## Citation

```bibtex
@inproceedings{chai2024tokenization,
  title={Tokenization Falling Short: On Subword Robustness in Large Language Models},
  author={Chai, Yekun and Fang, Yewei and Peng, Qiwei and Li, Xuhong},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
  pages={1582--1599},
  year={2024}
}
```
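
The file layout under `token_structure_probing` follows a regular naming pattern, so paths can be derived from the sub-task name. The snippet below is a minimal sketch, assuming the repository has already been downloaded to a local folder. The helper names (`probing_paths`, `read_jsonl`, `read_json`) are hypothetical and not part of any TKEval tooling, and the card does not document the record schema, so the readers return parsed JSON values without assuming any field names.

```python
import json
from pathlib import Path

# Sub-task names copied from the file names in the directory tree.
STRUCTURE_PROBING_TASKS = [
    "multi_token_prob.common_substrings",
    "multi_token_prob.longest_common_subsequences",
    "multi_token_prob.longest_common_substrings",
    "token_struct_prob.char_case_conversion",
    "token_struct_prob.character_count",
    "token_struct_prob.nth_character",
    "token_struct_prob.nth_character_from_end",
]


def probing_paths(root, task):
    """Return (train_path, test_path) for one token-structure-probing sub-task.

    `root` is assumed to be a local copy of the dataset repository.
    """
    base = Path(root) / "token_structure_probing"
    train = base / "train" / f"{task}.jsonl"
    test = base / "test" / f"{task}.all_data_0123_shots.json"
    return train, test


def read_jsonl(path):
    """Parse a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def read_json(path):
    """Parse a regular JSON file and return its top-level value."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `probing_paths("TKEval", "token_struct_prob.character_count")` yields the train `.jsonl` and test `.json` paths for the Character Count sub-task. The result of `read_jsonl` or `read_json` should be inspected before relying on any particular keys.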

Application scenarios: