Khawn2u/Fixberry

Name: Khawn2u/Fixberry
Creator: Khawn2u
Published: 2024-10-13 08:24:31
License: No description available

Source: Hugging Face. Updated 2024-10-13. Indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/Khawn2u/Fixberry

Resource overview:

FixBerry is a small dataset designed to train models to count the letters in a word correctly. The author notes that even the best current large language models (LLMs) fail to count specific letters in certain words, such as the number of Rs in strawberry. Models also struggle with words such as keeper and parallel, yet handle others such as pepper and peeper without trouble. The author suspects this is related to tokenization: the model may count only one occurrence of a letter per token, even if the token contains two. The author strongly recommends against using this dataset and suggests fixing the underlying problem instead. The dataset was processed into Data.csv from a list of English words on Kaggle, described as a CSV file containing all English words.
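The tokenization hypothesis can be made concrete with a short Python sketch. It compares a ground-truth letter count with a counter that credits each token at most once, which is one way to read the suspected failure mode. The function names and the token split below are assumptions made for illustration; they are not taken from the dataset or from any real tokenizer.

```python
def count_letter(word: str, letter: str) -> int:
    """Ground-truth count of `letter` in `word`, case-insensitive."""
    return word.lower().count(letter.lower())


def count_once_per_token(tokens: list[str], letter: str) -> int:
    """Simulate the suspected failure: every token that contains the letter
    contributes exactly 1, however many times the letter occurs in it."""
    return sum(1 for tok in tokens if letter.lower() in tok.lower())


# Hypothetical token split, chosen for illustration only; a real tokenizer
# may split this word differently.
tokens = ["str", "aw", "berry"]
word = "".join(tokens)

print(count_letter(word, "r"))            # 3
print(count_once_per_token(tokens, "r"))  # 2
```

Under this assumed split, the double R inside "berry" is counted once, so the simulated count is 2 rather than 3. Whether an actual model behaves this way is the author's suspicion, not an established result.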

Provided by:

Khawn2u
