Various NLP Datasets
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/Kinds-of-Intelligence-CFI/benchmark-ground-truth-predictability
Resource description:
This dataset supports a study of how well logistic regression classifiers built on n-gram features can predict ground-truth labels across a range of benchmarks designed for large language models (LLMs). The paper assesses the internal validity of those benchmarks by showing that a simple classifier can score highly without having the capability a benchmark is meant to measure. It also examines LLM performance in light of where the n-gram classifiers succeed and where they fail. The analyses span multiple datasets; in each, the task is to predict the label of a benchmark instance from its n-gram features.
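
To make the core task concrete, here is a minimal sketch of an n-gram logistic regression classifier in plain Python. It is an illustration only and is not taken from the repository: the word-level tokenization, the n-gram orders, the SGD training loop, the hyperparameters, and the toy instances are all assumptions chosen for brevity. The toy data plants a surface cue ("not" versus "indeed") that fully determines the label, which is the kind of shortcut such a classifier can exploit without any understanding of the task.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Count word n-grams of orders 1..n_max in one benchmark instance."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def score(weights, bias, feats):
    """Linear score (log-odds of label 1) for a bag of n-gram counts."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())


def train_logreg(texts, labels, n_max=2, epochs=200, lr=0.5, l2=1e-3):
    """Fit binary logistic regression on n-gram counts with plain SGD.

    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    weights, bias = {}, 0.0
    data = [(ngrams(t, n_max), y) for t, y in zip(texts, labels)]
    for _ in range(epochs):
        for feats, y in data:
            # Clip the score so math.exp cannot overflow.
            z = max(-30.0, min(30.0, score(weights, bias, feats)))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss w.r.t. the score
            bias -= lr * err
            for f, v in feats.items():
                w = weights.get(f, 0.0)
                weights[f] = w - lr * (err * v + l2 * w)
    return weights, bias


def predict(model, text, n_max=2):
    """Return the predicted label (0 or 1) for one instance."""
    weights, bias = model
    return 1 if score(weights, bias, ngrams(text, n_max)) > 0 else 0


# Hypothetical toy instances: the label is fully determined by a surface cue.
train_texts = [
    "the statement is not true",
    "the claim is not correct",
    "this answer is not valid",
    "the statement is indeed true",
    "the claim is indeed correct",
    "this answer is indeed valid",
]
train_labels = [0, 0, 0, 1, 1, 1]

model = train_logreg(train_texts, train_labels)
print(predict(model, "the result is not right"))
print(predict(model, "the result is indeed right"))
```

If a classifier of this kind scores well on held-out instances of a benchmark, the labels are at least partly predictable from surface n-gram patterns, which is the signal the study uses to question internal validity.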



