newsbang/math_benbench_data_leak_analysis
Source: Hugging Face. Updated: 2024-12-06. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/newsbang/math_benbench_data_leak_analysis
Description:
This is a math dataset of 1M samples mixed from four open-source datasets. It was used to analyze contamination on the MATH dataset. Each record includes fields such as the question, the solution, and a 5-gram list, and the meaning of each field is documented. The dataset is composed of four subsets, open_math, meta_math, orca_math, and math_instruct, drawn from the open-source datasets OpenMathInstruct-2, MetaMath, Orca-math-word-problems-200k, and MathInstruct, each with its own source and sample count. The analysis section sets a score threshold to decide whether a sample is relevant to the MATH dataset and reports the proportion of samples above that threshold in each subset. The experimental results section reports test results after fine-tuning the Qwen2.5-7B model, and analyzes the data leakage and its impact on model performance.
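The description does not say how the relevance score is computed from the 5-gram list. As a rough illustration only, the sketch below assumes the score is the fraction of a sample's 5-grams that also occur in some benchmark question, with plain whitespace tokenization. The function names, the tokenization, and the threshold value are all hypothetical and are not taken from the dataset.

```python
def five_grams(text, n=5):
    """Return the set of word n-grams in `text` (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(sample, reference, n=5):
    """Fraction of the sample's n-grams that also appear in the reference.

    This scoring rule is an assumption made for illustration; the dataset's
    own score may be defined differently.
    """
    sample_grams = five_grams(sample, n)
    if not sample_grams:
        return 0.0  # texts shorter than n tokens have no n-grams
    return len(sample_grams & five_grams(reference, n)) / len(sample_grams)


def fraction_above(samples, references, threshold):
    """Share of samples whose best score against any reference meets the threshold."""
    if not samples:
        return 0.0
    flagged = sum(
        1
        for sample in samples
        if max((overlap_score(sample, ref) for ref in references), default=0.0) >= threshold
    )
    return flagged / len(samples)
```

Under these assumptions, calling `fraction_above` once per subset with the MATH questions as `references` would give a per-subset proportion of the kind the analysis section reports.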
Provider:
newsbang



