SHED
Source: arXiv; updated 2024-04-23; indexed 2024-07-23
Download link:
https://github.com/Lucidreamer9/SHED-Shapley-Based-Automated-Dataset-Refinement
Description:
SHED is a Shapley-value-based framework for automated dataset refinement, developed at the University of Maryland. It aims to make fine-tuning of large language models (LLMs) more efficient by refining the fine-tuning dataset. The framework has three components: model-agnostic clustering, a proxy-based Shapley calculator, and optimization-aware sampling. Rather than scoring every sample, it selects representative samples from the original dataset for Shapley value evaluation and uses the result to build a small, high-quality dataset. This reduces computational complexity and improves the transferability of the refined dataset, which is reported to maintain high performance across different LLMs. SHED is mainly used to improve LLM performance on specific tasks and to address redundancy and noise in large-scale datasets.
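The three components described above can be illustrated with a toy sketch. This is not SHED's implementation: the function names, the 2-D "embeddings", the skill-coverage utility, and the weighted draw are all assumptions made for illustration. In particular, the sketch computes exact Shapley values by enumerating every ordering of three cluster representatives, whereas the framework described here uses a proxy-based calculator whose utility would come from fine-tuning and evaluating a model.

```python
import itertools
import math
import random


def kmeans(points, k, iters=10):
    """Plain k-means; centers start at evenly spaced points, so runs are deterministic."""
    centers = points[:: len(points) // k][:k]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return labels, centers


def pick_representatives(points, labels, centers):
    """Map each cluster id to the index of the member nearest its center."""
    reps = {}
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: math.dist(points[i], center))
    return reps


def exact_shapley(players, utility):
    """Average each player's marginal gain over every ordering (only feasible for few players)."""
    totals = {p: 0.0 for p in players}
    orderings = list(itertools.permutations(players))
    for order in orderings:
        coalition, prev = [], utility([])
        for p in order:
            coalition.append(p)
            cur = utility(coalition)
            totals[p] += cur - prev
            prev = cur
    return {p: t / len(orderings) for p, t in totals.items()}


def weighted_sample(labels, cluster_value, budget, seed=0):
    """Draw `budget` distinct sample indices, favoring clusters with higher value."""
    rng = random.Random(seed)
    pools = {c: [i for i, lab in enumerate(labels) if lab == c] for c in cluster_value}
    chosen = []
    while len(chosen) < budget and any(pools.values()):
        live = [c for c in pools if pools[c]]
        weights = [max(cluster_value[c], 1e-9) for c in live]  # keep weights positive
        c = rng.choices(live, weights=weights, k=1)[0]
        chosen.append(pools[c].pop(rng.randrange(len(pools[c]))))
    return chosen


# Hypothetical toy data: nine samples in three well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1)]
skills = [{"math"}] * 3 + [{"math", "code"}] * 3 + [{"chat"}] * 3


def toy_utility(indices):
    """Stand-in for proxy-model quality: number of distinct skills covered."""
    return len(set().union(*(skills[i] for i in indices)))


labels, centers = kmeans(points, k=3)                       # 1. clustering
reps = pick_representatives(points, labels, centers)
rep_values = exact_shapley(list(reps.values()), toy_utility)  # 2. Shapley on representatives
cluster_value = {c: rep_values[i] for c, i in reps.items()}
refined = weighted_sample(labels, cluster_value, budget=4)   # 3. value-weighted sampling
print(cluster_value, refined)
```

In this toy run the clusters receive values 0.5, 1.5, and 1.0. The first cluster scores lowest because its only skill is already covered by the second cluster, which is the kind of redundancy the description says the refinement is meant to reduce.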
Provider:
University of Maryland
Created:
2024-04-23