beccabai/slimpajama_labeled
Source: Hugging Face. Updated 2024-10-21. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/beccabai/slimpajama_labeled
Description:
This dataset is the labeled version of the SlimPajama-627B training set used in the paper "Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining". It targets text generation tasks, and its language is English. Each sample consists of a unique id, the content text, and metadata. The metadata holds attribute scores (such as educational relevance) and domain information. The dataset also defines a mapping between topics and labels, covering the following categories: activity, education, entertainment, finance, health, business and industrial, infrastructure, literature and art, nature, others, law and government, networking, and technology.
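As a rough illustration of the sample layout described above, the sketch below filters records by topic and by a minimum educational score. The field names (`id`, `content`, `meta`, `attributes`, `edu_score`, `topic`), the example records, and the integer label ids are all assumptions made for illustration; the actual schema and topic-to-label mapping should be taken from the dataset card.

```python
# Topic names as listed in the dataset description.
TOPICS = [
    "activity", "education", "entertainment", "finance", "health",
    "business and industrial", "infrastructure", "literature and art",
    "nature", "others", "law and government", "networking", "technology",
]

# ASSUMPTION: label ids follow the listing order above. The real mapping
# may differ; check the dataset card before relying on these numbers.
TOPIC_TO_LABEL = {name: idx for idx, name in enumerate(TOPICS)}

# Hypothetical records. Key names and values are illustrative only.
SAMPLE_RECORDS = [
    {"id": "doc-0", "content": "Photosynthesis converts light into energy.",
     "meta": {"attributes": {"edu_score": 0.9}, "topic": "education"}},
    {"id": "doc-1", "content": "Stocks rallied on Friday.",
     "meta": {"attributes": {"edu_score": 0.2}, "topic": "finance"}},
    {"id": "doc-2", "content": "Enroll now for summer camp!",
     "meta": {"attributes": {"edu_score": 0.3}, "topic": "education"}},
]


def select_samples(records, topic, min_edu_score):
    """Return records with the given topic whose edu_score meets the threshold."""
    if topic not in TOPIC_TO_LABEL:
        raise ValueError(f"unknown topic: {topic!r}")
    return [
        rec for rec in records
        if rec["meta"]["topic"] == topic
        and rec["meta"]["attributes"]["edu_score"] >= min_edu_score
    ]


if __name__ == "__main__":
    for rec in select_samples(SAMPLE_RECORDS, "education", 0.5):
        print(rec["id"], rec["content"])
```

With real data, the same filter would be applied to records loaded from the download link once the hypothetical keys are replaced by the ones the dataset actually uses.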
Provider:
beccabai



