AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M

Name: AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
Creator: AmanPriyanshu
Published: 2025-10-04 05:31:40
License: No description available

Source: Hugging Face. Updated: 2025-10-04. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M

Description:

This dataset combines high-quality instruction-following data from two sources: Tulu-3 SFT Mixture and Orca AgentInstruct. It is available in five sizes (50k, 100k, 250k, 500k, and 1m), each named for the number of samples it contains. The data is organized into task categories, including creative content, analytical reasoning, code, knowledge & QA, creative & communication, and safety & alignment. Each example contains a list of conversation turns, a task category, and a source dataset identifier. Sampling uses embedding-based k-means clustering to ensure diversity and representativeness. Licensing varies by source dataset.
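The sampling idea described above (cluster embeddings within each category, then keep representatives from each cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dataset author's actual pipeline: the field names `messages`, `category`, and `source`, the toy 2-D embeddings, and the choice to keep the member nearest each centroid are all hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k random points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assign, centroids


def stratified_kmeans_sample(examples, embeddings, per_category, seed=0):
    """Pick at most `per_category` example indices from each category."""
    # Stratify: group example indices by their (assumed) category field.
    by_cat = defaultdict(list)
    for i, ex in enumerate(examples):
        by_cat[ex["category"]].append(i)

    selected = []
    for cat in sorted(by_cat):
        idxs = by_cat[cat]
        pts = [embeddings[i] for i in idxs]
        k = min(per_category, len(pts))
        assign, centroids = kmeans(pts, k, seed=seed)
        for c in range(k):
            members = [j for j, a in enumerate(assign) if a == c]
            if members:
                # Keep the member closest to the cluster centroid.
                best = min(members,
                           key=lambda j: sq_dist(pts[j], centroids[c]))
                selected.append(idxs[best])
    return selected


# Toy data: hypothetical field names and made-up 2-D "embeddings".
examples = [
    {"messages": [], "category": "code" if i < 6 else "safety", "source": "toy"}
    for i in range(10)
]
embeddings = [(float(i % 3), float(i // 3)) for i in range(10)]

picked = stratified_kmeans_sample(examples, embeddings, per_category=2)
print(picked)
```

Clustering inside each category keeps every category represented, and taking one example per cluster spreads the picks across distinct regions of the embedding space. A real pipeline would likely use a sentence-embedding model and a library k-means implementation instead of this pure-Python version.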

Provider:

AmanPriyanshu