AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
Source: Hugging Face. Updated: 2025-10-04. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M
Description:
This dataset combines high-quality instruction-following data from two sources: Tulu-3 SFT Mixture and Orca AgentInstruct. It is available in five sizes (50k, 100k, 250k, 500k, and 1m), where each name gives the number of samples it contains. The data is divided into categories, including creative content, analytical reasoning, code, knowledge & QA, creative & communication, and safety & alignment. Each example contains a list of conversation turns, a task category, and a source dataset identifier. Samples are selected with embedding-based k-means clustering to ensure diversity and representativeness. Licensing varies by source dataset.
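The description does not say which embedding model, cluster count, or selection rule was used, so the following is only a minimal sketch of the general idea behind embedding-based k-means sampling, not the dataset author's pipeline. It assumes each example has already been mapped to an embedding vector (tiny 2-D toy vectors stand in for real embeddings here), clusters those vectors, and keeps the example nearest each centroid. The function names `kmeans` and `diverse_sample` and the farthest-point initialization are illustrative choices, not part of the dataset.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point initialization."""
    # Initialization (an assumption of this sketch): start from the first point,
    # then repeatedly add the point farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def diverse_sample(points, n_samples):
    """Return sorted indices of the points nearest each of n_samples centroids.

    May return fewer than n_samples indices if some cluster ends up empty.
    """
    assign, centroids = kmeans(points, k=n_samples)
    chosen = []
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(points[i], centre)))
    return sorted(chosen)


# Toy "embeddings": three well-separated groups of three points each.
toy = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
    (5.0, 5.1), (5.1, 5.0), (5.05, 5.05),
    (10.0, 0.0), (10.1, 0.1), (10.05, 0.05),
]
picked = diverse_sample(toy, n_samples=3)
print(picked)  # [2, 5, 8]: one representative from each group
```

Picking one representative per cluster spreads the sample across the embedding space instead of concentrating it in dense regions. A stratified variant could run the same routine separately within each task category, though whether the dataset was built that way is an assumption here.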
Provider:
AmanPriyanshu



