
VillanovaAI/villanova-sft-2603

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/VillanovaAI/villanova-sft-2603
Resource description:
---
language:
- de
- en
- es
- fr
- it
license: apache-2.0
task_categories:
- text-generation
- question-answering
tags:
- sft
- multilingual
- instruction-tuning
- chat
- safety
- reasoning
- villanova
size_categories:
- 1M<n<10M
pretty_name: Villanova SFT 2603
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: source_data
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: language
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_examples: 1711114
---

# Villanova-SFT-2603

**Villanova-SFT-2603** is a large-scale, multilingual supervised fine-tuning (SFT) collection of datasets. It contains **1,711,114 instruction-response conversations** spanning five European languages, covering chat, reasoning, code, knowledge, and safety tasks. This dataset was used to train the [Villanova-2B-2603](https://huggingface.co/VillanovaAI/Villanova-2B-2603) model family.

All data has been processed through a rigorous curation pipeline that enforces schema normalization, hash-based deduplication, token-length filtering, language verification, and identity decontamination.


---

## Dataset Summary

| | |
|---|---|
| **Total Examples** | 1,711,114 |
| **Languages** | English, French, German, Italian, Spanish |
| **Format** | Multi-turn chat (list of messages) |
| **Max Sequence Length** | 4,096 tokens |
| **Categories** | Chat, Reasoning, Safety |
| **Deduplication** | Hash-based, applied globally across all sources |
| **License** | Apache 2.0 |

---

## Trained Model

This dataset was used as the fine-tuning mixture for the following model:

| | |
|---|---|
| **Model** | [Villanova-2B-2603](https://huggingface.co/VillanovaAI/Villanova-2B-2603) |
| **Base Model** | [Villanova-2B-Base-2603](https://huggingface.co/VillanovaAI/Villanova-2B-Base-2603) |
| **Parameters** | 2.4B |
| **Architecture** | Decoder-only Transformer (LLaMA-based) |
| **Context Length** | 32,768 tokens |
| **License** | Apache 2.0 |

The resulting model ranked **#1 among fully open models** in overall average performance across 25 benchmarks, with particular strength in multilingual instruction following and safety alignment. For full evaluation results, see the [model card](https://huggingface.co/VillanovaAI/Villanova-2B-2603).

---

## Language Distribution

| Language | Examples | Percentage |
|---|---:|---:|
| English | 754,682 | 44.1% |
| French | 240,130 | 14.0% |
| German | 238,468 | 13.9% |
| Italian | 234,607 | 13.7% |
| Spanish | 243,227 | 14.2% |

---

## Source Distribution

The dataset is constructed from 24 curated data sources. Each source has been converted to a unified chat format, deduplicated, and filtered for quality.

| Source | Examples | Share |
|---|---:|---:|
| CohereLabs/aya_collection | 928,389 | 54.26% |
| SmolTalk (Multilingual) | 127,000 | 7.42% |
| VillanovaAI/multi_smoltalk_summarize_no_think | 96,011 | 5.61% |
| VillanovaAI/Multi-Persona-IF | 74,972 | 4.38% |
| VillanovaAI/Multi-SciRIFF | 67,794 | 3.96% |
| VillanovaAI/multi-oasst2 | 67,589 | 3.95% |
| VillanovaAI/multi-smol_rewrite | 53,262 | 3.11% |
| VillanovaAI/multi-python-alpaca | 51,949 | 3.04% |
| VillanovaAI/multi-dialogues | 48,004 | 2.81% |
| VillanovaAI/Multi-FLAN-NIv2 | 39,797 | 2.33% |
| VillanovaAI/Multi-FLAN-CoT | 39,586 | 2.31% |
| openai/gsm8k | 37,365 | 2.18% |
| VillanovaAI/multi-SelfCodeAlign | 17,558 | 1.03% |
| VillanovaAI/multi-dolly-15k | 14,963 | 0.87% |
| VillanovaAI/Multi-TableGPT | 13,117 | 0.77% |
| projecte-aina/RAG_Multilingual | 10,459 | 0.61% |
| VillanovaAI/Multi-FLAN-P3 | 6,000 | 0.35% |
| projecte-aina/MentorES | 4,578 | 0.27% |
| VillanovaAI/Multi-FLAN-Flan2021 | 3,906 | 0.23% |
| VillanovaAI/multi-aya_redteaming | 3,604 | 0.21% |
| VillanovaAI/Multi-Safety-Dataset | 2,500 | 0.15% |
| VillanovaAI/aya-masakhanews-en-fr | 2,017 | 0.12% |
| VillanovaAI/Multi-AdvBench | 520 | 0.03% |
| VillanovaAI/Villanova-hard-coded | 174 | 0.01% |

---

## Data Schema

Every example in the dataset conforms to the following unified schema:

| Field | Type | Description |
|---|---|---|
| `messages` | `list[{role: str, content: str}]` | Conversation turns in chat format. Roles: `system`, `user`, `assistant` |
| `source_data` | `string` | Identifier of the originating dataset or repository |
| `subset` | `string` | Fine-grained subset label within the source |
| `category` | `string` | Task category: `Chat`, `Reasoning`, `Code`, `Knowledge`, `Safety` |
| `language` | `string` | ISO 639 language code (e.g., `eng`, `deu`, `spa`, `fra`, `ita`) |
| `token_count` | `int32` | Number of tokens in the conversation (as tokenized by the target model) |

### Example

```json
{
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ],
  "source_data": "VillanovaAI/multi-dolly-15k",
  "subset": "dolly",
  "category": "Chat",
  "language": "eng",
  "token_count": 28
}
```

---

## Chat Template

The dataset is designed for use with the Villanova ChatML-style template. A default system prompt is injected when no explicit system message is present:

```
<|im_start|>system
You are Villanova, a helpful AI assistant built by Villanova.AI.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```

---

## Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("VillanovaAI/villanova-sft-2603", split="train")

print(f"Total examples: {len(dataset):,}")
print(dataset[0])
```

### Filtering by Language

```python
english_data = dataset.filter(lambda x: x["language"] == "eng")
german_data = dataset.filter(lambda x: x["language"] == "deu")
```

### Filtering by Category

```python
safety_data = dataset.filter(lambda x: x["category"] == "Safety")
reasoning_data = dataset.filter(lambda x: x["category"] == "Reasoning")
```

---

## Quality Control Pipeline

The following quality controls are applied during dataset construction:

1. **Schema Normalization** -- All sources are converted to a unified schema with standardized message formatting, ensuring consistency across heterogeneous data origins.
2. **Token-Length Filtering** -- Conversations exceeding 4,096 tokens (as measured by the target model tokenizer) are excluded to maintain training efficiency and prevent truncation artifacts.
3. **Hash-Based Deduplication** -- A content hash is computed for every conversation. Duplicates are removed globally across all sources to eliminate redundancy.
4. **Language Verification** -- For sources where language metadata is available, strict language-code matching is enforced. For sources requiring additional verification, automated language detection is applied to conversation content.
5. **Identity Decontamination** -- A multi-stage identity decontamination pipeline is applied to remove references to other AI systems and ensure consistent model identity throughout the dataset.

---

## Benchmark Impact

The model trained on this dataset achieved the following results among fully open models (all weights, data, and training details publicly released):

| Category | Score | Ranking |
|---|---|---|
| Overall | 36.9 | #1 Fully Open |
| Instruction Following | 45.1 | #1 Fully Open |
| Safety (M-ALERT) | 39.5 | #1 Fully Open |
| Reasoning | 31.0 | #2 Fully Open |
| Question Answering | 33.1 | #2 Fully Open |

For detailed per-benchmark results and comparisons with open weight models, see the [Villanova-2B-2603 model card](https://huggingface.co/VillanovaAI/Villanova-2B-2603).

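
Returning to the token-length filtering and hash-based deduplication steps of the quality control pipeline: the card does not publish the actual implementation, so the following is only a minimal sketch of how such steps might look. The SHA-256 hashing scheme, the whitespace normalization, the helper names (`conversation_hash`, `filter_and_dedup`, `whitespace_token_count`), and the whitespace-based token counter are all assumptions made for illustration; the real pipeline measures length with the target model tokenizer.

```python
import hashlib
import json

MAX_TOKENS = 4096  # limit stated in the dataset card


def conversation_hash(messages):
    """Content hash over role/content pairs (assumed scheme, not the official one)."""
    canonical = json.dumps(
        [(m["role"], m["content"].strip()) for m in messages],
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_and_dedup(examples, count_tokens, max_tokens=MAX_TOKENS):
    """Yield examples within the token limit, skipping repeated conversations."""
    seen = set()
    for example in examples:
        # Token-length filtering: drop conversations over the limit.
        if count_tokens(example["messages"]) > max_tokens:
            continue
        # Hash-based deduplication: one shared set across all sources.
        digest = conversation_hash(example["messages"])
        if digest in seen:
            continue
        seen.add(digest)
        yield example


def whitespace_token_count(messages):
    """Stand-in token counter (hypothetical); swap in a real tokenizer."""
    return sum(len(m["content"].split()) for m in messages)


sample = [
    {"messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
    # Same conversation with trailing whitespace: treated as a duplicate.
    {"messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris. "},
    ]},
    # Too long for the toy limit used below.
    {"messages": [
        {"role": "user", "content": "Tell me a long story about a dragon and a knight"},
    ]},
]

# A tiny limit of 8 "tokens" keeps the demonstration readable.
kept = list(filter_and_dedup(sample, whitespace_token_count, max_tokens=8))
print(len(kept))
```

Keeping the `seen` set outside any per-source loop is what makes the deduplication global rather than per-source, which matches the card's description.
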

---

## Related Resources

- [Villanova-2B-2603](https://huggingface.co/VillanovaAI/Villanova-2B-2603) -- Instruction-tuned model trained on this dataset
- [Villanova-2B-2603-GGUF](https://huggingface.co/VillanovaAI/Villanova-2B-2603-GGUF) -- Quantized version for efficient deployment
- [Villanova-2B-VL-2603](https://huggingface.co/VillanovaAI/Villanova-2B-VL-2603) -- Vision-Language variant
- [Villanova-2B-Base-2603](https://huggingface.co/VillanovaAI/Villanova-2B-Base-2603) -- Base model (4.4T tokens, pretrained from scratch)

---

## License

The Villanova-SFT-2603 dataset curation, processing pipeline, and all original contributions by Villanova.AI are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

This dataset incorporates data from multiple third-party open-source datasets, each of which is governed by its own original license. Users are responsible for reviewing and complying with the licensing terms of each constituent source. The licenses of the included sources are as follows:

| Source | License |
|---|---|
| CohereLabs/aya_collection | Apache 2.0 |
| SmolTalk (Multilingual) | Apache 2.0 |
| VillanovaAI/multi_smoltalk_summarize_no_think | Apache 2.0 |
| VillanovaAI/Multi-Persona-IF | ODC-BY-1.0 |
| VillanovaAI/Multi-SciRIFF | ODC-BY |
| VillanovaAI/multi-oasst2 | Apache 2.0 |
| VillanovaAI/multi-smol_rewrite | Apache 2.0 |
| VillanovaAI/multi-python-alpaca | Apache 2.0 |
| VillanovaAI/multi-dialogues | ODC-BY-1.0 |
| VillanovaAI/Multi-FLAN-NIv2 | CC BY 4.0 |
| VillanovaAI/Multi-FLAN-CoT | CC BY 4.0 |
| openai/gsm8k | MIT |
| VillanovaAI/multi-SelfCodeAlign | ODC-BY-1.0 |
| VillanovaAI/multi-dolly-15k | CC-BY-SA-3.0 |
| VillanovaAI/Multi-TableGPT | MIT |
| projecte-aina/RAG_Multilingual | CC-BY-SA-4.0 |
| VillanovaAI/Multi-FLAN-P3 | CC BY 4.0 |
| projecte-aina/MentorES | CC-BY-4.0 |
| VillanovaAI/Multi-FLAN-Flan2021 | CC BY 4.0 |
| VillanovaAI/multi-aya_redteaming | Apache 2.0 |
| VillanovaAI/Multi-Safety-Dataset | ODC-BY-1.0 |
| VillanovaAI/aya-masakhanews-en-fr | AFL-3.0 |
| VillanovaAI/Multi-AdvBench | MIT |
| VillanovaAI/Villanova-hard-coded | CC BY 4.0 |
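
Because each source carries its own license, it may be useful to restrict the data to sources whose licenses fit a given use case. The sketch below transcribes the license table into a mapping and filters examples on the `source_data` field. It assumes that `source_data` values match the source names exactly as written in the table, which the card only confirms for `VillanovaAI/multi-dolly-15k`; the label `SmolTalk (Multilingual)` in particular may not be the literal identifier. The helper names are hypothetical.

```python
# License per source, transcribed verbatim from the license table.
# Spelling variants (e.g. "CC BY 4.0" vs "CC-BY-4.0") are kept as written,
# so exact string matching must list every variant of interest.
SOURCE_LICENSES = {
    "CohereLabs/aya_collection": "Apache 2.0",
    "SmolTalk (Multilingual)": "Apache 2.0",
    "VillanovaAI/multi_smoltalk_summarize_no_think": "Apache 2.0",
    "VillanovaAI/Multi-Persona-IF": "ODC-BY-1.0",
    "VillanovaAI/Multi-SciRIFF": "ODC-BY",
    "VillanovaAI/multi-oasst2": "Apache 2.0",
    "VillanovaAI/multi-smol_rewrite": "Apache 2.0",
    "VillanovaAI/multi-python-alpaca": "Apache 2.0",
    "VillanovaAI/multi-dialogues": "ODC-BY-1.0",
    "VillanovaAI/Multi-FLAN-NIv2": "CC BY 4.0",
    "VillanovaAI/Multi-FLAN-CoT": "CC BY 4.0",
    "openai/gsm8k": "MIT",
    "VillanovaAI/multi-SelfCodeAlign": "ODC-BY-1.0",
    "VillanovaAI/multi-dolly-15k": "CC-BY-SA-3.0",
    "VillanovaAI/Multi-TableGPT": "MIT",
    "projecte-aina/RAG_Multilingual": "CC-BY-SA-4.0",
    "VillanovaAI/Multi-FLAN-P3": "CC BY 4.0",
    "projecte-aina/MentorES": "CC-BY-4.0",
    "VillanovaAI/Multi-FLAN-Flan2021": "CC BY 4.0",
    "VillanovaAI/multi-aya_redteaming": "Apache 2.0",
    "VillanovaAI/Multi-Safety-Dataset": "ODC-BY-1.0",
    "VillanovaAI/aya-masakhanews-en-fr": "AFL-3.0",
    "VillanovaAI/Multi-AdvBench": "MIT",
    "VillanovaAI/Villanova-hard-coded": "CC BY 4.0",
}


def sources_with_license(permitted):
    """Return the set of source names whose license string is in `permitted`."""
    return {src for src, lic in SOURCE_LICENSES.items() if lic in permitted}


def filter_by_license(examples, permitted):
    """Keep only examples whose `source_data` maps to a permitted license."""
    allowed = sources_with_license(permitted)
    return [ex for ex in examples if ex["source_data"] in allowed]


# Toy rows standing in for dataset examples (only the relevant field shown).
examples = [
    {"source_data": "openai/gsm8k"},
    {"source_data": "VillanovaAI/multi-dolly-15k"},
]
permissive = filter_by_license(examples, {"Apache 2.0", "MIT"})
print([ex["source_data"] for ex in permissive])
```

With the dataset loaded through `load_dataset`, the same set can be passed to `dataset.filter(lambda x: x["source_data"] in allowed)`, following the filtering pattern shown earlier in the card. This is a convenience for narrowing the data, not a substitute for reading each source's license terms.
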

Provider:
VillanovaAI