McGill-NLP/llm2vec-gen-tulu-w-hard-negative

Name: McGill-NLP/llm2vec-gen-tulu-w-hard-negative
Creator: McGill-NLP
Published: 2026-03-02 16:38:19
License: No description available

Source: Hugging Face. Updated 2026-03-02. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/McGill-NLP/llm2vec-gen-tulu-w-hard-negative

Description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: negative_question
    dtype: string
  - name: negative_answer
    dtype: string
  splits:
  - name: original
    num_bytes: 2034528644
    num_examples: 805467
  - name: Qwen3_17B
    num_bytes: 3660909440
    num_examples: 805467
  - name: Qwen3_4B
    num_bytes: 3696200913
    num_examples: 805467
  - name: Qwen3_8B
    num_bytes: 3696485153
    num_examples: 805467
  download_size: 7133393300
  dataset_size: 13088124150
configs:
- config_name: default
  data_files:
  - split: original
    path: data/original-*
  - split: Qwen3_17B
    path: data/Qwen3_17B-*
  - split: Qwen3_4B
    path: data/Qwen3_4B-*
  - split: Qwen3_8B
    path: data/Qwen3_8B-*
---

# LLM2Vec-Gen

This dataset consists of generations based on the Tulu-3 SFT data ([allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)). The generations are intended for training LLM2Vec-Gen models, where they serve as the target output for queries. The `negative_question` entries in this dataset are generated by Gemini.

The dataset has several splits. Each split corresponds to responses generated by a specific LLM, e.g., Qwen3-4B. The `original` split contains the original Tulu-3 responses. Every split has 805,467 examples.

Each instance in split `M` typically includes:

- `id`: The original id.
- `question`: The original query.
- `answer`: The text generated by the model `M`.
- `negative_question`: The negative query generated by Gemini.
- `negative_answer`: The text generated by the model `M`.

## Usage

You can load the dataset using the Hugging Face `datasets` library.

```python
from datasets import load_dataset

dataset = load_dataset("McGill-NLP/llm2vec-gen-tulu-w-hard-negative", split="Qwen3_4B")
```
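
Because the full download is about 7.1 GB (`download_size: 7133393300`), it can help to inspect a few rows in streaming mode before committing to a full load. The following is a minimal sketch, not part of the dataset card or the LLM2Vec-Gen code: the helper names `to_pairs` and `iter_pairs` are hypothetical, and the grouping into a (query, answer) pair plus a (negative query, negative answer) pair is only one plausible way to arrange the five fields listed above.

```python
from typing import Dict, Iterator, Tuple

Pair = Tuple[str, str]

# Field names as declared in the dataset card's `features` list.
FIELDS = ("id", "question", "answer", "negative_question", "negative_answer")


def to_pairs(row: Dict[str, str]) -> Tuple[Pair, Pair]:
    """Group one row into ((question, answer), (negative_question, negative_answer)).

    Hypothetical helper: the pairing is an illustrative assumption,
    not a format prescribed by the dataset authors.
    """
    missing = [name for name in FIELDS if name not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    positive = (row["question"], row["answer"])
    negative = (row["negative_question"], row["negative_answer"])
    return positive, negative


def iter_pairs(split: str = "Qwen3_4B", limit: int = 3) -> Iterator[Tuple[Pair, Pair]]:
    """Stream the first `limit` rows of a split without downloading all of it."""
    # Imported lazily so that `to_pairs` is usable without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "McGill-NLP/llm2vec-gen-tulu-w-hard-negative",
        split=split,
        streaming=True,
    )
    for row in stream.take(limit):
        yield to_pairs(row)


if __name__ == "__main__":
    # Requires network access to the Hugging Face Hub.
    for positive, negative in iter_pairs():
        print(positive[0][:80], "|", negative[0][:80])
```

Passing a different split name, such as `original` or `Qwen3_8B`, to `iter_pairs` streams the responses from that split instead.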

Provided by:

McGill-NLP
