
llama-3-tulu-v2-sft-subset

ModelScope community · Updated 2025-08-08 · Indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/llama-3-tulu-v2-sft-subset
Description:
This dataset recreates some subsets of [Tulu 2 SFT Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) with Llama 3.1 405B completions from [SambaNova](https://sambanova.ai/).

The following subsets (all generated with GPT-4) are covered:

* GPT4-Alpaca
* Open Orca
* Code Alpaca

Note that the raw dataset includes all the Tulu prompts, but some entries were empty due to max_length issues on the API. We filtered with

```
is_any_content_empty = lambda obj_list: any(not item["content"] for item in obj_list)
```

Here's a basic script for generating this data:

```
import json

import jsonlines
import requests
from datasets import load_dataset

ds = load_dataset("allenai/tulu-v2-sft-mixture")
ds = ds["train"]

# Subsets of the Tulu 2 mixture to regenerate.
subsets = {"code_alpaca", "open_orca", "gpt4_alpaca"}

key = "your key"
url = "https://7swuc05a91h3zixk.snova.ai/v1/chat/completions"
headers = {"Authorization": f"Basic {key}", "Content-Type": "application/json"}

counter = 0
with jsonlines.open("tulu2_subset_regenerations.jsonl", "w") as outfile:
    for row in ds:
        if row["dataset"] not in subsets:
            continue
        counter += 1

        # Print rows with more than two messages for inspection.
        if len(row["messages"]) > 2:
            print(row)

        # Use the content of the first message as the prompt.
        prompt = row["messages"][0]["content"]
        messages = [
            {
                "role": "system",
                "content": "You are a highly efficient assistant. You are to be as fair and accurate",
            },
            {"role": "user", "content": prompt},
        ]
        payload = {
            "messages": messages,
            "max_tokens": 4096,
            "stop": ["[INST", "[INST]", "[/INST]"],
            "model": "llama3-405b",
            "stream": True,
        }
        post_response = requests.post(url, json=payload, headers=headers, stream=True)

        # Accumulate the streamed completion from the "data: " lines.
        response_text = ""
        for line in post_response.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data_str = line.decode("utf-8")[6:]
            try:
                line_json = json.loads(data_str)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if line_json.get("choices"):
                content = line_json["choices"][0]["delta"].get("content")
                if content:
                    response_text += content

        out_dict = {
            "dataset": row["dataset"],
            "id": row["id"],
            "regeneration_model": "llama3-405b",
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response_text},
            ],
        }
        outfile.write(out_dict)
        print(response_text)

print(counter)
```
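As a rough illustration of how the `is_any_content_empty` filter quoted above might be applied to the generated file, here is a minimal sketch. It assumes the records follow the `out_dict` layout from the script (a `messages` list of dicts with a `content` key); the helper name `filter_jsonl` and the output path in the usage note are hypothetical and not part of the original pipeline.

```python
import json


def is_any_content_empty(obj_list):
    """Return True if any message in the list has empty content."""
    return any(not item["content"] for item in obj_list)


def filter_jsonl(in_path, out_path):
    """Copy records whose messages all have content; return (kept, dropped).

    Hypothetical helper: the original card only shows the predicate,
    not how it was wired into a filtering step.
    """
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            if is_any_content_empty(record["messages"]):
                dropped += 1
                continue
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept, dropped
```

For example, `filter_jsonl("tulu2_subset_regenerations.jsonl", "tulu2_subset_regenerations.filtered.jsonl")` would write only the rows where both the prompt and the regenerated completion are non-empty, and return how many rows were kept and dropped.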

Provider: maas
Created: 2025-05-27