Download link:

https://modelscope.cn/datasets/allenai/llama-3-tulu-v2-sft-subset



Resource overview:

Recreating some subsets of [Tulu 2 SFT Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) with Llama 3.1 405B completions from [SambaNova](https://sambanova.ai/).

The following subsets (all originally generated with GPT-4) are covered:

* GPT4-Alpaca
* Open Orca
* Code Alpaca

Note that the raw dataset includes all the Tulu prompts, but some entries were empty due to max_length issues on the API. We filtered with

```
is_any_content_empty = lambda obj_list: any(not item["content"] for item in obj_list)
```

Here's a basic script for generating this data:

```
import os
import jsonlines
import requests
import json
from datasets import load_dataset

ds = load_dataset("allenai/tulu-v2-sft-mixture")
ds = ds["train"]

outfile = jsonlines.open('tulu2_subset_regenerations.jsonl', 'w')

# code_alpaca open_orca gpt4_alpaca oasst1
counter = 0
for row in ds:
    # Only regenerate the three GPT-4-derived subsets.
    if row["dataset"] == "code_alpaca" or row["dataset"] == "open_orca" or row["dataset"] == "gpt4_alpaca":
        counter += 1
        # Flag multi-turn rows; only the first user turn is used below.
        if len(row["messages"]) > 2:
            print(row)
        dialogue = row["messages"]
        prompt = dialogue[0]["content"]
        # Note: computed but not passed to the API; the payload uses a fixed 4096.
        max_tokens = 4000 - len(prompt.split())

        messages = []
        messages.append({"role": "system", "content": "You are a highly efficient assistant. You are to be as fair and accurate"})
        messages.append(
            {
                "role": "user",
                "content": prompt
            }
        )
        payload = {"messages": messages, "max_tokens": 4096, "stop": ["[INST", "[INST]", "[/INST]", "[/INST]"], "model": "llama3-405b", "stream": "true"}

        key = "your key"
        url = "https://7swuc05a91h3zixk.snova.ai/v1/chat/completions"
        headers = {"Authorization": f"Basic {key}", "Content-Type": "application/json"}
        post_response = requests.post(url, json=payload, headers=headers, stream=True)

        # Accumulate the streamed completion from the "data: " lines.
        response_text = ""
        for line in post_response.iter_lines():
            if line.startswith(b"data: "):
                data_str = line.decode("utf-8")[6:]
                try:
                    line_json = json.loads(data_str)
                    if "choices" in line_json and "content" in line_json["choices"][0]["delta"]:
                        try:
                            response_text += line_json["choices"][0]["delta"]["content"]
                        except:
                            # Drop into the debugger on an unexpected chunk.
                            breakpoint()
                except json.JSONDecodeError as e:
                    # Skip lines that are not valid JSON.
                    pass

        # Write one record per regenerated prompt.
        out_dict = {}
        out_dict["dataset"] = row["dataset"]
        out_dict["id"] = row["id"]
        out_dict["regeneration_model"] = "llama3-405b"
        out_dict["messages"] = []
        out_dict["messages"].append({"role": "user", "content": prompt})
        out_dict["messages"].append({"role": "assistant", "content": response_text})
        outfile.write(out_dict)
        print(response_text)

print(counter)
outfile.close()
```
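The filtering step is stated only as a predicate, so the following is a minimal sketch of how it might be applied to the regenerated JSONL file. The helper names `keep_complete_rows` and `filter_jsonl` and the file paths are hypothetical and do not come from the original card; only the `is_any_content_empty` lambda is taken from it. The sketch assumes each row has a `messages` list of dicts with a `content` key, as written by the generation script.

```python
import json

# Predicate from the dataset card: True if any message has empty content.
is_any_content_empty = lambda obj_list: any(not item["content"] for item in obj_list)


def keep_complete_rows(rows):
    """Return only the rows whose messages all have non-empty content."""
    return [row for row in rows if not is_any_content_empty(row["messages"])]


def filter_jsonl(in_path, out_path):
    """Hypothetical helper: filter a JSONL file (one JSON object per line)."""
    with open(in_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    kept = keep_complete_rows(rows)
    with open(out_path, "w", encoding="utf-8") as f:
        for row in kept:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    # Report how many rows were read and how many survived the filter.
    return len(rows), len(kept)
```

For example, `filter_jsonl("tulu2_subset_regenerations.jsonl", "filtered.jsonl")` would write only the rows with a non-empty prompt and completion; the output file name here is an assumption.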

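The streaming loop in the generation script can be isolated into a small function, which makes it easier to test without calling the API. This is a sketch only: `collect_stream_text` is a hypothetical name, and it assumes the endpoint emits lines prefixed with `data: ` whose JSON carries text under `choices[0].delta.content`, as the script implies. Instead of dropping into a debugger, it skips any chunk that does not match that shape.

```python
import json


def collect_stream_text(lines):
    """Concatenate the delta content from an iterable of raw byte lines."""
    parts = []
    for line in lines:
        if not line.startswith(b"data: "):
            continue  # ignore blank keep-alive lines and other fields
        data_str = line.decode("utf-8")[len("data: "):]
        try:
            chunk = json.loads(data_str)
        except json.JSONDecodeError:
            continue  # non-JSON payload, e.g. a terminator line if one is sent
        if not isinstance(chunk, dict):
            continue
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            parts.append(content)
    return "".join(parts)
```

In the script, this would be called as `response_text = collect_stream_text(post_response.iter_lines())`.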

Use cases: