llama-3-tulu-v2-sft-subset
Source: ModelScope community (魔搭社区) · Updated 2025-08-08 · Indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/llama-3-tulu-v2-sft-subset
Description:
This dataset recreates some subsets of [Tulu 2 SFT Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) with Llama 3.1 405B completions from [SambaNova](https://sambanova.ai/).
The following subsets (all originally generated with GPT-4) were regenerated:
* GPT4-Alpaca
* Open Orca
* Code Alpaca
Note that the raw dataset includes all the Tulu prompts, but some entries ended up with empty content because of max_length issues on the API.
We filtered these out with:
```
is_any_content_empty = lambda obj_list: any(not item["content"] for item in obj_list)
```
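As a minimal sketch of how this predicate could be applied, the snippet below runs it over a list of records shaped like the output of the generation script (a `messages` list of `role`/`content` dicts). The sample records and the `kept` name are hypothetical and only for illustration; the predicate is the one shown above, written as a named function.

```python
# Same predicate as above, written as a named function.
def is_any_content_empty(obj_list):
    return any(not item["content"] for item in obj_list)

# Hypothetical records in the same shape the generation script writes out.
records = [
    {"id": "a", "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]},
    {"id": "b", "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": ""},  # empty completion
    ]},
]

# Keep only records in which every message has non-empty content.
kept = [r for r in records if not is_any_content_empty(r["messages"])]
print([r["id"] for r in kept])  # ['a']
```

A record is dropped if any message in it, user or assistant, has empty content.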
Here's a basic script for generating this data:
```
import json

import jsonlines
import requests
from datasets import load_dataset

ds = load_dataset("allenai/tulu-v2-sft-mixture")
ds = ds["train"]
outfile = jsonlines.open('tulu2_subset_regenerations.jsonl', 'w')

# code_alpaca open_orca gpt4_alpaca oasst1
counter = 0
for row in ds:
    if row["dataset"] == "code_alpaca" or row["dataset"] == "open_orca" or row["dataset"] == "gpt4_alpaca":
        counter += 1
        # Print multi-turn rows for inspection; only the first message is used as the prompt
        if len(row["messages"]) > 2:
            print(row)
        dialogue = row["messages"]
        prompt = dialogue[0]["content"]
        # Rough word-based budget; computed but not used (the request below sends a fixed 4096)
        max_tokens = 4000 - len(prompt.split())
        messages = []
        messages.append({"role": "system", "content": "You are a highly efficient assistant. You are to be as fair and accurate"})
        messages.append(
            {
                "role": "user",
                "content": prompt
            }
        )
        payload = {"messages": messages, "max_tokens": 4096, "stop": ["[INST", "[INST]", "[/INST]", "[/INST]"], "model": "llama3-405b", "stream": "true"}
        key = "your key"
        url = "https://7swuc05a91h3zixk.snova.ai/v1/chat/completions"
        headers = {"Authorization": f"Basic {key}", "Content-Type": "application/json"}
        post_response = requests.post(url, json=payload, headers=headers, stream=True)

        # Accumulate the streamed completion from the "data: " lines of the response
        response_text = ""
        for line in post_response.iter_lines():
            if line.startswith(b"data: "):
                data_str = line.decode("utf-8")[6:]
                try:
                    line_json = json.loads(data_str)
                    if "choices" in line_json and "content" in line_json["choices"][0]["delta"]:
                        try:
                            response_text += line_json["choices"][0]["delta"]["content"]
                        except Exception:
                            # Drop into the debugger on an unexpected chunk shape
                            breakpoint()
                except json.JSONDecodeError:
                    # Skip lines whose payload is not valid JSON
                    pass

        # Write one record per prompt: the user turn plus the regenerated assistant turn
        out_dict = {}
        out_dict["dataset"] = row["dataset"]
        out_dict["id"] = row["id"]
        out_dict["regeneration_model"] = "llama3-405b"
        out_dict["messages"] = []
        out_dict["messages"].append({"role": "user", "content": prompt})
        out_dict["messages"].append({"role": "assistant", "content": response_text})
        outfile.write(out_dict)
        print(response_text)

print(counter)
outfile.close()
```
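The streaming loop in the script can also be factored into a small helper that is testable without network access. The following is a sketch under assumptions, not the authors' code: the function name `collect_stream_text` and the sample byte lines are hypothetical, and the `[DONE]` sentinel line is only an assumed example of a non-JSON payload that the parser should skip.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from an iterable of raw 'data: ' byte lines."""
    parts = []
    for line in lines:
        if not line.startswith(b"data: "):
            continue  # blank keep-alive lines and other fields are ignored
        data_str = line.decode("utf-8")[len("data: "):]
        try:
            chunk = json.loads(data_str)
        except json.JSONDecodeError:
            continue  # payload is not JSON (assumed: e.g. a "[DONE]" sentinel)
        if not isinstance(chunk, dict):
            continue
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            parts.append(content)
    return "".join(parts)


# Hypothetical stream, shaped like the chunks the script above parses.
sample = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print(collect_stream_text(sample))  # Hello
```

In the script, this helper would be called as `collect_stream_text(post_response.iter_lines())`. It returns an empty string when no content chunks arrive, which is the case the empty-content filter above is meant to catch.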
Provider:
maas
Created:
2025-05-27



