Kiria-Nozan/TRIM-gpt-oss-120b-16-tasks
收藏Hugging Face2026-04-18 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/Kiria-Nozan/TRIM-gpt-oss-120b-16-tasks
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: default
data_files:
- split: train
path: "train/*.jsonl"
---
# TRIM Agent Reasoning Messages (HF Public Export)
This directory is a Hugging Face-friendly public export of the TRIM agent reasoning SFT data.
## What Is Included
- Provider: `vllm`
- Model: `gpt-oss-120b`
- Splits present: `train`
- Records in this export manifest: `12040`
- Tasks in this split: `AMES, BBB_Martins, Bioavailability_Ma, CYP2C9_Substrate_CarbonMangels, CYP2D6_Substrate_CarbonMangels, CYP3A4_Substrate_CarbonMangels, Carcinogens_Lagunin, ClinTox, DILI, HIA_Hou, PAMPA_NCATS, Pgp_Broccatelli, SARSCoV2_3CLPro_Diamond, SARSCoV2_Vitro_Touret, Skin_Reaction, hERG`
## Record Schema
Each JSONL line is one training example with these top-level fields:
- `schema_version`
- `task`
- `split`
- `sample_index`
- `sample_id`
- `smiles`
- `gt_label`
- `final_answer_option`
- `messages`
The `messages` field stores a tool-augmented chat transcript, including nested `tool_calls` and the assistant `thinking` text used in the original SFT export.
## Public Sanitization
- Local absolute `source_paths` have been removed from the sample records by default.
- Task-level export metadata is stored under `metadata/manifest.json`.
## Loading Example
```python
from datasets import load_dataset
ds = load_dataset(
"json",
data_files={"train": "train/*.jsonl"},
)
```
---
配置项:
- 配置名称:default
数据文件:
- 数据拆分:train
文件路径:"train/*.jsonl"
---
# TRIM 智能体推理消息(Hugging Face 公开导出版本)
本目录为适配Hugging Face生态的TRIM AI智能体(AI Agent)推理监督微调(Supervised Fine-Tuning,SFT)数据集公开导出文件。
## 包含内容
- 提供方:`vllm`
- 所用模型:`gpt-oss-120b`
- 现有数据拆分:`train`(训练集)
- 本次导出清单内的记录总数:`12040`
- 当前拆分包含的任务:`AMES, BBB_Martins, Bioavailability_Ma, CYP2C9_Substrate_CarbonMangels, CYP2D6_Substrate_CarbonMangels, CYP3A4_Substrate_CarbonMangels, Carcinogens_Lagunin, ClinTox, DILI, HIA_Hou, PAMPA_NCATS, Pgp_Broccatelli, SARSCoV2_3CLPro_Diamond, SARSCoV2_Vitro_Touret, Skin_Reaction, hERG`
## 记录架构
每条JSONL行对应一个训练样本,包含以下顶级字段:
- `schema_version`:架构版本
- `task`:任务名称
- `split`:数据拆分
- `sample_index`:样本索引
- `sample_id`:样本标识符
- `smiles`:简化分子线性输入规范(SMILES)
- `gt_label`:基准标签
- `final_answer_option`:最终答案选项
- `messages`:对话消息
其中`messages`字段用于存储工具增强型聊天会话记录,包含嵌套的`tool_calls`(工具调用)与原始SFT导出中使用的助手`thinking`(思考过程)文本。
## 公开导出脱敏处理
- 默认已移除样本记录中的本地绝对`source_paths`(源路径)信息。
- 任务级导出元数据存储于`metadata/manifest.json`文件中。
## 加载示例
python
from datasets import load_dataset
ds = load_dataset(
"json",
data_files={"train": "train/*.jsonl"},
)
提供机构:
Kiria-Nozan



