agentlans/thomas-yanxin-MT-SFT-ShareGPT-sample
Source: Hugging Face. Updated 2025-12-13; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/agentlans/thomas-yanxin-MT-SFT-ShareGPT-sample
Description:
This dataset is a sample of the [thomas-yanxin/MT-SFT-ShareGPT](https://huggingface.co/datasets/thomas-yanxin/MT-SFT-ShareGPT) dataset, with English and Chinese subsets. It contains three files: `train.jsonl` (a shuffled 1/10 sample of the original data), `EN.jsonl` (the English conversations from `train.jsonl`), and `ZH.jsonl` (the Chinese conversations from `train.jsonl`). Each row is one conversation: an optional system message followed by human and GPT turns. The columns of the original dataset are preserved, and columns missing from a row are filled with null. `EN_prompt_downsample` contains English prompts with difficulty ratings; its medium-difficulty rows are capped at 70,000 to reduce class imbalance. The data is split 80% training / 20% test for model performance evaluation and optimization. Known issue: in some of the math datasets, certain multi-digit numbers were redacted because they resemble phone or credit card numbers; exclude those rows if needed. License: Apache 2.0.
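As a rough illustration of working with these files, the sketch below reads a JSONL file one conversation per line and drops rows whose text contains a given marker, which is one way to exclude the redacted math rows. Several things here are assumptions, not facts from the dataset card: the column name `conversations`, the `value` key on each turn (the common ShareGPT layout), and the marker string, since the description does not say what the redaction placeholder looks like. Inspect a few rows first and adjust accordingly.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def conversation_text(row: dict, key: str = "conversations") -> str:
    """Join the text of all turns in a row.

    Assumption: turns live under `key` as a list of dicts with a "value"
    field. A missing or null column yields an empty string.
    """
    turns = row.get(key) or []
    return "\n".join(str(turn.get("value", "")) for turn in turns)


def drop_rows_containing(rows: Iterable[dict], marker: str) -> Iterator[dict]:
    """Yield only the rows whose conversation text does not contain `marker`.

    `marker` is a hypothetical redaction placeholder chosen by the caller.
    """
    for row in rows:
        if marker not in conversation_text(row):
            yield row


# Example usage (file name from the description; marker is hypothetical):
# clean = list(drop_rows_containing(read_jsonl("EN.jsonl"), "[REDACTED]"))
```

Keeping the marker as a parameter avoids hard-coding a placeholder that may not match what actually appears in the data.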
Provided by:
agentlans



