ChatParts_Dataset
Source: ModelScope community. Updated 2025-11-09; indexed 2024-09-28.
Download link:
https://modelscope.cn/datasets/shellwork/ChatParts_Dataset
Resource description:
## 📚 Dataset Information
This dataset is used to fine-tune the following models:
- [shellwork/ChatParts-llama3.1-8b](https://www.modelscope.cn/datasets/shellwork/ChatParts-llama3.1-8b)
- [shellwork/ChatParts-qwen2.5-14b](https://www.modelscope.cn/datasets/shellwork/ChatParts-qwen2.5-14b)
### 📁 File Structure
The dataset is organized as follows:
```plaintext
D:\ChatParts_Dataset
│
├── README.md
├── Original_data
│   ├── iGEM_competition_web.rar
│   ├── paper_txt_processed.rar
│   └── wiki_data.rar
└── Training_dataset
    ├── pt_txt.json
    ├── sft_eval.json
    └── sft_train.json
```
- **Original_data:**
  - `iGEM_competition_web.rar`: Contains raw text documents scraped from iGEM competition websites.
  - `paper_txt_processed.rar`: Contains processed text from over 1,000 synthetic biology review papers.
  - `wiki_data.rar`: Contains raw Wikipedia data related to synthetic biology.

  The original data was collected with web crawlers, then filtered and manually curated for quality. These raw `.txt` documents are the source passages for the model's pre-training phase. The consolidated, processed text is in `pt_txt.json` in the `Training_dataset` directory.
- **Training_dataset:**
  - `pt_txt.json`: Consolidated and preprocessed text passages used for the model's pre-training step.
  - `sft_train.json`: Contains over 180,000 question-answer pairs derived from the original documents, used for supervised fine-tuning (SFT).
  - `sft_eval.json`: Contains over 20,000 question-answer pairs held out for evaluating the model after training, giving a train-to-evaluation ratio of roughly 9:1.

  The `sft_train.json` and `sft_eval.json` files contain question-answer pairs extracted from the original documents. They are used for the model's supervised instruction tuning, so that it can generate accurate and contextually relevant responses.
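A quick sanity check of the split described above can be sketched in Python. This is a minimal sketch, not part of the dataset's own tooling: it assumes each SFT file holds a top-level JSON array of records (the README does not document the schema), and the helper names `load_records`, `train_eval_ratio`, and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a JSON file. ASSUMPTION: the file holds a top-level array of records."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return data


def train_eval_ratio(n_train, n_eval):
    """Return the train-to-eval ratio, e.g. 9.0 for a 9:1 split."""
    if n_eval <= 0:
        raise ValueError("evaluation set is empty")
    return n_train / n_eval


def summarize(training_dir):
    """Count records in sft_train.json and sft_eval.json and report their ratio."""
    root = Path(training_dir)
    n_train = len(load_records(root / "sft_train.json"))
    n_eval = len(load_records(root / "sft_eval.json"))
    return {
        "train": n_train,
        "eval": n_eval,
        "ratio": train_eval_ratio(n_train, n_eval),
    }
```

Calling `summarize("Training_dataset")` from the dataset root should report a ratio close to 9 if the files match the counts stated above. If the files turn out to use a different layout (for example JSON Lines), `load_records` would need to be adapted.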
### 📄 License
This dataset is released under the **Apache License 2.0**. For more details, please refer to the [license information](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) in the repository.
## 🔗 Additional Resources
- **RAG Software:** Explore the full capabilities of our Retrieval-Augmented Generation software [here](https://github.com/shellwork/XJTLU-Software-RAG/tree/main).
- **Training Data:** Access and review the extensive training dataset [here](https://www.modelscope.cn/datasets/shellwork/ChatParts_Dataset).
---
Feel free to reach out through our GitHub repository for any questions, issues, or contributions related to this dataset.
Provider: maas
Created: 2024-09-26



