ChatParts_Dataset

Name: ChatParts_Dataset
Creator: maas
Published: 2025-11-09 22:22:43
License: No description provided

ModelScope community · Updated 2025-11-09 · Indexed 2024-09-28

Download link:

https://modelscope.cn/datasets/shellwork/ChatParts_Dataset


Resource description:

## 📚 Dataset Information

This dataset is utilized for fine-tuning the following models:

- [shellwork/ChatParts-llama3.1-8b](https://www.modelscope.cn/datasets/shellwork/ChatParts-llama3.1-8b)
- [shellwork/ChatParts-qwen2.5-14b](https://www.modelscope.cn/datasets/shellwork/ChatParts-qwen2.5-14b)

### 📁 File Structure

The dataset is organized as follows:

```plaintext
D:\ChatParts_Dataset
│
├── README.md
├── Original_data
│   ├── iGEM_competition_web.rar
│   ├── paper_txt_processed.rar
│   └── wiki_data.rar
└── Training_dataset
    ├── pt_txt.json
    ├── sft_eval.json
    └── sft_train.json
```

- **Original_data:**
  - `iGEM_competition_web.rar`: Contains raw text documents scraped from iGEM competition websites.
  - `paper_txt_processed.rar`: Contains processed text from over 1,000 synthetic biology review papers.
  - `wiki_data.rar`: Contains raw Wikipedia data related to synthetic biology.

  The original data was collected using web crawlers and subsequently filtered and manually curated to ensure quality. These raw `.txt` documents serve as the foundational learning passages for the model's pre-training phase. The consolidated and processed text can be found in the `pt_txt.json` file within the `Training_dataset` directory.

- **Training_dataset:**
  - `pt_txt.json`: Consolidated and preprocessed text passages used for the model's pre-training step.
  - `sft_train.json`: Contains over 180,000 question-answer pairs derived from the original documents, used for supervised fine-tuning (SFT) training.
  - `sft_eval.json`: Contains over 20,000 question-answer pairs reserved for evaluating the model post-training, maintaining a 9:1 data ratio compared to the training set.

  The `sft_train.json` and `sft_eval.json` files consist of meticulously organized question-answer pairs extracted from all available information in the original documents. These datasets facilitate the model's supervised instruction learning process, enabling it to generate accurate and contextually relevant responses.
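The stated 9:1 split can be checked after download with a short script. The sketch below is illustrative only: it assumes each SFT file holds a single top-level JSON array with one element per question-answer pair, which this description does not confirm, and the helper names `count_records` and `train_fraction` are hypothetical.

```python
import json
from pathlib import Path


def count_records(path):
    """Return the number of top-level records in a JSON file.

    Assumption: the file is one JSON array. The real layout of
    sft_train.json / sft_eval.json is not documented in this card.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return len(data)


def train_fraction(n_train, n_eval):
    """Fraction of all SFT pairs that belong to the training split."""
    total = n_train + n_eval
    if total == 0:
        raise ValueError("both splits are empty")
    return n_train / total
```

With the archive unpacked, `train_fraction(count_records("Training_dataset/sft_train.json"), count_records("Training_dataset/sft_eval.json"))` should come out close to 0.9 if the files match the description. Note that `json.load` reads the whole file into memory, which may be slow for files of this size.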
### 📄 License

This dataset is released under the **Apache License 2.0**. For more details, please refer to the [license information](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) in the repository.

## 🔗 Additional Resources

- **RAG Software:** Explore the full capabilities of our Retrieval-Augmented Generation software [here](https://github.com/shellwork/XJTLU-Software-RAG/tree/main).
- **Training Data:** Access and review the extensive training dataset [here](https://www.modelscope.cn/datasets/shellwork/ChatParts_Dataset).

---

Feel free to reach out through our GitHub repository for any questions, issues, or contributions related to this dataset.


Provider:

maas

Created:

2024-09-26
