
ChatParts_Dataset

ModelScope community · Updated 2025-11-09 · Indexed 2024-09-28
Download link:
https://modelscope.cn/datasets/shellwork/ChatParts_Dataset
Resource description:
## 📚 Dataset Information

This dataset is used to fine-tune the following models:

- [shellwork/ChatParts-llama3.1-8b](https://www.modelscope.cn/datasets/shellwork/ChatParts-llama3.1-8b)
- [shellwork/ChatParts-qwen2.5-14b](https://www.modelscope.cn/datasets/shellwork/ChatParts-qwen2.5-14b)

### 📁 File Structure

The dataset is organized as follows:

```plaintext
D:\ChatParts_Dataset
│
├── README.md
├── Original_data
│   ├── iGEM_competition_web.rar
│   ├── paper_txt_processed.rar
│   └── wiki_data.rar
└── Training_dataset
    ├── pt_txt.json
    ├── sft_eval.json
    └── sft_train.json
```

- **Original_data:**
  - `iGEM_competition_web.rar`: Contains raw text documents scraped from iGEM competition websites.
  - `paper_txt_processed.rar`: Contains processed text from over 1,000 synthetic biology review papers.
  - `wiki_data.rar`: Contains raw Wikipedia data related to synthetic biology.

  The original data was collected with web crawlers, then filtered and manually curated to ensure quality. These raw `.txt` documents serve as the foundational learning passages for the model's pre-training phase. The consolidated and processed text can be found in the `pt_txt.json` file within the `Training_dataset` directory.

- **Training_dataset:**
  - `pt_txt.json`: Consolidated and preprocessed text passages used for the model's pre-training step.
  - `sft_train.json`: Contains over 180,000 question-answer pairs derived from the original documents, used for supervised fine-tuning (SFT) training.
  - `sft_eval.json`: Contains over 20,000 question-answer pairs reserved for evaluating the model after training, so that the training and evaluation sets keep a 9:1 ratio.

  The `sft_train.json` and `sft_eval.json` files consist of carefully organized question-answer pairs extracted from the information available in the original documents. They support the model's supervised instruction learning, enabling it to generate accurate and contextually relevant responses.
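The 9:1 split described above can be checked locally once the files are downloaded. The following is a minimal sketch, assuming each SFT file holds a single top-level JSON array of records; the dataset card does not state the schema, so this layout and the helper names `count_records` and `split_ratio` are illustrative assumptions.

```python
import json
from pathlib import Path


def count_records(path):
    """Return the number of top-level records in a JSON file.

    Assumes the file holds a single JSON array; this layout is an
    assumption, not something the dataset card states.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return len(data)


def split_ratio(train_path, eval_path):
    """Return (train_count, eval_count, train_count / eval_count)."""
    train_count = count_records(train_path)
    eval_count = count_records(eval_path)
    if eval_count == 0:
        raise ValueError("evaluation file contains no records")
    return train_count, eval_count, train_count / eval_count
```

For example, `split_ratio("Training_dataset/sft_train.json", "Training_dataset/sft_eval.json")` should return a ratio close to 9 if the files match the counts given in the card. If the files turn out to use a different layout (for instance JSON Lines), `count_records` would need to be adapted.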
### 📄 License

This dataset is released under the **Apache License 2.0**. For more details, please refer to the [license information](https://github.com/shellwork/XJTLU-Software-RAG/tree/main) in the repository.

## 🔗 Additional Resources

- **RAG Software:** Explore the full capabilities of our Retrieval-Augmented Generation software [here](https://github.com/shellwork/XJTLU-Software-RAG/tree/main).
- **Training Data:** Access and review the extensive training dataset [here](https://www.modelscope.cn/datasets/shellwork/ChatParts_Dataset).

---

Feel free to reach out through our GitHub repository for any questions, issues, or contributions related to this dataset.

Provider:
maas
Created:
2024-09-26