Tele-AI/TeleChat-PTD

Name: Tele-AI/TeleChat-PTD
Creator: Tele-AI
Published: 2024-03-20 03:10:49
License: No description provided

Hugging Face · Updated 2024-03-20 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Tele-AI/TeleChat-PTD

Resource overview:

---
license: apache-2.0
viewer: false
---

<div align="center">
<h1>
TeleChat Pretraining Dataset (TeleChat-PTD)
</h1>
</div>

<p align="center">
🤗 <a href="https://huggingface.co/Tele-AI" target="_blank">Hugging Face</a> • 🏔 <a href="" target="_blank">MindSpore</a> • 🦉 <a href="https://github.com/Tele-AI/Telechat" target="_blank">github</a> • 🐾 <a href="https://gitee.com/Tele-AI/tele-chat" target="_blank">gitee</a> • 💬 <a href="https://github.com/Tele-AI/Telechat/blob/master/images/wechat.jpg" target="_blank">WeChat</a>
</p>

<p align="center">
<a href="https://arxiv.org/abs/2401.03804" target="_blank">Tech Report</a>
</p>

# Dataset Introduction

TeleChat-PTD is a comprehensive, large-scale Chinese dataset extracted from the pretraining corpus of **TeleChat**, the Xingchen (星辰) large language model from China Telecom. The data comes mainly from web pages, books, and official media. We filtered it with a combination of rules and models and applied similarity-based deduplication, to extract data of the highest quality we could.

TeleChat-PTD publicly releases approximately 270 million records of pure Chinese text. The original size is about 1 TB, or 480 GB after compression, across 189 files. Other redundant information has already been removed from the dataset.

# Data Download

Hugging Face download link: [Dataset Download](https://huggingface.co/datasets/Tele-AI/TeleChat-PTD)

China Telecom Cloud Disk (天翼云盘) download link: [Dataset Download](https://cloud.189.cn/t/ia2QbaVzYf6z) (access code: pkg8)

# Data Format

The data is in jsonl format, with a single field:

- `data`: one processed pretraining sample

# Data Cleaning

The data cleaning workflow consists of four main steps: rule-based screening and cleaning, deduplication, high-quality data screening, and data security processing.

- Rule-based screening applies general and heuristic rules, such as filtering on text length.
- Deduplication uses similarity-based deduplication to remove overly similar or repeated data.
- High-quality data screening uses models such as BERT and GPT2 to score the data and select high-quality samples.
- Data security processing identifies and removes harmful data.

# Declaration, License, and Citation

## Declaration

We hereby declare that the TeleChat model and its derivative models must not be used for any activity that endangers national or social security or violates the law. We also require users not to use the TeleChat model for Internet services that have not undergone security review and filing. We hope that all users abide by these principles so that technological development proceeds in a legal and compliant environment.

We have done our best to ensure the compliance of the data used in model training. However, despite these efforts, the complexity of the model and the data means that unforeseen problems may still exist. We therefore assume no liability for any issue arising from the use of the open-source TeleChat model, including but not limited to data security issues, public opinion risks, or any risks and problems caused by the model being misled, abused, disseminated, or improperly used.

## License

Community use of the TeleChat model must comply with the [TeleChat Model Community License Agreement](./TeleChat%E6%A8%A1%E5%9E%8B%E7%A4%BE%E5%8C%BA%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf). The TeleChat model supports commercial use. If you plan to use the TeleChat model or its derivatives for commercial purposes, you need to submit the application materials required by the TeleChat Model Community License Agreement to the contact email tele_ai@chinatelecom.cn. Once the review is passed, you will be granted a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable commercial copyright license.

## Citation

If you wish to cite our work, please use the following reference:

```
@misc{wang2024telechat,
      title={TeleChat Technical Report},
      author={Zihan Wang and Xinzhang Liu and Shixuan Liu and Yitong Yao and Yuyao Huang and Zhongjiang He and Xuelong Li and Yongxiang Li and Zhonghao Che and Zhaoxi Zhang and Yan Wang and Xin Wang and Luwen Pu and Huihan Xu and Ruiyu Fang and Yu Zhao and Jie Zhang and Xiaomeng Huang and Zhilong Lu and Jiaxin Peng and Wenjun Zheng and Shiquan Wang and Bingkai Yang and Xuewei He and Zhuoru Jiang and Qiyi Xie and Yanhan Zhang and Zhongqiu Li and Lingling Shi and Weiwei Fu and Yin Zhang and Zilu Huang and Sishi Xiong and Yuxiang Zhang and Chao Wang and Shuangyong Song},
      year={2024},
      eprint={2401.03804},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Provider:

Tele-AI

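Collected Summary

Dataset Introduction

Construction

TeleChat-PTD is built from the pretraining corpus of TeleChat, China Telecom's Xingchen large language model. High-quality Chinese text is extracted from web pages, books, official media, and similar sources through a two-part screen that combines rules with model-based scoring. Cleaning covers rule-based screening, similarity deduplication, high-quality data screening, and data security processing, with the aim of keeping the dataset clean and usable.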
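The first two cleaning stages can be illustrated with a minimal Python sketch. The actual rules, thresholds, and similarity method used for TeleChat-PTD are not given here, so everything below is an assumption for illustration: the length threshold, the character-trigram shingles, the Jaccard cutoff, and the function names `passes_rules`, `shingles`, `jaccard`, and `clean` are all hypothetical. The model-based quality scoring and the safety filtering are omitted because they depend on models and policies that are not described.

```python
from typing import Iterable, Iterator, List, Set


def passes_rules(text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic rule: require enough non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def shingles(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of the text with whitespace removed."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact}
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def clean(texts: Iterable[str], min_chars: int = 20,
          max_sim: float = 0.8) -> Iterator[str]:
    """Yield texts that pass the length rule and are not near-duplicates
    of any text already kept (thresholds are illustrative assumptions)."""
    kept: List[Set[str]] = []
    for text in texts:
        if not passes_rules(text, min_chars):
            continue
        sig = shingles(text)
        if any(jaccard(sig, other) > max_sim for other in kept):
            continue
        kept.append(sig)
        yield text
```

This version compares each candidate against every kept text, which is quadratic in the number of records. At corpus scale, approximate methods such as MinHash with locality-sensitive hashing are the usual replacement for the pairwise loop.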

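Features

TeleChat-PTD is notable for its scale and quality: about 270 million records of pure Chinese text, roughly 1 TB raw and 480 GB compressed, spread across 189 files. The data has been filtered and deduplicated to favor high-quality, non-repetitive samples, which makes it suitable for a range of natural language processing tasks.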
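As a rough sanity check on these figures, the short calculation below derives a few averages. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and treats the approximate counts as exact; neither assumption is stated in the source, so the results are order-of-magnitude estimates only.

```python
# Published approximate figures, treated as exact for this estimate.
RECORDS = 270_000_000        # about 270 million records
RAW_BYTES = 1e12             # about 1 TB raw (decimal units assumed)
COMPRESSED_BYTES = 480e9     # about 480 GB compressed (decimal units assumed)
FILES = 189

avg_raw_bytes_per_record = RAW_BYTES / RECORDS
avg_compressed_gb_per_file = COMPRESSED_BYTES / FILES / 1e9
compression_ratio = RAW_BYTES / COMPRESSED_BYTES
avg_records_per_file = RECORDS / FILES

print(f"raw bytes per record:   {avg_raw_bytes_per_record:.0f}")
print(f"compressed GB per file: {avg_compressed_gb_per_file:.2f}")
print(f"compression ratio:      {compression_ratio:.2f}")
print(f"records per file:       {avg_records_per_file:.0f}")
```

Under these assumptions an average record is about 3.7 KB of raw text, each compressed file averages about 2.5 GB, and the compression ratio is a little over 2:1.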

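Usage

TeleChat-PTD is provided in jsonl format; each record contains a single field, 'data', holding the processed pretraining text. The dataset can be downloaded from Hugging Face or from China Telecom Cloud Disk (天翼云盘). Use must comply with the TeleChat Model Community License Agreement, and citations should follow the reference format provided.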
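A minimal Python sketch of reading this format follows. It relies only on what is stated above: one JSON object per line with a single 'data' field. The function name `iter_texts` is hypothetical, and because the compression format of the released files is not specified here, the sketch assumes a file has already been decompressed to plain jsonl.

```python
import json
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the 'data' field from each non-empty jsonl line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["data"]
```

An open text file can be passed directly, for example `with open(path, encoding="utf-8") as f: for text in iter_texts(f): ...`, where `path` points to a decompressed shard. Iterating lazily avoids loading a multi-gigabyte file into memory.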

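Background and Challenges

Background Overview

TeleChat-PTD is a comprehensive, large-scale Chinese dataset extracted from the pretraining corpus of TeleChat, China Telecom's Xingchen large language model, and was created in 2024. It was developed within the Xingchen large-model project by a team led by principal researchers including Zihan Wang and Xinzhang Liu, with the goal of providing high-quality pretraining data for Chinese natural language processing. The dataset helps address the scarcity of openly available Chinese data, and its combined rule-based and model-based filtering is intended to ensure data quality and safety, which makes it a useful contribution to Chinese NLP.

Current Challenges

Building TeleChat-PTD involved several challenges. First, the data comes from many sources, including web pages, books, and official media, and selecting high-quality text from such a large volume is a complex problem. Second, deduplication and similarity processing require efficient algorithms to keep the data both unique and diverse. Third, data security processing and compliance checking are demanding, particularly in ensuring that the data contains no sensitive information and complies with applicable laws and regulations. These challenges affect the quality of the dataset and also raise the bar for downstream research and applications.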

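Common Scenarios

Classic Use Cases

In natural language processing, TeleChat-PTD is used mainly for building pretrained models. Its Chinese text, drawn from web pages, books, and official media, makes it a useful foundation for training large language models. With TeleChat-PTD, researchers and developers can improve a model's Chinese understanding and generation, which in turn can benefit tasks such as text classification, sentiment analysis, and machine translation.

Derived and Related Work

Researchers have carried out several lines of work based on TeleChat-PTD. For example, some work has used the dataset to train Chinese text classification models and reported improved classification accuracy. Other work has explored using TeleChat-PTD to pretrain cross-lingual models in order to strengthen performance in multilingual settings. Further work focuses on optimizing and extending the dataset itself, aiming to improve data quality and diversity in support of a wider range of natural language processing tasks.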

Recent Research on the Dataset