
TeleChat-PTD

ModelScope community · Updated 2026-05-15 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/TeleAI/TeleChat-PTD
Resource description:

<div align="center">
<h1>
TeleChat Pre-training Dataset (TeleChat-PTD)
</h1>
</div>

<p align="center">
🤗 <a href="https://huggingface.co/Tele-AI" target="_blank">Hugging Face</a> • 🏔 <a href="" target="_blank">MindSpore</a> • 🦉 <a href="https://github.com/Tele-AI/Telechat" target="_blank">github</a> • 🐾 <a href="https://gitee.com/Tele-AI/tele-chat" target="_blank">gitee</a> • 💬 <a href="https://github.com/Tele-AI/Telechat/blob/master/images/wechat.jpg" target="_blank">WeChat</a>
</p>

<p align="center">
<a href="https://arxiv.org/abs/2401.03804" target="_blank">Tech Report</a>
</p>

# Data Introduction

TeleChat-PTD is a comprehensive, large-scale Chinese dataset extracted from the pre-training corpus of **TeleChat**, the large language model developed by China Telecom's Tele-AI team. The data comes mainly from web pages, books, and official media.

We filtered the data with a combination of rule-based and model-based methods and applied similarity-based deduplication, to retain as much high-quality data as possible.

The released TeleChat-PTD dataset contains approximately 270 million samples of pure Chinese text. The uncompressed size is about 1 TB; compressed, it is 480 GB, split across 189 files. Other redundant information has already been removed from the dataset.

# Data Download

Hugging Face download link: <a href="https://huggingface.co/datasets/Tele-AI/TeleChat-PTD" target="_blank">Download</a>

Tianyi Cloud Disk download link: <a href="https://cloud.189.cn/t/ia2QbaVzYf6z" target="_blank">Download</a> (access code: pkg8)

# Data Format

The dataset is stored in jsonl format. Each record has a single field, `data`, which holds one processed pre-training sample.

# Data Cleaning

The data cleaning workflow consists of four main steps: rule-based screening and cleaning, deduplication, high-quality data screening, and data security processing.

- Rule-based screening: Applies general and heuristic rules, such as filtering by text length.
- Deduplication: Applies similarity-based deduplication to remove overly similar and duplicated data.
- High-quality data screening: Uses models such as BERT and GPT2 to score the data and select high-quality samples.
- Data security processing: Identifies and removes harmful or inappropriate content.

# Disclaimer, License and Citation

## Disclaimer

We hereby declare that users shall not use the TeleChat model or its derivative models for any activities that endanger national or social security, or violate laws and regulations. We also require users not to deploy the TeleChat model on internet services that have not undergone security review and filing. We hope all users will abide by the above principles to ensure that technological development proceeds in a legal and compliant environment.

We have made every effort to ensure the compliance of the data used in the model training process. However, despite these efforts, the complexity of models and data means that some unforeseen issues may still exist. Therefore, we shall not be liable for any problems caused by the use of the open-source TeleChat model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly used.

## License

Community use of the TeleChat model must comply with the *<a href="./TeleChat模型社区许可协议.pdf" target="_blank">TeleChat Model Community License Agreement</a>*. The TeleChat model allows commercial use. If you intend to use the TeleChat model or its derivatives for commercial purposes, you need to submit the application materials required by the TeleChat Model Community License Agreement via the contact email: tele_ai@chinatelecom.cn. Upon approval, you will be granted a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable commercial copyright license.

## Citation

If you wish to cite our work, please use the following reference:

    @misc{wang2024telechat,
      title={TeleChat Technical Report},
      author={Zihan Wang and Xinzhang Liu and Shixuan Liu and Yitong Yao and Yuyao Huang and Zhongjiang He and Xuelong Li and Yongxiang Li and Zhonghao Che and Zhaoxi Zhang and Yan Wang and Xin Wang and Luwen Pu and Huihan Xu and Ruiyu Fang and Yu Zhao and Jie Zhang and Xiaomeng Huang and Zhilong Lu and Jiaxin Peng and Wenjun Zheng and Shiquan Wang and Bingkai Yang and Xuewei he and Zhuoru Jiang and Qiyi Xie and Yanhan Zhang and Zhongqiu Li and Lingling Shi and Weiwei Fu and Yin Zhang and Zilu Huang and Sishi Xiong and Yuxiang Zhang and Chao Wang and Shuangyong Song},
      year={2024},
      eprint={2401.03804},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }
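
As a practical note on the jsonl layout described under Data Format, a file can be read one record at a time. The following is a minimal sketch, assuming a file has already been downloaded and decompressed into a plain UTF-8 `.jsonl` file. The function name `iter_samples` and any file name passed to it are hypothetical and not part of the release; the compression format of the published files is not stated here, so decompression is left out.

```python
import json
from typing import Iterator


def iter_samples(path: str) -> Iterator[str]:
    """Yield the `data` string of each record in a JSONL file.

    Assumes one JSON object per line with a single `data` field,
    as described in the Data Format section.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            yield record["data"]
```

Because the function is a generator, a loop such as `for text in iter_samples("shard.jsonl"): ...` holds only one sample in memory at a time, which may matter given the roughly 1 TB uncompressed size.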
Provider:
maas
Created:
2024-03-27