
IndustryCorpus_programming

Source: ModelScope community (魔搭社区). Updated 2026-01-10. Indexed 2024-09-14.
Download link:
https://modelscope.cn/datasets/BAAI/IndustryCorpus_programming
Description:
[[Chinese homepage]](README_ZH.md)

Industry models play a crucial role in driving enterprise intelligence transformation and innovative development. High-quality industry data is key to improving the performance of large models and realizing industry applications. However, datasets currently used for industry model training generally suffer from insufficient data volume, low quality, and a lack of domain expertise.

To address these problems, we constructed and applied 22 industry data processing operators to clean and filter over 100TB of open-source datasets, including WuDaoCorpora, BAAI-CCI, redpajama, and SkyPile-150B. The result is 3.4TB of high-quality, industry-classified Chinese and English pre-training data: 1TB of Chinese data and 2.4TB of English data. To make the data easier to use, we annotated the Chinese data with 12 types of labels, including alphanumeric ratio, average line length, language confidence score, maximum line length, and perplexity.

Furthermore, to validate the dataset's performance, we conducted continued pre-training, SFT, and DPO training on a medical industry demonstration model. The results showed a 20% improvement in objective performance and a subjective win rate of 82%.

- Industry categories: 18 categories including medical, education, literature, finance, travel, law, sports, automotive, news, etc.
- Rule-based filtering: Traditional Chinese conversion, email removal, IP address removal, link removal, Unicode repair, etc.
- Chinese data labels: alphanumeric ratio, average line length, language confidence score, maximum line length, perplexity, toxic character ratio, etc.
- Model-based filtering: industry classification language model with 80% accuracy
- Data deduplication: MinHash document-level deduplication
- Data size: 1TB Chinese, 2.4TB English

Industry classification data size:

| Industry Category | Data Size (GB) | Industry Category | Data Size (GB) |
| :-------------------:|:----------------:|:-------------------:|:----------------:|
| Programming | 4.1 | Politics | 326.4 |
| Law | 274.6 | Mathematics | 5.9 |
| Education | 458.1 | Sports | 442 |
| Finance | 197.8 | Literature | 179.3 |
| Computer Science | 46.9 | News | 564.1 |
| Technology | 333.6 | Film & TV | 162.1 |
| Travel | 82.5 | Medicine | 189.4 |
| Agriculture | 41.6 | Automotive | 40.8 |
| Emotion | 31.7 | Artificial Intelligence | 5.6 |
| Total (GB) | 3386.5 | | |

To make downloading and use easier, we have split the full dataset into sub-datasets for the 18 industries. This one is the sub-dataset for the programming industry.

Data processing workflow:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6459c242abdbb77c4c6e1f8e/8okkYsiKvGcU_ssn--vpD.png)
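The description names rule-based filtering, per-document labels, and MinHash deduplication but does not publish the operators themselves. The following is a minimal Python sketch of what such steps might look like, not the dataset's actual implementation. The regular expressions, the label definitions, the function names (`clean_text`, `quality_labels`, `minhash_signature`, `estimated_jaccard`), and the parameters (64 hash seeds, 5-character shingles) are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical patterns: the real filtering rules are not given in the description.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def clean_text(text):
    """Remove links, email addresses, and IPv4 addresses (illustrative rules)."""
    for pattern in (URL_RE, EMAIL_RE, IP_RE):
        text = pattern.sub("", text)
    return text


def quality_labels(text):
    """Compute three of the named label types, using assumed definitions."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    # str.isalnum() also counts CJK characters; the dataset may define this differently.
    alnum = sum(ch.isalnum() for ch in text)
    return {
        "alnum_ratio": alnum / len(text) if text else 0.0,
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
    }


def minhash_signature(text, num_perm=64, shingle=5):
    """Build a MinHash signature over character shingles with seeded hashes."""
    shingles = {text[i:i + shingle] for i in range(max(1, len(text) - shingle + 1))}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a document-level deduplication pass, two documents whose estimated similarity exceeds some threshold would be treated as near-duplicates and one of them dropped. The threshold is not stated in the description, so any value used with this sketch is a guess.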

Provider:
maas
Created:
2024-09-12