IndustryCorpus_mathematics | Industry dataset | Mathematics dataset

ModelScope Community | Updated 2025-10-09 | Indexed 2024-09-14

Industry dataset

Mathematics

Download link:

https://modelscope.cn/datasets/BAAI/IndustryCorpus_mathematics

Description:

[[Chinese homepage]](README_ZH.md)

Industry models play a crucial role in driving enterprise intelligence transformation and innovative development. High-quality industry data is key to improving the performance of large models and realizing industry applications. However, datasets currently used for industry model training generally suffer from insufficient data volume, low quality, and a lack of domain expertise.

To address these problems, we constructed and applied 22 industry data processing operators to clean and filter more than 100TB of open-source datasets, including WuDaoCorpora, BAAI-CCI, redpajama, and SkyPile-150B, yielding 3.4TB of high-quality, multi-industry classified Chinese and English pre-training data. The filtered data consists of 1TB of Chinese data and 2.4TB of English data. To make the data easier to use, we annotated the Chinese data with 12 types of labels, including alphanumeric ratio, average line length, language confidence score, maximum line length, and perplexity.

Furthermore, to validate the dataset's performance, we conducted continued pre-training, SFT, and DPO training on a medical industry demonstration model. The results showed a 20% improvement in objective performance and a subjective win rate of 82%.

Industry categories: 18 categories including medical, education, literature, finance, travel, law, sports, automotive, news, etc.

Rule-based filtering: Traditional Chinese conversion, email removal, IP address removal, link removal, Unicode repair, etc.

Chinese data labels: alphanumeric ratio, average line length, language confidence score, maximum line length, perplexity, toxic character ratio, etc.

Model-based filtering: industry classification language model with 80% accuracy

Data deduplication: MinHash document-level deduplication

Data size: 1TB Chinese, 2.4TB English

Industry classification data size:

| Industry Category | Data Size (GB) | Industry Category | Data Size (GB) |
| :-------------------:|:----------------:|:-------------------:|:----------------:|
| Programming | 4.1 | Politics | 326.4 |
| Law | 274.6 | Mathematics | 5.9 |
| Education | 458.1 | Sports | 442 |
| Finance | 197.8 | Literature | 179.3 |
| Computer Science | 46.9 | News | 564.1 |
| Technology | 333.6 | Film & TV | 162.1 |
| Travel | 82.5 | Medicine | 189.4 |
| Agriculture | 41.6 | Automotive | 40.8 |
| Emotion | 31.7 | Artificial Intelligence | 5.6 |
| Total (GB) | 3386.5 | | |

To make downloading and use easier, we have split the large dataset into sub-datasets for 18 industries. This is the sub-dataset for the mathematics industry.

Data processing workflow:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6459c242abdbb77c4c6e1f8e/8okkYsiKvGcU_ssn--vpD.png)
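The description names rule-based filtering (email, IP address, and link removal), per-document labels (alphanumeric ratio, average and maximum line length), and MinHash document-level deduplication, but does not publish the operators themselves. The following is a minimal sketch of what such steps could look like, using only the Python standard library. Every regular expression, function name, and parameter here (`clean_text`, `quality_labels`, `num_perm=64`, 5-character shingles, the 0.8 threshold, and the reading of "alphanumeric ratio" as the share of ASCII letters and digits) is an illustrative assumption, not the dataset authors' implementation.

```python
import hashlib
import re

# Illustrative patterns only (assumptions); the dataset's real operators are not shown in the source.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def clean_text(text: str) -> str:
    """Remove emails, links, and IPv4 addresses, then collapse leftover runs of spaces."""
    for pattern in (EMAIL_RE, URL_RE, IPV4_RE):
        text = pattern.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def quality_labels(text: str) -> dict:
    """Compute three of the labels named in the description (definitions are assumed)."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    ascii_alnum = sum(ch.isascii() and ch.isalnum() for ch in text)
    return {
        "alnum_ratio": ascii_alnum / len(text) if text else 0.0,
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
    }


def shingles(text: str, k: int = 5) -> set:
    """Character k-grams, which work for both Chinese and English text."""
    text = re.sub(r"\s+", " ", text)
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list:
    """One minimum hash value per seeded hash function over the document's shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if every already-kept signature is less than `threshold` similar."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

For example, `deduplicate([clean_text(d) for d in docs])` would clean a small list of documents and drop near-duplicates. This sketch compares every signature against all kept ones, which is quadratic in the number of documents; at terabyte scale a pipeline would normally group signatures with locality-sensitive hashing bands so that only likely duplicates are compared.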

Provider:

maas

Created:

2024-09-12

