Shuzhi Youwei: High-Quality, Authoritative Chinese Newspaper Corpus for Large Models

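Name: Shuzhi Youwei: High-Quality, Authoritative Chinese Newspaper Corpus for Large Models
Creator: Beijing Shuzhi Youwei Technology Co., Ltd.
License: No description available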

Indexed by the Beijing International Big Data Exchange on 2025-04-17

Download link:

https://webs.bjidex.com/sys-bsc-home/#/bscConsole/tradingMarket/detail?id=4325

Resource description:

"Shuzhi Youwei: High-Quality, Authoritative Chinese Newspaper Corpus for Large Models" is a forward-looking product that combines reliable content from authoritative newspapers with large-model, digital-human, and AIGC technologies. It aims to supply high-quality training corpora to organizations that train general-purpose large models, and to offer data intelligence services for efficient reading, retrieval, and analysis to university and public libraries, research institutions, competitive intelligence analysts, and policy researchers.

Product background:

Artificial intelligence is an important engine for developing new quality productive forces and a strategic technology that will shape the future. Large-model development follows scaling laws: the larger and higher-quality the resources such as computing power and data, the more capable the model. High-quality datasets are therefore critical to large-model companies. In response to this demand, Beijing Shuzhi Youwei Technology Co., Ltd. has drawn on its abundant data resources and leading AI technology to explore data acquisition, accumulation, management, and mining.

After years of technical exploration and data accumulation, the large-model corpus officially launched on April 23, 2024. It contains high-quality data from authoritative newspapers that has passed the three-review, three-proofread editorial process (三审三校). The content consists mainly of party and government news, factual current-affairs reporting, and announcements published in newspapers. The data sources include no content that requires third-party authorization, and every item can be traced to its source.

The corpus is precisely indexed. It spans nearly 1,000 newspaper titles and reaches back more than 30 years. The full dataset totals 100 TB, of which 1 TB is plain text.

Provider:

Beijing Shuzhi Youwei Technology Co., Ltd.

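Collected summary

Dataset introduction

Background and challenges

Background overview

The corpus contains party and government news and factual news reporting from nearly 1,000 authoritative newspapers, spanning more than 30 years. It totals 100 TB (1 TB of plain text), and all content has passed the three-review, three-proofread process and been precisely indexed. It primarily serves large-model training organizations, research institutions, and policy researchers, providing traceable, high-quality Chinese-language corpus data.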

The content above was collected and summarized by 遇见数据集.