China Data Element Governance Policy Corpus (2000–2025)

Source: DataCite Commons | Updated: 2026-04-03 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=8985cdb0e8cb42aba27b17ad7bc08252
Description:
This dataset collects three types of policy texts: (1) central policy documents, mainly programmatic documents issued by the Party Central Committee and the State Council; (2) State Council departmental documents, mainly guiding policies and opinions issued by ministries and commissions of the State Council on data element governance and related industries; and (3) provincial and municipal policy documents, issued by provincial and municipal people's governments and their competent departments, mainly to guide local data element development.

To ensure comprehensive and authoritative coverage, policy texts were obtained primarily from the State Council policy document library on the portal of the Central People's Government of the People's Republic of China (https://sousuo.www.gov.cn/zcwjk/policyDocumentLibrary) and the National Laws and Regulations Database (https://flk.npc.gov.cn/index). Both platforms are national-level official sources with high policy coverage and credibility. In addition, the PKULaw (Beida Fabao) V6 database served as an auxiliary retrieval tool to identify and fill gaps and to verify the completeness of the policy texts.

The search strategy is a title-limited search built on a "core keywords + extended keywords" framework. "Data elements" is the core keyword and defines the research scope precisely; "data" and "numbers" are extended keywords that capture differences in wording across policy contexts. This keeps retrieval targeted while improving sample coverage and reducing the risk of omission.
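The two-tier title search described above can be sketched as follows. This is a minimal illustration, not the authors' actual retrieval code: the keyword strings are the English renderings given in the description (the original search terms are not listed), and the record layout and function names (`match_title`, `title_limited_search`) are hypothetical.

```python
# Hypothetical keyword tiers; the description gives only English renderings.
CORE_KEYWORDS = ["data elements"]
EXTENDED_KEYWORDS = ["data", "numbers"]


def match_title(title):
    """Return 'core', 'extended', or None for the keyword tier a title hits."""
    lowered = title.lower()
    # Check the core tier first so a title containing "data elements"
    # is not reported as a mere "data" match.
    if any(keyword in lowered for keyword in CORE_KEYWORDS):
        return "core"
    if any(keyword in lowered for keyword in EXTENDED_KEYWORDS):
        return "extended"
    return None


def title_limited_search(records):
    """Keep records whose title (not body text) contains a keyword."""
    hits = []
    for record in records:
        tier = match_title(record["title"])
        if tier is not None:
            # Record which tier matched so extended hits can be reviewed.
            hits.append({**record, "matched_tier": tier})
    return hits
```

Tagging each hit with its tier makes it straightforward to screen the broader "extended" matches by hand while accepting "core" matches directly; whether the authors did so is not stated.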
With this strategy, policy texts on data element governance issued between 2000 and 2025 were obtained. The first wave of data collection was completed on November 6, 2025, and supplementary collection was completed on April 1, 2026.

Data collection combined automated and manual methods and proceeded in three stages: per-source collection, cross-checking, and merging. First, for government web pages that are publicly accessible, support batch download, and have a relatively regular page structure (such as the State Council policy document library and the National Laws and Regulations Database), web crawlers were written in Python on the Scrapy framework. Following preset request-rate limits and page parsing rules, the crawler traversed each source's policy list pages one by one and extracted the core fields according to those rules: policy title, issuing authority, issue date, and full text. Differences in page structure across sources were handled with multiple parsing rules. The crawler strictly observed each site's robots protocol and kept access intervals moderate, which lowers the risk of being blocked by anti-crawling mechanisms and keeps the automatically collected data standardized and consistent.
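The two politeness constraints described above (obeying robots rules and spacing out requests) can be illustrated without Scrapy. The sketch below uses only the Python standard library; it is not the authors' crawler. In Scrapy itself, the corresponding settings are `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY`. The class name `PoliteScheduler`, the user-agent string, and the 2-second interval are assumptions for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Gate requests on robots.txt rules and a minimum interval between fetches."""

    def __init__(self, robots_lines, user_agent="policy-corpus-bot",
                 min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.parser = RobotFileParser()
        # Parse robots.txt content that has already been fetched as text lines.
        self.parser.parse(robots_lines)
        self.user_agent = user_agent      # hypothetical crawler name
        self.min_interval = min_interval  # assumed spacing, in seconds
        self._clock = clock
        self._sleep = sleep
        self._last_fetch = None

    def allowed(self, url):
        """True if the robots rules permit this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep just long enough to keep min_interval between requests."""
        now = self._clock()
        if self._last_fetch is not None:
            remaining = self.min_interval - (now - self._last_fetch)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_fetch = now
```

A crawl loop would call `allowed(url)` before each list page and `wait_turn()` immediately before each request. The injectable `clock` and `sleep` parameters exist only so the timing logic can be checked without real delays.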
Second, for web resources with restricted access, authentication requirements, or JavaScript-based dynamic loading (such as the PKULaw V6 database), retrieval and collation were done manually: researchers downloaded policy texts one by one and organized the information in a structured form so that these restricted resources were collected in full. Finally, the crawler-collected and manually collected datasets were merged. Duplicates were removed and collection errors corrected through a combination of a deduplication function and manual cross-checking. This minimizes omissions caused by technical limitations and page-structure differences, improves the completeness and accuracy of the raw database, and lays a foundation for subsequent data processing.
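The merge-and-deduplicate step can be sketched as below. This is an illustrative approach under stated assumptions, not the deduplication function the authors used: the record fields (`title`, `issuer`, `date`), the choice of matching key, and the function names are all hypothetical.

```python
import re
import unicodedata


def normalize_title(title):
    """Fold width variants, case, and whitespace so near-identical titles match."""
    folded = unicodedata.normalize("NFKC", title)
    return re.sub(r"\s+", "", folded).lower()


def merge_and_dedupe(crawled, manual):
    """Merge two record lists; the first record seen for a key is kept.

    Returns (merged, duplicates). Duplicates are returned rather than
    silently dropped so they can be cross-checked by hand.
    """
    merged, duplicates, seen = [], [], {}
    for source, records in (("crawler", crawled), ("manual", manual)):
        for record in records:
            # Assumed identity key: normalized title + issuer + date.
            key = (normalize_title(record["title"]),
                   record["issuer"], record["date"])
            if key in seen:
                duplicates.append(
                    {**record, "source": source, "duplicate_of": seen[key]})
            else:
                seen[key] = len(merged)
                merged.append({**record, "source": source})
    return merged, duplicates
```

Keeping the rejected duplicates, each with the index of the record it collided with, supports the manual cross-check the description mentions: a reviewer can confirm that each pair really is the same document before discarding one copy.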
Provider:
Science Data Bank
Created:
2026-01-05