ESVdatalabb/myndighetscrawl
Source: Hugging Face | Updated: 2024-12-07 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/ESVdatalabb/myndighetscrawl
Description:
Myndighetscrawl is a small project that gathers and analyses PDF files published on Swedish government websites, using open web-crawl sources such as the Internet Archive and Common Crawl. The Swedish government consists of a large number of agencies that together publish thousands of analyses and reports every year, but these are hard to find and collect because publication is decentralised and open data best practices are often not followed. Myndighetscrawl addresses this by querying the APIs of these open sources for lists of PDF files archived from government websites. The data currently available is the raw list of links returned by each service, stored in archive_org.parquet (Internet Archive) and common_crawl.parquet (Common Crawl), with fields describing the archival time, original URL, file size, and so on. Planned next steps are identifying high-value documents, filtering out duplicates, and determining important metadata about the documents.
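The duplicate-filtering step listed as future work could look roughly like the sketch below. It is not the project's implementation. It assumes each row is a dict with hypothetical keys `url` and `timestamp` (the real column names in the parquet files may differ), and that timestamps are fixed-width digit strings such as `20240101000000`, so that string comparison matches chronological order.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case the scheme and host and drop the fragment.

    Path and query are kept as-is, because they can be case-sensitive.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def dedupe_links(records):
    """Keep only the most recent capture for each normalized URL.

    `records` is an iterable of dicts with the assumed keys
    "url" and "timestamp" (hypothetical names, see the note above).
    """
    latest = {}
    for rec in records:
        key = normalize_url(rec["url"])
        # Replace the stored record only when this capture is newer.
        if key not in latest or rec["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rec
    return list(latest.values())


# Made-up example rows, not taken from the dataset.
sample = [
    {"url": "https://www.example.se/rapport.pdf", "timestamp": "20230101000000"},
    {"url": "HTTPS://WWW.EXAMPLE.SE/rapport.pdf#page=2", "timestamp": "20240101000000"},
    {"url": "https://www.example.se/annan.pdf", "timestamp": "20220101000000"},
]
unique = dedupe_links(sample)
```

Rows could be read from the files with something like `pandas.read_parquet("archive_org.parquet").to_dict("records")`, after mapping the actual column names onto the assumed keys. Matching on URL alone will miss the same PDF published under different addresses; comparing content hashes would be a stricter alternative if the sources provide one.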
Provider:
ESVdatalabb



