Coalition Avenir Québec (CAQ) web archive collection derivatives

NIAID Data Ecosystem · Indexed 2026-03-11

Download link:

https://zenodo.org/record/3687261

Description:

Web archive derivatives of the Coalition Avenir Québec (CAQ) collection from the Bibliothèque et Archives nationales du Québec (BAnQ). The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!

These derivatives are in Apache Parquet, a columnar storage format. They are generally small enough to work with on a local machine and can be easily converted to Pandas DataFrames. See this notebook for examples.

### Domains

`.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)`

Produces a DataFrame with the following columns:

- domain: domain name
- count: number of records for that domain

### Web Pages

`.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))`

Produces a DataFrame with the following columns:

- crawl_date: date of the crawl
- url: URL of the page
- mime_type_web_server: MIME type reported by the web server
- mime_type_tika: MIME type detected by Tika
- content: page content with HTTP headers and HTML tags removed

### Web Graph

`.webgraph()`

Produces a DataFrame with the following columns:

- crawl_date: date of the crawl
- src: source URL
- dest: destination URL
- anchor: anchor text

### Image Links

`.imageLinks()`

Produces a DataFrame with the following columns:

- src: source URL
- image_url: image URL

### Binary Analysis

Covers the following file types:

- Audio
- Images
- PDFs
- Presentation program files
- Spreadsheets
- Text files
- Videos
- Word processor files
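The Domains derivative groups page URLs by host and sorts the counts in descending order. As a rough illustration of that logic only, here is a minimal Python sketch using the standard library. The helper names `extract_domain` and `count_domains` and the sample URLs are hypothetical, and the toolkit's own `ExtractDomainDF` may normalize hosts differently (for example in how it treats a leading `www.`).

```python
from collections import Counter
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the lowercased host of a URL, or "" if none can be parsed."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit raises ValueError on some malformed URLs.
        return ""
    return host or ""


def count_domains(urls):
    """Group URLs by domain; return (domain, count) pairs, largest count first."""
    counts = Counter(extract_domain(u) for u in urls)
    # Sort by descending count, breaking ties alphabetically for stable output.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical URLs, for illustration only (not taken from the collection).
sample_urls = [
    "https://www.example.org/a",
    "https://www.example.org/b",
    "http://example.com/",
]
for domain, count in count_domains(sample_urls):
    print(domain, count)
```

In practice the `url` values would come from the Parquet files themselves, for instance after loading them with `pandas.read_parquet`, rather than from a hand-written list.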

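The `content` column of the Web Pages derivative is produced by `RemoveHTMLDF(RemoveHTTPHeaderDF(...))`, that is, by dropping the HTTP response header and then the HTML markup. The sketch below approximates those two steps with the Python standard library. It is not the toolkit's implementation: the function names are hypothetical, and the assumption that a header ends at the first blank line (`\r\n\r\n`) and that `<script>` and `<style>` contents should be skipped is mine.

```python
from html.parser import HTMLParser


def remove_http_header(record):
    """Drop a leading HTTP response header, if the record starts with one."""
    if record.startswith("HTTP/"):
        _head, sep, body = record.partition("\r\n\r\n")
        # If no blank line is found, leave the record untouched.
        return body if sep else record
    return record


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def remove_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical archived record, for illustration only.
record = (
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    "<html><body><p>Bonjour</p><p>Qu&eacute;bec</p></body></html>"
)
print(remove_html(remove_http_header(record)))
```

Joining text nodes with a single space is a simplification. A real HTML-to-text conversion distinguishes block and inline elements, so spacing around inline tags may differ from what the toolkit produces.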

Created:

2020-02-26