
A dataset of temporal signals for web page creation obtained from the Internet Archive's GeoCities end-of-life crawl

Source: Zenodo. Updated: 2026-04-24. Indexed: 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.19224004
Description:
This dataset is obtained by scanning the Internet Archive's GeoCities end-of-life crawl and contains temporal signals extracted from the archived pages. Using a combination of HTML parsing, text extraction, and regular expressions, we collect a range of potential indicators of a page's original creation or last update date. These include explicit expressions such as "first posted", "last updated", or copyright statements, as well as dates recorded in HTTP headers. Where only a year is found, we interpret the four-digit year value as referring to the middle of that year (i.e. the first of July).

Since pages may contain multiple temporal indicators, we define an order for selecting a single best guess. Priority is given first to explicit creation dates, then to the earliest copyright year mentioned, followed by last updated dates. This procedure provides an estimate of the creation date obtained from the text of a web page.

To store the results, we keep the crawl structure consisting of 149 segments, in which each folder is named after the corresponding segment identifier, e.g. GEOCITIES-20090829030404-00020-ia400131-c. For each input WARC file in a segment we create a corresponding CSV file which lists the following information for every successfully processed WARC record:

url: the record's URL
payload_digest: the record's payload digest
record_date_norm: parsed record.headers['WARC-Date']
http_date_norm: parsed HTTP Date header
http_last_modified_norm: parsed HTTP Last-Modified header
first_posted_raw: "first posted|created" date found using regular expressions, e.g. "2nd of October, 2008"
first_posted_norm: the same date in YYYY-MM-DD format, e.g. 2008-10-02
last_updated_raw: "last updated|modified|revised" date found using regular expressions
last_updated_norm: the same date in YYYY-MM-DD format
copyright_years: copyright date found using regular expressions
creation_best_guess: computed best guess of the page's original creation or last update date
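
The selection order described above can be illustrated with a minimal Python sketch. It is not the dataset authors' code. The function names `year_to_date` and `best_guess` are hypothetical, and the sketch assumes that the `*_norm` columns hold YYYY-MM-DD strings (or are empty) and that `copyright_years` is a free-text string containing one or more four-digit years; the actual serialization of that column may differ.

```python
import re
from datetime import date
from typing import Optional

# Assumption: copyright_years is free text holding four-digit years (1900-2099).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def year_to_date(year: int) -> date:
    """Map a year-only value to the middle of that year (first of July)."""
    return date(year, 7, 1)


def best_guess(
    first_posted_norm: Optional[str],
    copyright_years: Optional[str],
    last_updated_norm: Optional[str],
) -> Optional[date]:
    """Pick one date: creation date, then earliest copyright year, then last update."""
    # 1. An explicit creation date has the highest priority.
    if first_posted_norm:
        return date.fromisoformat(first_posted_norm)
    # 2. Otherwise use the earliest copyright year, placed at mid-year.
    years = [int(y) for y in YEAR_RE.findall(copyright_years or "")]
    if years:
        return year_to_date(min(years))
    # 3. Otherwise fall back to the last updated date.
    if last_updated_norm:
        return date.fromisoformat(last_updated_norm)
    # No usable indicator was found in the page text.
    return None
```

For example, a record with no explicit creation date, copyright years "1998-2003", and a last updated date of 2009-01-01 would resolve to 1998-07-01 under these assumptions, because the earliest copyright year outranks the last updated date.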
Provider:
Zenodo
Created:
2026-03-26