Multilingual Scraper of Privacy Policies and Terms of Service

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14562038


Description:

# Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW '25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; the concrete numbers are given below.

The following table lists the number of websites visited per month:

| Month | Number of websites |
|---------|---------|
| 2024-01 | 551'148 |
| 2024-02 | 792'921 |
| 2024-03 | 844'537 |
| 2024-04 | 802'169 |
| 2024-05 | 805'878 |
| 2024-06 | 809'518 |
| 2024-07 | 811'418 |
| 2024-08 | 813'534 |
| 2024-09 | 814'321 |
| 2024-10 | 817'586 |
| 2024-11 | 828'662 |
| 2024-12 | 827'101 |

The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect, resulting in two websites being scraped, or may have to be retried.

To simplify access, we release the data as large CSV files: for each month there is one file for policies and another for terms. Each file contains all metadata usable for analysis. If your favourite CSV parser reports the same numbers as above, the dataset has been parsed correctly. We use ',' as the separator, the first row is the header, and strings are quoted.

Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication). Such documents might contain personal data, such as addresses of authors of websites that they maintain only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens.

## Preliminaries

The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. You also likely cannot load all the data into memory at once, so we split the data by month and into separate files for policies and terms.
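Because the files are too large to load at once, a streaming reader is one way to run the parsing sanity check described above. The following is a minimal sketch using only Python's standard `csv` module. It assumes the files are UTF-8 encoded and that the monthly website count corresponds to the number of distinct `website_month_id` values; neither assumption is confirmed by this README. The helper names `count_distinct` and `count_websites` are hypothetical.

```python
import csv
import io

# Policy texts can be long; raise the csv module's per-field size limit.
csv.field_size_limit(2**31 - 1)


def count_distinct(rows, column="website_month_id"):
    """Stream over parsed rows and count the distinct values of one column."""
    seen = set()
    for row in rows:
        seen.add(row[column])
    return len(seen)


def count_websites(path):
    """Count distinct websites in one monthly file without loading it fully.

    Assumption: the file is UTF-8 encoded. newline="" lets the csv module
    handle line breaks inside quoted fields itself.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        return count_distinct(csv.DictReader(fh))


# Tiny in-memory stand-in for a monthly file (made-up rows, not real data).
sample = io.StringIO(
    "website_month_id,job_id,policy_content\n"
    '1,10,"First line,\nsecond line"\n'
    '1,11,"Redirect target"\n'
    '2,12,"Other policy"\n'
)
demo_count = count_distinct(csv.DictReader(sample))
print(demo_count)
```

To check a real download, pass the path of one monthly CSV to `count_websites` and compare the result with the table above. If the numbers differ, the counting assumption may be wrong rather than the parse.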
## Files and structure

The files have the following names:

- `2024__policy.csv` for policies
- `2024__terms.csv` for terms

## Shared metadata

Both files contain the following metadata columns:

- `website_month_id` - identification of the crawled website
- `job_id` - one website can have multiple jobs in case of redirects (but most commonly has only one)
- `website_index_status` - network state of loading the index page, as resolved by the Chrome DevTools Protocol. Can be:
  - `DNS_ERROR` - domain cannot be resolved
  - `OK` - all fine
  - `REDIRECT` - domain redirects to somewhere else
  - `TIMEOUT` - the request timed out
  - `BAD_CONTENT_TYPE` - 415 Unsupported Media Type
  - `HTTP_ERROR` - 404 error
  - `TCP_ERROR` - error in the network connection
  - `UNKNOWN_ERROR` - unknown error
- `website_lang` - language of the index page, detected with the langdetect library
- `website_url` - the URL of the website sampled from the CrUX list (may contain subdomains, etc.). Use this as a unique identifier for connecting data between months.
- `job_domain_status` - indicates the status of loading the index page. Can be:
  - `OK` - all works well (at the moment, this should be all entries)
  - `BLACKLISTED` - URL is on our list of blocked URLs
  - `UNSAFE` - website is not safe according to the Safe Browsing API by Google
  - `LOCATION_BLOCKED` - country is in the list of blocked countries
- `job_started_at` - when the visit of the website started
- `job_ended_at` - when the visit of the website ended
- `job_crux_popularity` - JSON with all popularity ranks of the website this month
- `job_index_redirect` - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The `index_redirect` is then the `job.id` corresponding to the redirect target.
- `job_num_starts` - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl; the maximum is 3)
- `job_from_static` - whether this job was included in the static selection (see Sec. 3.3 of the paper)
- `job_from_dynamic` - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper). This is not exclusive with `from_static`: both can be true when the lists overlap.
- `job_crawl_name` - our name of the crawl, containing year and month (e.g., 'regular-2024-12' for the regular crawl in Dec 2024)

## Policy data

- `policy_url_id` - ID of the URL of this policy
- `policy_keyword_score` - score (higher is better), according to the crawler's keyword list, that the given document is a policy
- `policy_ml_probability` - probability assigned by the BERT model that the given document is a policy
- `policy_consideration_basis` - the basis on which we decided that this URL is a policy. The following three options are executed by the crawler in this order:
  - 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
  - 'search' - this policy was found using a search engine
  - 'path guessing' - this policy was found by using well-known URLs like example.com/policy
- `policy_url` - full URL to the policy
- `policy_content_hash` - used as an identifier: if the document remained the same between crawls, it won't create a new entry
- `policy_content` - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
- `policy_lang` - language of the content, detected by fasttext

## Terms data

Analogous to the policy data; just substitute `terms` for `policy`.

## Updates

Check this Google Docs document for an updated version of this README.md.
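As an illustration of how the policy columns could be combined, the sketch below keeps one row per distinct document that the classifier rates as a likely policy. It is not the authors' method. The function name `likely_policies`, the 0.5 threshold, and the assumption that a missing score appears as an empty field are all hypothetical choices made for this example.

```python
import csv
import io


def likely_policies(rows, min_probability=0.5):
    """Yield the first row for each distinct document rated as a policy.

    Assumptions (not stated in the README): policy_ml_probability parses
    as a float, and a missing score shows up as an empty field.
    """
    seen_hashes = set()
    for row in rows:
        raw = row.get("policy_ml_probability") or ""
        try:
            probability = float(raw)
        except ValueError:
            continue  # empty or non-numeric: no classifier score
        content_hash = row.get("policy_content_hash")
        if probability < min_probability or content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)
        yield row


# Made-up rows that only mimic the documented column names.
sample = io.StringIO(
    "website_url,policy_ml_probability,policy_content_hash,policy_lang\n"
    "https://a.example,0.97,h1,en\n"
    "https://b.example,0.97,h1,en\n"
    "https://c.example,0.12,h2,de\n"
    "https://d.example,,h3,fr\n"
)
kept = [row["website_url"] for row in likely_policies(csv.DictReader(sample))]
print(kept)
```

Deduplicating on `policy_content_hash` means a document shared by several websites is counted once. Drop that check if per-website rows are needed. For the terms files, the same sketch would apply with `terms` substituted for `policy` in the column names.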


Created:

2025-03-25
