A dataset of Data Subject Access Request Packages

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/11634937
Description:
Overview
This dataset is a minimal example of Data Subject Access Request Packages (SARPs), as they can be retrieved under data protection laws, specifically the GDPR. It includes data from two data subjects, each with accounts for five major services, namely Amazon, Apple, Facebook, Google, and LinkedIn.

Purpose and Usage
This dataset is meant to be an initial dataset that allows for manual exploration of the structures and contents found in SARPs. Hence, the number of controllers and user profiles is kept minimal but sufficient to allow cross-subject and cross-controller analysis. The dataset can be used to explore the structures, formats, and data types found in real-world SARPs, and thereby to facilitate the planning of future SARP-based research projects and studies.

We invite other researchers to use this dataset to explore the structure of SARPs. The envisioned primary usage includes the development of user-centric privacy interfaces and other technical contributions in the area of data access rights. These packages can also be used for exemplified data analyses, although no substantive research questions can be answered using this data. In particular, this data does not reflect how data subjects behave in the real world. However, it is representative enough to give a first impression of the types of data analysis possible when using real-world data.

Data Generation
In order to allow cross-subject analysis while keeping the re-identification risk minimal, we used research-only accounts for the data generation. A detailed explanation of the data generation method can be found in the paper corresponding to the dataset, accepted for the Annual Privacy Forum 2024. In short, two user profiles were designed and corresponding accounts were created for each of the five services. Then, those accounts were used for two to four months.
During the usage period, we minimized the amount of identifying data and also avoided interactions with data subjects who were not part of this research. Afterwards, we performed a data access request via the controller's web interface. Finally, the data was cleansed as described in detail in the accompanying paper and in brief in the following section.

Data Cleansing
Before publication, both possibly identifying information and security-relevant attributes need to be obfuscated or deleted. Moreover, multi-party data (especially messages with external entities) must be deleted. Where data is obfuscated, we made sure to substitute multiple occurrences of the same information with the same replacement.

We provide a list of deleted and obfuscated items, the obfuscation scheme and, if applicable, the replacement. The list of obfuscated items looks like the following example:

| path | filetype | filename | attribute | scheme | replacement |
|---|---|---|---|---|---|
| linkedin\Linkedin_Basic | csv | messages.csv | TO | semantic description | Firstname Lastname |
| gooogle\Meine Aktivitäten\Datenexport | html | MeineAktivitäten.html | IP Address | loopback | 127.142.201.194 |
| facebook\personal_information | json | profile_information.json | emails | semantic description | firstname.lastname@gmail.com |

Data Characterization
To give an overview of the dataset, we publicly provide some metadata about the usage time and the SARP characteristics of the exports. Values are given as subject A / subject B.

| provider | usage time (in months) | export options | file types | # subfolders | # files | export size |
|---|---|---|---|---|---|---|
| Amazon | 2/4 | all categories | CSV (32/49), EML (2/5), JPEG (1/2), JSON (3/3), PDF (9/10), TXT (4/4) | 41/49 | 51/73 | 1.2 MB / 1.4 MB |
| Apple | 2/4 | all data; max. 1 GB / max. 4 GB | CSV (8/3) | 20/1 | 8/3 | 71.8 KB / 294.8 KB |
| Facebook | 2/4 | all data; JSON/HTML; on my computer | JSON (39/0), HTML (0/63), TXT (29/28), JPG (0/4), PNG (1/15), GIF (7/7) | 45/76 | 76/117 | 12.3 MB / 13.5 MB |
| Google | 2/4 | all data; frequency once; ZIP; max. 4 GB | HTML (8/11), CSV (10/13), JSON (27/28), TXT (14/14), PDF (1/1), MBOX (1/1), VCF (1/0), ICS (1/0), README (1/1), JPG (0/2) | 44/51 | 64/71 | 1.54 MB / 1.2 MB |
| LinkedIn | 2/4 | all data | CSV (18/21) | 0/0 (part 1/2); 0/0 (part 1/2) | 13/18; 19/21 | 3.9 KB / 6.0 KB; 6.2 KB / 9.2 KB |

Authors
This data collection was performed by Daniela Pöhn (Universität der Bundeswehr München, Germany), Frank Pallas and Nicola Leschke (Paris Lodron Universität Salzburg, Austria). For questions, please contact nicola.leschke@plus.ac.at.

Accompanying Paper
The dataset was collected according to the method presented in: Leschke, Pöhn, and Pallas (2024). "How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages". Accepted for Annual Privacy Forum 2024.
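
Illustrative sketch (not part of the original dataset description)
The Data Cleansing section states that repeated occurrences of the same information receive the same replacement. The following Python sketch shows one way such a consistent substitution could work. It is not the authors' tooling: the class name `ConsistentObfuscator`, the helper `loopback_address`, the sequential numbering of replacements, and the sample log line are all assumptions made for illustration. Only the idea of mapping IP addresses into the loopback range is taken from the example list above.

```python
import re
from typing import Callable, Dict


class ConsistentObfuscator:
    """Hypothetical helper: equal inputs always receive equal replacements."""

    def __init__(self, make_replacement: Callable[[int], str]) -> None:
        # make_replacement receives the number of distinct values seen so far
        # and returns the replacement for the next new value.
        self._make_replacement = make_replacement
        self._mapping: Dict[str, str] = {}

    def replace(self, value: str) -> str:
        # Reuse the stored replacement if this value was seen before.
        if value not in self._mapping:
            self._mapping[value] = self._make_replacement(len(self._mapping))
        return self._mapping[value]

    def scrub(self, text: str, pattern: re.Pattern) -> str:
        # Substitute every match of the pattern with its consistent replacement.
        return pattern.sub(lambda match: self.replace(match.group(0)), text)


# Simple IPv4 matcher; it does not validate that each octet is <= 255.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def loopback_address(index: int) -> str:
    # Assumed scheme: the n-th distinct address becomes the n-th address
    # after 127.0.0.0 inside the loopback block 127.0.0.0/8.
    n = index + 1
    return f"127.{(n >> 16) & 255}.{(n >> 8) & 255}.{n & 255}"


ips = ConsistentObfuscator(loopback_address)
log = "login from 203.0.113.7; logout from 203.0.113.7; login from 198.51.100.23"
cleaned = ips.scrub(log, IPV4)
print(cleaned)
# login from 127.0.0.1; logout from 127.0.0.1; login from 127.0.0.2
```

Keeping one obfuscator instance per attribute type for the whole package, rather than one per file, is what would preserve the ability to link the same value across files after cleansing.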
Created:
2024-07-05