A dataset of Data Subject Access Request Packages
Source: NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/11634937
Resource description:
Overview
This dataset is a minimal example of Data Subject Access Request Packages (SARPs) as they can be retrieved under data protection laws, specifically the GDPR. It includes data from two data subjects, each with accounts for five major services, namely Amazon, Apple, Facebook, Google, and LinkedIn.
Purpose and Usage
This dataset is meant to be an initial dataset that allows for manual exploration of the structures and contents found in SARPs. Hence, the number of controllers and user profiles was kept minimal but sufficient to allow cross-subject and cross-controller analysis. The dataset can be used to explore structures, formats, and data types found in real-world SARPs, which should facilitate the planning of future SARP-based research projects and studies. We invite other researchers to use this dataset to explore the structure of SARPs. The envisioned primary usage includes the development of user-centric privacy interfaces and other technical contributions in the area of data access rights. Moreover, these packages can also be used for exemplary data analyses, although no substantive research questions can be answered using this data. In particular, this data does not reflect how data subjects behave in the real world. However, it is representative enough to give a first impression of the types of data analysis possible with real-world data.
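As a starting point for the structural exploration described above, the following minimal sketch lists the column names of each CSV file and the top-level keys of each JSON file in an extracted package. The function names and the assumption that the files are UTF-8 encoded are illustrative and not part of the dataset documentation; other file types (HTML, MBOX, PDF, and so on) are ignored here.

```python
import csv
import json
from pathlib import Path


def summarize_file(path: Path) -> list[str]:
    """Return column names for a CSV file or top-level keys for a JSON file."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # utf-8-sig strips a byte order mark if the export contains one.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            return next(csv.reader(handle), [])
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Only JSON objects have named keys; other types are reported by name.
        if isinstance(data, dict):
            return sorted(data)
        return [f"<{type(data).__name__}>"]
    return []


def summarize_package(root: Path) -> dict[str, list[str]]:
    """Map each CSV/JSON file (path relative to root) to its structure."""
    summary = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".csv", ".json"}:
            summary[path.relative_to(root).as_posix()] = summarize_file(path)
    return summary
```

Calling `summarize_package(Path("path/to/extracted/export"))` for the same controller and both subjects gives a quick way to compare which files and fields appear in each package.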
Data Generation
In order to allow cross-subject analysis, while keeping the re-identification risk minimal, we used research-only accounts for the data generation. A detailed explanation of the data generation method can be found in the paper corresponding to the dataset, accepted for the Annual Privacy Forum 2024.
In short, two user profiles were designed and corresponding accounts were created for each of the five services. Then, those accounts were used for two to four months. During the usage period, we minimized the amount of identifying data and also avoided interactions with data subjects not part of this research. Afterwards, we performed a data access request via each controller's web interface. Finally, the data was cleansed as described in detail in the accompanying paper and in brief within the following section.
Data Cleansing
Before publication, both possibly identifying information and security-relevant attributes had to be obfuscated or deleted. Moreover, multi-party data (especially messages with external entities) had to be deleted. Where data was obfuscated, we made sure to substitute multiple occurrences of the same information with the same replacement. We provide a list of deleted and obfuscated items, the obfuscation scheme and, if applicable, the replacement.
The list of obfuscated items looks like the following example:
| path | filetype | filename | attribute | scheme | replacement |
|---|---|---|---|---|---|
| linkedin\Linkedin_Basic | csv | messages.csv | TO | semantic description | Firstname Lastname |
| gooogle\Meine Aktivitäten\Datenexport | html | MeineAktivitäten.html | IP Address | loopback | 127.142.201.194 |
| facebook\personal_information | json | profile_information.json | emails | semantic description | firstname.lastname@gmail.com |
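The requirement that the same information always receives the same replacement can be illustrated with a single lookup table applied to all texts. This is only a sketch of the idea, not the authors' cleansing tooling: the function name `obfuscate` and the original values used in the usage note are hypothetical, and only the replacement strings are taken from the example list above.

```python
import re


def obfuscate(texts: list[str], replacements: dict[str, str]) -> list[str]:
    """Replace every occurrence of each sensitive value with its fixed substitute."""
    if not replacements:
        return list(texts)
    # Try longer values first so a value that contains another one takes precedence.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(value) for value in ordered))
    # One shared mapping guarantees identical input yields identical output everywhere.
    return [pattern.sub(lambda m: replacements[m.group(0)], text) for text in texts]
```

For example, with the hypothetical mapping `{"Jane Doe": "Firstname Lastname", "jane.doe@example.org": "firstname.lastname@gmail.com"}`, every file that mentions the name or the address ends up with the same placeholder, so cross-file and cross-controller links between records are preserved after cleansing.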
Data Characterization
To give an overview of the dataset, we publicly provide some metadata about the usage time and SARP characteristics of the exports. Values are given as subject A / subject B.
| provider | usage time (months) | export options | file types | # subfolders | # files | export size |
|---|---|---|---|---|---|---|
| Amazon | 2/4 | all categories | CSV (32/49), EML (2/5), JPEG (1/2), JSON (3/3), PDF (9/10), TXT (4/4) | 41/49 | 51/73 | 1.2 MB / 1.4 MB |
| Apple | 2/4 | all data; max. 1 GB / max. 4 GB | CSV (8/3) | 20/1 | 8/3 | 71.8 KB / 294.8 KB |
| Facebook | 2/4 | all data; JSON/HTML; on my computer | JSON (39/0), HTML (0/63), TXT (29/28), JPG (0/4), PNG (1/15), GIF (7/7) | 45/76 | 76/117 | 12.3 MB / 13.5 MB |
| Google | 2/4 | all data; frequency once; ZIP; max. 4 GB | HTML (8/11), CSV (10/13), JSON (27/28), TXT (14/14), PDF (1/1), MBOX (1/1), VCF (1/0), ICS (1/0), README (1/1), JPG (0/2) | 44/51 | 64/71 | 1.54 MB / 1.2 MB |
| LinkedIn | 2/4 | all data | CSV (18/21) | 0/0 (part 1/2); 0/0 (part 1/2) | 13/18; 19/21 | 3.9 KB / 6.0 KB; 6.2 KB / 9.2 KB |
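Comparable figures can be derived from an extracted export with a short directory walk. The sketch below is an assumption about how such counts might be obtained, not the authors' procedure: it counts all nested subfolders, groups files by uppercased extension, and sums the sizes of the extracted files, so results may differ from the table (for instance if sizes there refer to compressed archives or if extensionless files were labeled differently).

```python
from collections import Counter
from pathlib import Path


def characterize(root: Path) -> dict:
    """Count subfolders, files, file types, and total size below root."""
    file_types = Counter()
    n_subfolders = 0
    n_files = 0
    total_bytes = 0
    for path in root.rglob("*"):
        if path.is_dir():
            n_subfolders += 1
        elif path.is_file():
            n_files += 1
            total_bytes += path.stat().st_size
            # Files without an extension are grouped under "(none)".
            file_types[path.suffix.lstrip(".").upper() or "(none)"] += 1
    return {
        "file_types": dict(file_types),
        "subfolders": n_subfolders,
        "files": n_files,
        "bytes": total_bytes,
    }
```

Running `characterize` once per subject and controller yields one row of such a table per export.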
Authors
This data collection was performed by Daniela Pöhn (Universität der Bundeswehr München, Germany), Frank Pallas and Nicola Leschke (Paris Lodron Universität Salzburg, Austria). For questions, please contact nicola.leschke@plus.ac.at.
Accompanying Paper
The dataset was collected according to the method presented in: Leschke, Pöhn, and Pallas (2024). "How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages". Accepted for Annual Privacy Forum 2024.
Created:
2024-07-05



