zjysteven/WikiMIA_paraphrased_perturbed
Source: Hugging Face. Updated 2024-04-05, indexed 2024-06-11.
Download link: https://hf-mirror.com/datasets/zjysteven/WikiMIA_paraphrased_perturbed
---
dataset_info:
  features:
  - name: input
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: WikiMIA_length32_paraphrased
    num_bytes: 163365
    num_examples: 776
  - name: WikiMIA_length64_paraphrased
    num_bytes: 224644
    num_examples: 542
  - name: WikiMIA_length128_paraphrased
    num_bytes: 206645
    num_examples: 250
  - name: WikiMIA_length32_perturbed
    num_bytes: 1650773
    num_examples: 7760
  - name: WikiMIA_length64_perturbed
    num_bytes: 2255354
    num_examples: 5420
  - name: WikiMIA_length128_perturbed
    num_bytes: 2092896
    num_examples: 2500
  - name: WikiMIA_length32_paraphrased_perturbed
    num_bytes: 1662467
    num_examples: 7760
  - name: WikiMIA_length64_paraphrased_perturbed
    num_bytes: 2286059
    num_examples: 5420
  - name: WikiMIA_length128_paraphrased_perturbed
    num_bytes: 2105242
    num_examples: 2500
  download_size: 3282711
  dataset_size: 12647445
configs:
- config_name: default
  data_files:
  - split: WikiMIA_length32_paraphrased
    path: data/WikiMIA_length32_paraphrased-*
  - split: WikiMIA_length64_paraphrased
    path: data/WikiMIA_length64_paraphrased-*
  - split: WikiMIA_length128_paraphrased
    path: data/WikiMIA_length128_paraphrased-*
  - split: WikiMIA_length32_perturbed
    path: data/WikiMIA_length32_perturbed-*
  - split: WikiMIA_length64_perturbed
    path: data/WikiMIA_length64_perturbed-*
  - split: WikiMIA_length128_perturbed
    path: data/WikiMIA_length128_perturbed-*
  - split: WikiMIA_length32_paraphrased_perturbed
    path: data/WikiMIA_length32_paraphrased_perturbed-*
  - split: WikiMIA_length64_paraphrased_perturbed
    path: data/WikiMIA_length64_paraphrased_perturbed-*
  - split: WikiMIA_length128_paraphrased_perturbed
    path: data/WikiMIA_length128_paraphrased_perturbed-*
license: mit
---
## 📘 WikiMIA paraphrased and perturbed versions
WikiMIA is a benchmark for evaluating membership inference attack (MIA) methods, specifically for detecting the pretraining data of large language models.
It was originally constructed by Shi et al. (see the [original data repo](https://huggingface.co/datasets/swj0419/WikiMIA) for more details).
- The authors studied a *paraphrased* setting in their paper, where the goal is to detect (slightly) paraphrased versions of training texts rather than verbatim ones. Unfortunately, they did not release those data splits. Here we provide our own paraphrased version, obtained by instructing ChatGPT to replace a certain number of words without changing the original meaning.
- We further provide perturbed versions of WikiMIA, which are needed to run the Neighbor attack. They are obtained by perturbing each input sentence with a masked language model. Each input has already been perturbed 10 times, so you don't have to repeat this (potentially time-consuming) process yourself.
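The sketch below shows one way the perturbed splits could be consumed for a Neighbor-style score. It rests on two assumptions that this card does not confirm: that the 10 perturbations of each input are stored consecutively, and that the score is the loss on the original text minus the mean loss on its perturbed neighbors. The helper names `group_neighbors` and `neighbor_score` are hypothetical, and the losses are made-up toy numbers.

```python
from statistics import mean

NUM_NEIGHBORS = 10  # each input was perturbed 10 times, per the description above


def group_neighbors(perturbed_rows, num_neighbors=NUM_NEIGHBORS):
    """Chunk a flat list of perturbed rows into per-input groups.

    ASSUMPTION: the perturbations of one input are stored consecutively.
    """
    if len(perturbed_rows) % num_neighbors != 0:
        raise ValueError("row count is not a multiple of num_neighbors")
    return [
        perturbed_rows[i:i + num_neighbors]
        for i in range(0, len(perturbed_rows), num_neighbors)
    ]


def neighbor_score(original_loss, neighbor_losses):
    """Original loss minus mean neighbor loss (ASSUMED form of the score).

    A lower value suggests the model fits the original text unusually well
    relative to its perturbed neighbors.
    """
    return original_loss - mean(neighbor_losses)


# Toy check with made-up values (not real model outputs).
groups = group_neighbors(list(range(20)))
toy_score = neighbor_score(2.0, [2.5, 3.0, 3.5])
```

The split sizes in the metadata are consistent with 10 neighbors per input: for length 32, 7760 perturbed examples correspond to 776 paraphrased ones. Whether rows are actually ordered this way should be checked against the data before relying on the grouping.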
## 💻 Loading the datasets
To load the dataset:
```python
from datasets import load_dataset
LENGTH = 32
SPLIT_NAME = "paraphrased"
dataset = load_dataset("zjysteven/WikiMIA_paraphrased_perturbed", split=f"WikiMIA_length{LENGTH}_{SPLIT_NAME}")
```
* `LENGTH`: one of `32`, `64`, `128`; the length of the input text.
* `SPLIT_NAME`: one of `"paraphrased"`, `"perturbed"`, `"paraphrased_perturbed"`.
* *Label 0*: data unseen during pretraining (non-training data). *Label 1*: data seen during pretraining (training data).
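To work with several splits at once, the naming scheme above can be wrapped in a few small helpers. This is a minimal sketch: `split_name`, `load_split`, and `label_counts` are hypothetical names introduced here, and `load_split` needs the `datasets` package and network access.

```python
from collections import Counter

DATASET_ID = "zjysteven/WikiMIA_paraphrased_perturbed"
LENGTHS = (32, 64, 128)
VARIANTS = ("paraphrased", "perturbed", "paraphrased_perturbed")


def split_name(length, variant):
    """Build a split name such as 'WikiMIA_length32_paraphrased'."""
    if length not in LENGTHS:
        raise ValueError(f"length must be one of {LENGTHS}, got {length!r}")
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"WikiMIA_length{length}_{variant}"


def load_split(length, variant):
    """Load one split from the Hub (requires `datasets` and network access)."""
    # Imported lazily so the other helpers work without the package installed.
    from datasets import load_dataset
    return load_dataset(DATASET_ID, split=split_name(length, variant))


def label_counts(rows):
    """Count label 0 (unseen) and label 1 (seen) examples in an iterable of rows."""
    return Counter(row["label"] for row in rows)


# All nine split names listed in the dataset metadata.
ALL_SPLITS = [split_name(n, v) for v in VARIANTS for n in LENGTHS]
```

For example, `label_counts(load_split(32, "paraphrased"))` reports how many seen and unseen examples a split contains.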
## 🛠️ Codebase
For details on evaluating multiple MIA methods on these WikiMIA datasets, visit our [GitHub repository](https://github.com/zjysteven/mink-plus-plus). There we also propose a novel method, **Min-K%++**, that significantly outperforms both Min-K% by Shi et al. and other baseline methods.
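Evaluating an MIA method on these splits amounts to assigning each input a score and measuring how well the scores separate label 1 from label 0. The sketch below assumes AUROC as the metric and a "higher score means more likely seen" convention; both are assumptions made for illustration, and the repository's own evaluation code may differ. The function `auroc` is a hypothetical helper, and the scores are made up.

```python
def auroc(scores, labels):
    """Probability that a random label-1 example outscores a random label-0 one.

    Ties count as half. ASSUMES higher scores mean "more likely seen".
    Compares every seen/unseen pair, which is fine for splits of this size.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (made up): one of the four seen/unseen pairs is ranked wrongly.
toy_auc = auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

An AUROC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation of seen from unseen inputs.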
## ⭐ Citing our Work
If you find our codebase and datasets useful, please cite our work and the original WikiMIA:
```bibtex
@misc{zhang2024mink,
title={Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models},
author={Jingyang Zhang and Jingwei Sun and Eric Yeats and Yang Ouyang and Martin Kuo and Jianyi Zhang and Hao Yang and Hai Li},
year={2024},
eprint={2404.02936},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{
shi2024detecting,
title={Detecting Pretraining Data from Large Language Models},
author={Weijia Shi and Anirudh Ajith and Mengzhou Xia and Yangsibo Huang and Daogao Liu and Terra Blevins and Danqi Chen and Luke Zettlemoyer},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=zWqr3MQuNs}
}
```
Provider: zjysteven