zjysteven/WikiMIA_paraphrased_perturbed

Name: zjysteven/WikiMIA_paraphrased_perturbed
Creator: zjysteven
Published: 2024-04-05 03:41:51
License: MIT

Hugging Face · Updated 2024-04-05 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/zjysteven/WikiMIA_paraphrased_perturbed

Resource description:

---
dataset_info:
  features:
  - name: input
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: WikiMIA_length32_paraphrased
    num_bytes: 163365
    num_examples: 776
  - name: WikiMIA_length64_paraphrased
    num_bytes: 224644
    num_examples: 542
  - name: WikiMIA_length128_paraphrased
    num_bytes: 206645
    num_examples: 250
  - name: WikiMIA_length32_perturbed
    num_bytes: 1650773
    num_examples: 7760
  - name: WikiMIA_length64_perturbed
    num_bytes: 2255354
    num_examples: 5420
  - name: WikiMIA_length128_perturbed
    num_bytes: 2092896
    num_examples: 2500
  - name: WikiMIA_length32_paraphrased_perturbed
    num_bytes: 1662467
    num_examples: 7760
  - name: WikiMIA_length64_paraphrased_perturbed
    num_bytes: 2286059
    num_examples: 5420
  - name: WikiMIA_length128_paraphrased_perturbed
    num_bytes: 2105242
    num_examples: 2500
  download_size: 3282711
  dataset_size: 12647445
configs:
- config_name: default
  data_files:
  - split: WikiMIA_length32_paraphrased
    path: data/WikiMIA_length32_paraphrased-*
  - split: WikiMIA_length64_paraphrased
    path: data/WikiMIA_length64_paraphrased-*
  - split: WikiMIA_length128_paraphrased
    path: data/WikiMIA_length128_paraphrased-*
  - split: WikiMIA_length32_perturbed
    path: data/WikiMIA_length32_perturbed-*
  - split: WikiMIA_length64_perturbed
    path: data/WikiMIA_length64_perturbed-*
  - split: WikiMIA_length128_perturbed
    path: data/WikiMIA_length128_perturbed-*
  - split: WikiMIA_length32_paraphrased_perturbed
    path: data/WikiMIA_length32_paraphrased_perturbed-*
  - split: WikiMIA_length64_paraphrased_perturbed
    path: data/WikiMIA_length64_paraphrased_perturbed-*
  - split: WikiMIA_length128_paraphrased_perturbed
    path: data/WikiMIA_length128_paraphrased_perturbed-*
license: mit
---

## 📘 WikiMIA paraphrased and perturbed versions

The WikiMIA dataset is a benchmark for evaluating membership inference attack (MIA) methods, specifically for detecting the pretraining data of large language models. It was originally constructed by Shi et al. (see the [original data repo](https://huggingface.co/datasets/swj0419/WikiMIA) for more details).

- The authors studied a *paraphrased* setting in their paper, where the goal is to detect (slightly) paraphrased versions of training texts rather than verbatim copies. Unfortunately, they did not release such data splits. Here we provide our paraphrased version, obtained by instructing ChatGPT to replace a certain number of words without changing the original semantic meaning.
- We further provide perturbed versions of WikiMIA, which are necessary to run the Neighbor attack. Perturbed versions are obtained by perturbing each input sentence with a masked language model. We perturbed each input 10 times, so you don't have to repeat this process yourself (which can be time-consuming).

## 💻 Loading the datasets

To load the dataset:

```python
from datasets import load_dataset

LENGTH = 32                 # input text length: 32, 64, or 128
SPLIT_NAME = "paraphrased"  # "paraphrased", "perturbed", or "paraphrased_perturbed"

dataset = load_dataset(
    "zjysteven/WikiMIA_paraphrased_perturbed",
    split=f"WikiMIA_length{LENGTH}_{SPLIT_NAME}",
)
```

* LENGTH: choose from `32, 64, 128`, which is the length of the input text.
* SPLIT_NAME: choose from `"paraphrased", "perturbed", "paraphrased_perturbed"`.
* *Label 0*: Refers to data unseen during pretraining (non-training data). *Label 1*: Refers to data seen during pretraining (training data).

## 🛠️ Codebase

For more details on evaluating multiple MIA methods on these WikiMIA datasets, visit our [GitHub repository](https://github.com/zjysteven/mink-plus-plus), where we also propose a novel method, **Min-K%++**, that significantly outperforms both Min-K% by Shi et al. and other baseline methods.

## ⭐ Citing our Work

If you find our codebase and datasets beneficial, kindly cite our work and the original WikiMIA:

```bibtex
@misc{zhang2024mink,
      title={Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models},
      author={Jingyang Zhang and Jingwei Sun and Eric Yeats and Yang Ouyang and Martin Kuo and Jianyi Zhang and Hao Yang and Hai Li},
      year={2024},
      eprint={2404.02936},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{shi2024detecting,
      title={Detecting Pretraining Data from Large Language Models},
      author={Weijia Shi and Anirudh Ajith and Mengzhou Xia and Yangsibo Huang and Daogao Liu and Terra Blevins and Danqi Chen and Luke Zettlemoyer},
      booktitle={The Twelfth International Conference on Learning Representations},
      year={2024},
      url={https://openreview.net/forum?id=zWqr3MQuNs}
}
```
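The split naming scheme described in the loading section (a length of 32, 64, or 128 combined with one of three variants) can be wrapped in a small helper that rejects invalid combinations before any download starts. This is a minimal sketch, not part of the authors' codebase: the names `wikimia_split_name`, `load_wikimia_split`, and `all_split_names` are hypothetical, and only `datasets.load_dataset` with its `split` argument is taken from the card.

```python
VALID_LENGTHS = (32, 64, 128)
VALID_VARIANTS = ("paraphrased", "perturbed", "paraphrased_perturbed")
REPO_ID = "zjysteven/WikiMIA_paraphrased_perturbed"


def wikimia_split_name(length, variant):
    """Build a split name such as 'WikiMIA_length32_paraphrased' (hypothetical helper)."""
    if length not in VALID_LENGTHS:
        raise ValueError(f"length must be one of {VALID_LENGTHS}, got {length!r}")
    if variant not in VALID_VARIANTS:
        raise ValueError(f"variant must be one of {VALID_VARIANTS}, got {variant!r}")
    return f"WikiMIA_length{length}_{variant}"


def load_wikimia_split(length, variant):
    """Load one split; requires the `datasets` package and network access."""
    # Imported lazily so that the name helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=wikimia_split_name(length, variant))


def all_split_names():
    """List the nine split names in the order the card's metadata lists them."""
    return [wikimia_split_name(n, v) for v in VALID_VARIANTS for n in VALID_LENGTHS]
```

Calling `load_wikimia_split(64, "perturbed")` would then be equivalent to the card's snippet with `LENGTH = 64` and `SPLIT_NAME = "perturbed"`.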

Provided by:

zjysteven

Summary of original information

Dataset overview

Dataset features

input: string.
label: integer (int64).

Dataset splits

WikiMIA_length32_paraphrased: 776 examples, 163365 bytes.
WikiMIA_length64_paraphrased: 542 examples, 224644 bytes.
WikiMIA_length128_paraphrased: 250 examples, 206645 bytes.
WikiMIA_length32_perturbed: 7760 examples, 1650773 bytes.
WikiMIA_length64_perturbed: 5420 examples, 2255354 bytes.
WikiMIA_length128_perturbed: 2500 examples, 2092896 bytes.
WikiMIA_length32_paraphrased_perturbed: 7760 examples, 1662467 bytes.
WikiMIA_length64_paraphrased_perturbed: 5420 examples, 2286059 bytes.
WikiMIA_length128_paraphrased_perturbed: 2500 examples, 2105242 bytes.

Dataset size

Download size: 3282711 bytes.
Dataset size: 12647445 bytes.
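The figures above can be cross-checked with a little arithmetic: the per-split byte counts should add up to the reported dataset size, and each perturbed split should hold 10 times as many examples as the paraphrased split of the same length, since the card states that every input was perturbed 10 times. The sketch below hard-codes the numbers from this listing; `check_card` is a hypothetical helper, and treating the paraphrased split's example count as the number of base inputs is an assumption.

```python
# (num_examples, num_bytes) per split, copied from the listing above.
CARD_SPLITS = {
    "WikiMIA_length32_paraphrased": (776, 163365),
    "WikiMIA_length64_paraphrased": (542, 224644),
    "WikiMIA_length128_paraphrased": (250, 206645),
    "WikiMIA_length32_perturbed": (7760, 1650773),
    "WikiMIA_length64_perturbed": (5420, 2255354),
    "WikiMIA_length128_perturbed": (2500, 2092896),
    "WikiMIA_length32_paraphrased_perturbed": (7760, 1662467),
    "WikiMIA_length64_paraphrased_perturbed": (5420, 2286059),
    "WikiMIA_length128_paraphrased_perturbed": (2500, 2105242),
}
CARD_DATASET_SIZE = 12647445
PERTURBATIONS_PER_INPUT = 10  # "For each input we have perturbed 10 times"


def check_card(splits=CARD_SPLITS, dataset_size=CARD_DATASET_SIZE,
               k=PERTURBATIONS_PER_INPUT):
    """Return a list of inconsistencies; an empty list means the metadata agrees."""
    problems = []
    total_bytes = sum(num_bytes for _, num_bytes in splits.values())
    if total_bytes != dataset_size:
        problems.append(f"bytes sum to {total_bytes}, expected {dataset_size}")
    for length in (32, 64, 128):
        # Assumption: the paraphrased split has one example per base input.
        base = splits[f"WikiMIA_length{length}_paraphrased"][0]
        for variant in ("perturbed", "paraphrased_perturbed"):
            n = splits[f"WikiMIA_length{length}_{variant}"][0]
            if n != k * base:
                problems.append(f"length {length} {variant}: {n} != {k} * {base}")
    return problems


print(check_card())  # → []
```

The download size (3282711 bytes) is smaller than the dataset size; the two numbers measure different things (the files fetched versus the data they contain), so they are not expected to match.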

Configuration

Default configuration: contains the data file paths for all splits.

License

MIT License.
