darklight03/StackMIA

Name: darklight03/StackMIA
Creator: darklight03
Published: 2024-05-21 02:23:44
License: MIT

Source: Hugging Face. Updated 2024-05-21; indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/darklight03/StackMIA


Description:

---
license: mit
language:
- en
size_categories:
- 1K<n<10K
---

## Overview

The **StackMIA** dataset serves as a dynamic dataset framework for membership inference attack (MIA) research.

> 1. **StackMIA** is built on the [Stack Exchange](https://archive.org/details/stackexchange) corpus, which is widely used for pre-training.
> 2. **StackMIA** provides fine-grained release times (timestamps) to ensure reliability and applicability to newly released LLMs.

See our paper (to be released) for a detailed description.

## Applicability

Our dataset supports most white- and black-box Large Language Models (LLMs) that are <span style="color:red;">pretrained on the Stack Exchange corpus</span>:

- **Black-box OpenAI models:**
  - *text-davinci-001*
  - *text-davinci-002*
  - *...*
- **White-box models:**
  - *LLaMA and LLaMA2*
  - *Pythia*
  - *GPT-Neo*
  - *GPT-J*
  - *OPT*
  - *StableLM*
  - *Falcon*
  - *...*

Based on our [StackMIAsub](https://huggingface.co/datasets/darklight03/StackMIAsub), researchers can <span style="color:blue;">dynamically construct non-member components</span> from the fine-grained timestamps to adapt to LLMs released at different times.

## Related repo

To run our PAC method for membership inference attacks, visit our [code repo](https://github.com/yyy01/PAC).

## Cite our work

⭐️ If you find our dataset helpful, please cite our work:

```bibtex
@misc{ye2024data,
  title={Data Contamination Calibration for Black-box LLMs},
  author={Wentao Ye and Jiaqi Hu and Liyao Li and Haobo Wang and Gang Chen and Junbo Zhao},
  year={2024},
  eprint={2405.11930},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```

Provider:

darklight03

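Summary of original information

Dataset overview

StackMIA is a dynamic dataset framework for membership inference attack (MIA) research. It is built on the Stack Exchange corpus and provides fine-grained timestamps to ensure reliability and applicability to newly released LLMs.

Applicability

The dataset supports most white-box and black-box large language models (LLMs) pretrained on the Stack Exchange corpus, including:

Black-box models: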
- text-davinci-001
- text-davinci-002
- ...
White-box models:
- LLaMA 和 LLaMA2
- Pythia
- GPT-Neo
- GPT-J
- OPT
- StableLM
- Falcon
- ...

Researchers can dynamically construct non-member components from the fine-grained timestamps to adapt to LLMs released at different times.
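
The timestamp-based construction described above might look roughly like the following minimal Python sketch. It is not the authors' implementation. The record layout, the `timestamp` field name, the `split_by_cutoff` helper, and the cutoff date are all hypothetical, chosen only for illustration. The idea is that any record posted after a model's training-data cutoff cannot have been seen during pre-training, so it can serve as a non-member. Records posted before the cutoff are only candidate members.

```python
from datetime import datetime, timezone


def split_by_cutoff(records, cutoff, ts_key="timestamp"):
    """Split records into (candidate_members, non_members) around a cutoff.

    `records` is a list of dicts with an ISO-8601 timestamp string under
    `ts_key` (a hypothetical field name, not the dataset's actual schema).
    Records strictly newer than `cutoff` are treated as non-members.
    """
    candidates, non_members = [], []
    for rec in records:
        ts = datetime.fromisoformat(rec[ts_key])
        # Treat naive timestamps as UTC so they compare cleanly with the cutoff.
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        if ts > cutoff:
            non_members.append(rec)
        else:
            candidates.append(rec)
    return candidates, non_members


# Made-up sample records, for illustration only.
sample = [
    {"id": 1, "timestamp": "2021-03-01T12:00:00", "text": "older post"},
    {"id": 2, "timestamp": "2023-08-15T09:30:00", "text": "newer post"},
]

# Hypothetical training-data cutoff for the target model.
assumed_cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)

candidate_members, non_members = split_by_cutoff(sample, assumed_cutoff)
```

Moving `assumed_cutoff` forward or backward is what makes the split adaptable to models released at different times. The actual training cutoff of a given model has to come from its own documentation.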
