darklight03/StackMIA
Hugging Face · Updated 2024-05-21 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/darklight03/StackMIA
Resource description:
---
license: mit
language:
- en
size_categories:
- 1K<n<10K
---
## Overview
The **StackMIA** dataset serves as a dynamic dataset framework for membership inference attack (MIA) research.
> 1. **StackMIA** is built on the [Stack Exchange](https://archive.org/details/stackexchange) corpus, which is widely used for pre-training.
> 2. **StackMIA** provides fine-grained release times (timestamps) to ensure reliability and applicability to newly released LLMs.
See our paper (to be released) for a detailed description.
## Applicability
Our dataset supports most white-box and black-box large language models (LLMs) that are <span style="color:red;">pretrained on the Stack Exchange corpus</span>:
- **Black-box OpenAI models:**
  - *text-davinci-001*
  - *text-davinci-002*
  - *...*
- **White-box models:**
  - *LLaMA and LLaMA2*
  - *Pythia*
  - *GPT-Neo*
  - *GPT-J*
  - *OPT*
  - *StableLM*
  - *Falcon*
  - *...*
Using our [StackMIAsub](https://huggingface.co/datasets/darklight03/StackMIAsub), researchers can <span style="color:blue;">dynamically construct non-member components</span> based on fine-grained timestamps to adapt to LLMs released at different times.
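
The timestamp-based construction can be illustrated with a minimal sketch. It assumes each record carries an ISO 8601 string under a `timestamp` key; that field name, the sample records, and the release date below are hypothetical and are not taken from the dataset schema. The idea is that text posted after a model's release cannot have been in its training data, so it can serve as non-member data, while earlier text is only a candidate member.

```python
from datetime import datetime


def split_by_release(records, model_release, timestamp_key="timestamp"):
    """Partition records into those posted on/before and after a release date.

    `timestamp_key` is an assumed field name; adjust it to the real schema.
    """
    before, after = [], []
    for record in records:
        posted = datetime.fromisoformat(record[timestamp_key])
        # Strictly later than the release date: safe to treat as non-member.
        (after if posted > model_release else before).append(record)
    return before, after


# Hypothetical records for illustration only.
records = [
    {"id": 1, "timestamp": "2022-11-30T08:15:00"},
    {"id": 2, "timestamp": "2023-07-19T10:00:00"},
    {"id": 3, "timestamp": "2024-01-05T21:42:00"},
]

# Illustrative release date; substitute the date of the model under test.
model_release = datetime(2023, 7, 18)

candidates, non_members = split_by_release(records, model_release)
print([r["id"] for r in non_members])
```

Changing `model_release` is all that is needed to rebuild the non-member set for a model released at a different time.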
## Related repo
To run our PAC method for membership inference attacks, visit our [code repo](https://github.com/yyy01/PAC).
## Cite our work
⭐️ If you find our dataset helpful, please cite our work:
```bibtex
@misc{ye2024data,
title={Data Contamination Calibration for Black-box LLMs},
author={Wentao Ye and Jiaqi Hu and Liyao Li and Haobo Wang and Gang Chen and Junbo Zhao},
year={2024},
eprint={2405.11930},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
Provider:
darklight03
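Summary of original information
Dataset overview
The StackMIA dataset is a dynamic dataset framework for membership inference attack (MIA) research. It is built on the Stack Exchange corpus and provides fine-grained timestamps to ensure reliability and applicability to newly released LLMs.
Applicability
The dataset supports most white-box and black-box large language models (LLMs) pretrained on the Stack Exchange corpus, including:
- Black-box models:
  - text-davinci-001
  - text-davinci-002
  - ...
- White-box models:
  - LLaMA and LLaMA2
  - Pythia
  - GPT-Neo
  - GPT-J
  - OPT
  - StableLM
  - Falcon
  - ...
Researchers can dynamically construct non-member components based on the fine-grained timestamps to adapt to LLMs released at different times.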



