
darklight03/StackMIA

Hugging Face · Updated 2024-05-21 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/darklight03/StackMIA
Description:
---
license: mit
language:
- en
size_categories:
- 1K<n<10K
---

## Overview

The **StackMIA** dataset serves as a dynamic dataset framework for membership inference attack (MIA) research.

> 1. **StackMIA** is built on the [Stack Exchange](https://archive.org/details/stackexchange) corpus, which is widely used for pre-training.
> 2. **StackMIA** provides fine-grained release times (timestamps) to ensure reliability and applicability to newly released LLMs.

See our paper (to-be-released) for a detailed description.

## Applicability

Our dataset supports most white- and black-box Large Language Models (LLMs) that are <span style="color:red;">pretrained with the Stack Exchange corpus</span>:

- **Black-box OpenAI models:**
  - *text-davinci-001*
  - *text-davinci-002*
  - *...*
- **White-box models:**
  - *LLaMA and LLaMA2*
  - *Pythia*
  - *GPT-Neo*
  - *GPT-J*
  - *OPT*
  - *StableLM*
  - *Falcon*
  - *...*

Based on our [StackMIAsub](https://huggingface.co/datasets/darklight03/StackMIAsub), researchers can <span style="color:blue;">dynamically construct non-member components</span> using the fine-grained timestamps, adapting the benchmark to LLMs released at different times.

## Related repo

To run our PAC method for membership inference attacks, visit our [code repo](https://github.com/yyy01/PAC).

## Cite our work

⭐️ If you find our dataset helpful, please cite our work:

```bibtex
@misc{ye2024data,
      title={Data Contamination Calibration for Black-box LLMs},
      author={Wentao Ye and Jiaqi Hu and Liyao Li and Haobo Wang and Gang Chen and Junbo Zhao},
      year={2024},
      eprint={2405.11930},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
Provided by:
darklight03
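Summary of the original information

Dataset overview

The StackMIA dataset is a dynamic dataset framework for membership inference attack (MIA) research. It is built on the Stack Exchange corpus and provides fine-grained timestamps to ensure reliability and applicability to newly released LLMs.

Applicability

The dataset supports most white-box and black-box large language models (LLMs) pretrained with the Stack Exchange corpus, including:

  • Black-box models:
    • text-davinci-001
    • text-davinci-002
    • ...
  • White-box models:
    • LLaMA and LLaMA2
    • Pythia
    • GPT-Neo
    • GPT-J
    • OPT
    • StableLM
    • Falcon
    • ...

Researchers can use the fine-grained timestamps to dynamically construct non-member components, adapting the dataset to LLMs released at different times.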
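
As a rough illustration of the timestamp-based construction described above, the sketch below keeps only records created after a chosen cutoff date, so they could not have appeared in a target model's pre-training data. The record layout (a `timestamp` field holding an ISO 8601 string), the function name `select_non_members`, and the cutoff date are all assumptions made for this example; they are not taken from the dataset's actual schema or from the authors' code.

```python
from datetime import datetime


def select_non_members(records, cutoff):
    """Return the records whose timestamp is strictly later than `cutoff`.

    `records` is a list of dicts with a hypothetical "timestamp" key
    (ISO 8601 string); `cutoff` is a datetime standing in for the target
    model's training-data cutoff or release date.
    """
    selected = []
    for record in records:
        created = datetime.fromisoformat(record["timestamp"])
        if created > cutoff:
            selected.append(record)
    return selected


# Toy records standing in for Stack Exchange posts (made up for this example).
records = [
    {"id": 1, "timestamp": "2022-06-01T00:00:00"},
    {"id": 2, "timestamp": "2023-03-15T08:30:00"},
    {"id": 3, "timestamp": "2024-01-10T12:00:00"},
]

# Assumed cutoff for a hypothetical target model.
cutoff = datetime(2023, 1, 1)

non_members = select_non_members(records, cutoff)
print([r["id"] for r in non_members])  # prints [2, 3]
```

Choosing a later cutoff for a newer model shrinks the non-member set accordingly, which is why fine-grained timestamps matter here.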
