
imperial-cpg/arxiv_redpajama_2302

Source: Hugging Face. Updated: 2024-10-08. Indexed: 2025-04-26.
Download link:
https://hf-mirror.com/datasets/imperial-cpg/arxiv_redpajama_2302
Resource description:
---
dataset_info:
  features:
  - name: meta
    struct:
    - name: arxiv_id
      dtype: string
    - name: language
      dtype: string
    - name: timestamp
      dtype: string
    - name: url
      dtype: string
    - name: yymm
      dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 857168232
    num_examples: 13155
  download_size: 382068275
  dataset_size: 857168232
---

# ArXiv papers from RedPajama-Data originally published in February 2023

We collect the ArXiv papers released shortly before the training data cutoff date for the [OpenLLaMA models](https://huggingface.co/openlm-research/open_llama_7b).

The OpenLLaMA models (V1) were trained on [RedPajama data](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T). The last batch of ArXiv papers included in that dataset consists of papers published in February 2023. To obtain members close to the cutoff date, we collect the 13,155 papers labeled "2302" that are part of the training dataset. We process the raw LaTeX files using this [script](https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/run_clean.py).

This dataset has been used as the source of 'member' documents for developing document-level membership inference attacks (MIAs) against LLMs, using data collected shortly before (member) and shortly after (non-member) the training cutoff date of the target model ([the suite of OpenLLaMA models](https://huggingface.co/openlm-research/open_llama_7b)). For the non-members in the RDD setup, we refer to our [GitHub repo](https://github.com/computationalprivacy/mia_llms_benchmark/tree/main/document_level). For more details and results, see the section on Regression Discontinuity Design (RDD) in the paper ["SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)"](https://arxiv.org/pdf/2406.17975).
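The card does not include a loading example, so the following is a minimal sketch of how the member documents might be read with the Hugging Face `datasets` library. It assumes the schema listed in the card metadata (a `meta` struct with a `yymm` field, plus a `text` field) and a single `train` split. The helper names `load_members`, `is_valid_member`, and `iter_members` are hypothetical and not part of the dataset or the authors' code.

```python
from typing import Any, Dict, Iterable, Iterator

# Dataset identifier as given in the card title.
DATASET_ID = "imperial-cpg/arxiv_redpajama_2302"


def load_members(streaming: bool = True):
    """Load the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=streaming)


def is_valid_member(record: Dict[str, Any], yymm: str = "2302") -> bool:
    """Return True if the record has a string `text` and the expected `yymm` tag."""
    meta = record.get("meta") or {}
    return isinstance(record.get("text"), str) and meta.get("yymm") == yymm


def iter_members(
    records: Iterable[Dict[str, Any]], yymm: str = "2302"
) -> Iterator[Dict[str, Any]]:
    """Yield only the records that pass the schema and month check."""
    for record in records:
        if is_valid_member(record, yymm):
            yield record
```

With network access, iterating over `iter_members(load_members())` should yield records whose `meta["arxiv_id"]` and `text` fields can then be passed to whatever attack or analysis is being evaluated. Streaming is used by default in this sketch so that the roughly 382 MB download does not have to complete before the first record is available.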
Provider:
imperial-cpg