
avduarte333/arXivTection

Source: Hugging Face. Updated 2024-02-16; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/avduarte333/arXivTection
Resource overview:
---
license: mit
task_categories:
- question-answering
language:
- en
---

# 📄 arXivTection Dataset

The arXivTection dataset is a benchmark for the task of detecting pretraining data from large language models.

The dataset consists of 50 research papers extracted from arXiv.

- 25 published in 2023: Non-Training data, "_label_" column = 0.
- 25 published before 2022: Training data, "_label_" column = 1.

From each paper, ≈ 30 passages are extracted. Each passage is paraphrased 3 times using the language model Claude v2.0. <br>
The "_Answer_" column indicates which of the passages is the real excerpt.<br>
Passages are extracted to be on average ≈ 128 tokens in length.

<br>

# 🧪 Testing Models on arXivTection

Our dataset is intended for use in a multiple-choice question-answering format. Nonetheless, it can also be used with other pretraining data detection methods.<br>
Our [GitHub](https://github.com/avduarte333/DE-COP_Method) repository contains example scripts to evaluate models on our dataset.

<br>

# 🤝 Compatibility

The multiple-choice question-answering task with our dataset is designed to be applied to various models, such as:<br>

- LLaMA-2
- Mistral
- Mixtral
- Chat-GPT (gpt-3.5-turbo-instruct)
- GPT-3 (text-davinci-003)
- Claude

<br>

# 🔧 Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("avduarte333/arXivTection")
```

<br>

# 💬 Citation

```bibtex
@misc{duarte2024decop,
      title={{DE-COP: Detecting Copyrighted Content in Language Models Training Data}},
      author={André V. Duarte and Xuandong Zhao and Arlindo L. Oliveira and Lei Li},
      year={2024},
      eprint={2402.09910},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

<details>
<summary>🎓 Research Papers Used</summary>

[1] Attanasio, G., Plaza-del-Arco, F. M., Nozza, D., & Lauscher, A. (2023). A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation. arXiv preprint arXiv:2310.12127.
<br>
[2] Shi, Y., Wu, L., & Shao, M. (2023). Adaptive End-to-End Metric Learning for Zero-Shot Cross-Domain Slot Filling. arXiv preprint arXiv:2310.15294. <br>
[3] Keleg, A., Goldwater, S., & Magdy, W. (2023). ALDi: Quantifying the arabic level of dialectness of text. arXiv preprint arXiv:2310.13747. <br>
[4] Su, Y., Ji, Y., Li, J., Ye, H., & Zhang, M. (2023, December). Beware of Model Collapse! Fast and Stable Test-time Adaptation for Robust Question Answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 12998-13011). <br>
[5] Chang, Y., Lo, K., Goyal, T., & Iyyer, M. (2023). BooookScore: A systematic exploration of book-length summarization in the era of LLMs. arXiv preprint arXiv:2310.00785. <br>
[6] Karamolegkou, A., Li, J., Zhou, L., & Søgaard, A. (2023). Copyright Violations and Large Language Models. arXiv preprint arXiv:2310.13771. <br>
[7] Weissweiler, L., Hofmann, V., Kantharuban, A., Cai, A., Dutt, R., Hengle, A., ... & Mortensen, D. R. (2023). Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model. arXiv preprint arXiv:2310.15113. <br>
[8] Li, Z., & Zhang, Y. (2023, December). Cultural Concept Adaptation on Multimodal Reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 262-276). <br>
[9] Jiang, W., Mao, Q., Li, J., Lin, C., Yang, W., Deng, T., & Wang, Z. (2023). DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining. arXiv preprint arXiv:2305.12074. <br>
[10] Zhu, Y., Si, J., Zhao, Y., Zhu, H., Zhou, D., & He, Y. (2023). EXPLAIN, EDIT, GENERATE: Rationale-Sensitive Counterfactual Data Augmentation for Multi-hop Fact Verification. arXiv preprint arXiv:2310.14508. <br>
[11] Hada, R., Seth, A., Diddee, H., & Bali, K. (2023). ''Fifty Shades of Bias'': Normative Ratings of Gender Bias in GPT Generated English Text. arXiv preprint arXiv:2310.17428. <br>
[12] Song, Y., & Dhariwal, P. (2023). Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189. <br>
[13] Xu, W., Wang, D., Pan, L., Song, Z., Freitag, M., Wang, W. Y., & Li, L. (2023). Instructscore: Towards explainable text generation evaluation with automatic feedback. arXiv preprint arXiv:2305.14282. <br>
[14] Majumder, B. P., He, Z., & McAuley, J. (2022). InterFair: Debiasing with Natural Language Feedback for Fair Interpretable Predictions. arXiv preprint arXiv:2210.07440. <br>
[15] Yang, Z., Feng, R., Zhang, H., Shen, Y., Zhu, K., Huang, L., ... & Cheng, F. (2023). Eliminating Lipschitz Singularities in Diffusion Models. arXiv preprint arXiv:2306.11251. <br>
[16] Li, J., Zhang, M., Guo, P., Zhang, M., & Zhang, Y. (2023). LLM-enhanced Self-training for Cross-domain Constituency Parsing. arXiv preprint arXiv:2311.02660. <br>
[17] Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., & Gao, J. (2023). Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. arXiv preprint arXiv:2310.01801. <br>
[18] Eustratiadis, P., Dudziak, Ł., Li, D., & Hospedales, T. (2023). Neural Fine-Tuning Search for Few-Shot Learning. arXiv preprint arXiv:2306.09295. <br>
[19] Zhang, Y., Zhang, Y., Cui, L., & Fu, G. (2023). Non-autoregressive text editing with copy-aware latent alignments. arXiv preprint arXiv:2310.07821. <br>
[20] Tu, H., Li, Y., Mi, F., & Yang, Z. (2023). ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue. arXiv preprint arXiv:2305.13602. <br>
[21] Deng, Y., Zhang, W., Pan, S. J., & Bing, L. (2023). SOUL: Towards Sentiment and Opinion Understanding of Language. arXiv preprint arXiv:2310.17924. <br>
[22] Singh, G., Ghosh, S., Verma, A., Painkra, C., & Ekbal, A. (2023, December). Standardizing Distress Analysis: Emotion-Driven Distress Identification and Cause Extraction (DICE) in Multimodal Online Posts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 4517-4532). <br>
[23] Cao, Q., Kojima, T., Matsuo, Y., & Iwasawa, Y. (2023, December). Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 8898-8913). <br>
[24] Cao, H., Yuan, L., Zhang, Y., & Ng, H. T. (2023, December). Unsupervised Grammatical Error Correction Rivaling Supervised Methods. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 3072-3088). <br>
[25] Xu, S., Staufer, L., Ichim, O., Heri, C., & Grabmair, M. (2023). Vechr: A dataset for explainable and robust classification of vulnerability type in the european court of human rights. arXiv preprint arXiv:2310.11368. <br>
[26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. <br>
[27] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. <br>
[28] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. <br>
[29] Sarzynska-Wawer, J., Wawer, A., Pawlak, A., Szymanowska, J., Stefaniak, I., Jarkiewicz, M., & Okruszek, L. (2021). Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research, 304, 114135. <br>
[30] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). <br>
[31] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. <br>
[32] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485-5551. <br>
[33] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28. <br>
[34] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27. <br>
[35] Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). <br>
[36] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9). <br>
[37] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248-255). Ieee. <br>
[38] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25. <br>
[39] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901. <br>
[40] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9. <br>
[41] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. <br>
[42] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. <br>
[43] Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. <br>
[44] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27. <br>
[45] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. <br>
[46] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 (pp. 234-241). Springer International Publishing. <br>
[47] Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. <br>
[48] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. <br>
[49] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32. <br>
[50] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788). <br>
</details>
Provider:
avduarte333
Summary of original information

📄 arXivTection Dataset

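Overview

The arXivTection dataset is a benchmark for detecting the pretraining data of large language models. It contains 50 research papers extracted from arXiv, split into two groups:

  • 25 papers published in 2023, treated as non-training data; the "label" column is 0.
  • 25 papers published before 2022, treated as training data; the "label" column is 1.

About 30 passages are extracted from each paper, and each passage is paraphrased three times with the language model Claude v2.0. The "Answer" column indicates which passage is the real excerpt. Passages average about 128 tokens in length.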

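Use cases

The dataset is intended for use in a multiple-choice question-answering format, but it is also compatible with other pretraining data detection methods.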
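As a rough illustration of the multiple-choice format, the sketch below turns one row into a four-option question. The passage column names (`Example_A` to `Example_D`), the letter-valued `Answer` field, the question wording, and the `build_prompt` helper are all assumptions made for this example; only the "Answer" and "label" columns are named in the dataset description, and the official evaluation scripts may work differently.

```python
import random

# Hypothetical row layout: the passage column names are assumptions;
# only "Answer" and "label" are named in the dataset description.
row = {
    "Example_A": "Paraphrase one of the passage.",
    "Example_B": "The verbatim excerpt from the paper.",
    "Example_C": "Paraphrase two of the passage.",
    "Example_D": "Paraphrase three of the passage.",
    "Answer": "B",  # assumed to hold the letter of the real excerpt
    "label": 1,     # 1 = paper assumed to be in the training data
}

LETTERS = ["A", "B", "C", "D"]


def build_prompt(row, rng):
    """Shuffle the four candidates and return (prompt, letter of the real excerpt)."""
    options = [(row[f"Example_{l}"], l == row["Answer"]) for l in LETTERS]
    rng.shuffle(options)  # shuffling avoids a fixed position for the real excerpt
    lines = ["Which of the following passages is the verbatim excerpt from the paper?"]
    correct = None
    for letter, (text, is_real) in zip(LETTERS, options):
        lines.append(f"{letter}. {text}")
        if is_real:
            correct = letter
    lines.append("Answer:")
    return "\n".join(lines), correct


prompt, correct = build_prompt(row, random.Random(0))
print(prompt)
print("Correct option:", correct)
```

Shuffling the options per question is one simple way to keep a model's preference for a particular letter from inflating or deflating its score.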

Compatibility

The dataset can be used with a variety of models, including:

  • LLaMA-2
  • Mistral
  • Mixtral
  • Chat-GPT (gpt-3.5-turbo-instruct)
  • GPT-3 (text-davinci-003)
  • Claude

Loading the dataset

    from datasets import load_dataset

    dataset = load_dataset("avduarte333/arXivTection")
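Once the rows are loaded and a model has picked an option for each question, one simple check is to compare multiple-choice accuracy between the two label groups. The sketch below uses toy records instead of the real dataset; the `prediction` field and the `accuracy_by_label` helper are hypothetical names introduced for illustration, and this is not presented as the authors' evaluation procedure.

```python
from collections import defaultdict

# Toy records standing in for dataset rows plus a model's chosen option.
# "prediction" is a hypothetical field; "Answer" and "label" follow the
# column names in the dataset description.
records = [
    {"label": 1, "Answer": "A", "prediction": "A"},
    {"label": 1, "Answer": "C", "prediction": "C"},
    {"label": 1, "Answer": "B", "prediction": "D"},
    {"label": 0, "Answer": "D", "prediction": "A"},
    {"label": 0, "Answer": "B", "prediction": "B"},
    {"label": 0, "Answer": "A", "prediction": "C"},
]


def accuracy_by_label(records):
    """Return {label: fraction of questions where prediction == Answer}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["label"]] += 1
        hits[r["label"]] += int(r["prediction"] == r["Answer"])
    return {label: hits[label] / totals[label] for label in totals}


acc = accuracy_by_label(records)
print(acc)
```

With four options per question, random guessing gives an accuracy of about 0.25, so that is a natural baseline when reading the per-label numbers.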

Citation

    @misc{duarte2024decop,
          title={{DE-COP: Detecting Copyrighted Content in Language Models Training Data}},
          author={André V. Duarte and Xuandong Zhao and Arlindo L. Oliveira and Lei Li},
          year={2024},
          eprint={2402.09910},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
