
MDD-TD: Large Language Model Text Source and Content Authenticity Detection Dataset

Source: DataCite Commons
Updated: 2025-11-05
Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=bb4f687be05549f791fe7d555ea9bd83
Description:
With the rapid advancement of large language models (LLMs), detecting LLM-generated text has attracted increasing attention. To address the scarcity of text detection datasets, this study constructs MDD-TD (Multi-Domain Text Detection Dataset), a high-quality, multi-domain dataset built around two detection tasks: source detection and content authenticity verification. The data come from three kinds of sources: translation-optimized open-source data, web-crawled open-source data, and prompt-augmented synthetic data. The translation corpora were derived from the SimpleAI/HC3 dataset by selecting high-quality responses, then translating and refining them. The web-crawled corpora were obtained by scraping and curating data from the Weibo and Douban platforms. The synthetic data were generated with rule-driven methods that apply various prompt strategies to the existing translated and web-sourced data. For quality control, perplexity (PPL) filtering was applied first: language models were used to compute the perplexity distribution, and texts with abnormal perplexity were removed. Semantic-similarity-based deduplication then preserved diversity, and manual review was used to ensure data authenticity and reliability. In the end, 30,000 high-quality data points were selected and stored in JSON format, each containing the text, a source label, and an authenticity label, and grouped into three core categories: Q&A, comments, and news texts. The MDD-TD dataset is intended to support text source tracing and content authenticity verification, including research and applications in large-model tasks such as generated-text detection and misinformation governance, with the aim of improving model credibility and security.
Provider:
Science Data Bank
Created:
2025-10-17