knarasi1/student_and_llm_essays

Source: Hugging Face. Updated 2024-02-16. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/knarasi1/student_and_llm_essays
Resource description:
# Dataset Card for Academic Essay Prompt-Completion Pairs

## Dataset Description

This dataset is designed to distinguish between essays authored by students and those generated by Large Language Models (LLMs), offering an essential resource for researchers and practitioners in natural language processing, educational technology, and academic integrity. Hosted on Huggingface, it supports the development and evaluation of models aimed at identifying the origin of textual content, which is crucial for a variety of applications including enhancing automated grading systems and detecting AI-generated text in academic submissions.

### Dataset Overview

The dataset consists of prompt-completion pairs, carefully crafted to simulate real-world academic writing scenarios. Each entry within the dataset is uniquely structured, encapsulated by `<s>` and `</s>` tags, ensuring a standardized format. Within these tags, the prompt is specified between `[INST]` and `[/INST]` tags, comprising the Source Text, Essay Instructions, and the Essay. The completion, positioned outside the `[INST]` and `[/INST]` tags but still within the `<s>` and `</s>` encapsulation, categorically states the essay's origin: either "This essay was written by an actual student." or "This essay was generated by a Large Language Model." This arrangement provides a nuanced classification task, focusing on discerning student-written essays from machine-generated ones.

### Structure

Upon utilizing the `load_dataset` command on Huggingface to access the dataset, users will encounter two primary splits:

- **Train:** Accounts for approximately 70% of the dataset, tailored for the training of machine learning models.
- **Test:** Comprises the remaining 30%, designated for the assessment of the models' performance.

#### Fields

Each entry in the dataset is meticulously structured to include:

- **Prompt:** Located within `[INST]` and `[/INST]` tags and encapsulated by `<s>` and `</s>` tags, the prompt includes the Source Text, Essay Instructions, and the Essay.
- **Completion:** Situated outside the instructional tags yet within the `<s>` and `</s>` encapsulation, the completion provides a definitive statement regarding the essay's authorship, indicating it was either "student-written" or "machine-generated."

### Use Cases

This dataset is exceptionally suited for:

- Crafting algorithms that can autonomously distinguish between human-authored and AI-generated text.
- Reinforcing academic integrity tools by identifying submissions that may be AI-generated.
- Enhancing the capabilities of automated essay scoring systems by introducing them to a wide variety of textual origins.
- Conducting in-depth research in natural language understanding, particularly in exploring the stylistic and content-based differences between human and AI authors.

### Accessing the Dataset

To access and load the dataset into Python environments, use the following command through Huggingface's `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("knarasi1/student_and_llm_essays")
```

### Acknowledgments

This dataset represents a collective endeavor to foster innovation and uphold integrity in academic writing and research. It underscores the community's dedication to improving interactions between humans and AI within educational frameworks.

### Disclaimer

Dataset users are urged to employ this resource ethically and responsibly, especially in light of its potential impact on educational and research settings. The creators of the dataset and Huggingface explicitly discourage the misuse of AI-generated text for academic dishonesty or any form of deception.
Provider:
knarasi1
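Summary of Original Information

Dataset Card: Academic Essay Prompt-Completion Pairs

Dataset Description

This dataset is designed to distinguish essays written by students from essays generated by Large Language Models (LLMs), and serves as a resource for researchers and practitioners in natural language processing, educational technology, and academic integrity. It supports the development and evaluation of models that identify the origin of textual content, which matters for applications such as improving automated grading systems and detecting AI-generated text in academic submissions.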

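Dataset Overview

The dataset consists of prompt-completion pairs designed to simulate real-world academic writing scenarios. Every entry uses a standardized format and is wrapped in `<s>` and `</s>` tags. Inside these tags, the prompt sits between the `[INST]` and `[/INST]` tags and contains the Source Text, the Essay Instructions, and the Essay. The completion sits outside the `[INST]` and `[/INST]` tags but still inside the `<s>` and `</s>` wrapper, and states the essay's origin as either "This essay was written by an actual student." or "This essay was generated by a Large Language Model." This arrangement defines a classification task: telling student-written essays apart from machine-generated ones.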
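The entry layout described above can be illustrated with a small parser. This is a minimal sketch, assuming each entry is available as a single string in the documented `<s>[INST] ... [/INST] ... </s>` layout; the names `ParsedEntry` and `parse_entry` are hypothetical, and the sample string is made up for illustration rather than taken from the dataset.

```python
import re
from typing import NamedTuple, Optional


class ParsedEntry(NamedTuple):
    prompt: str       # text between [INST] and [/INST]
    completion: str   # text after [/INST] and before </s>


# One entry in the documented layout: <s>[INST] prompt [/INST] completion </s>
_ENTRY_RE = re.compile(
    r"<s>\s*\[INST\](?P<prompt>.*?)\[/INST\](?P<completion>.*?)</s>",
    re.DOTALL,  # essays span multiple lines, so "." must match newlines
)


def parse_entry(text: str) -> Optional[ParsedEntry]:
    """Split one raw entry into prompt and completion; None if the layout does not match."""
    match = _ENTRY_RE.search(text)
    if match is None:
        return None
    return ParsedEntry(match["prompt"].strip(), match["completion"].strip())


# Made-up sample in the documented layout (not an actual dataset row).
sample = (
    "<s>[INST] Source Text: (text) Essay Instructions: (instructions) "
    "Essay: (essay) [/INST] This essay was written by an actual student.</s>"
)
entry = parse_entry(sample)
print(entry.completion)
```

The non-greedy prompt group stops at the first `[/INST]`, so an essay that itself contained that literal tag would be split too early; that edge case is not handled here.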

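Structure

When the dataset is accessed with the `load_dataset` command on Huggingface, users will find two main splits:

  • Train: About 70% of the dataset, intended for training machine learning models.
  • Test: The remaining 30%, used to evaluate model performance.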
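One way to check the stated proportions after loading is to compare each split's row count with the total. The helper below is a sketch: `split_fractions` is a hypothetical name, the row counts are invented to illustrate a 70/30 split, and the lowercase split names `train` and `test` are an assumption based on common Hugging Face conventions rather than something the card states.

```python
from typing import Dict, Mapping


def split_fractions(split_sizes: Mapping[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of rows."""
    total = sum(split_sizes.values())
    if total == 0:
        raise ValueError("no rows in any split")
    return {name: size / total for name, size in split_sizes.items()}


# Hypothetical row counts, chosen only to illustrate a 70/30 split.
example_sizes = {"train": 700, "test": 300}
print(split_fractions(example_sizes))

# With the real dataset (needs the `datasets` package and network access):
#   from datasets import load_dataset
#   dataset = load_dataset("knarasi1/student_and_llm_essays")
#   print(split_fractions({name: len(split) for name, split in dataset.items()}))
```

Since the card says "approximately 70%", the real fractions may differ slightly from 0.7 and 0.3.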

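Fields

Each entry in the dataset is structured to include:

  • Prompt: Located between the `[INST]` and `[/INST]` tags and wrapped in `<s>` and `</s>` tags; it contains the Source Text, the Essay Instructions, and the Essay.
  • Completion: Located outside the instruction tags but still inside the `<s>` and `</s>` wrapper; it gives a definitive statement of the essay's authorship, indicating whether it was "student-written" or "machine-generated."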
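For training a classifier, the completion text usually has to be turned into a numeric label. The sketch below uses the two completion sentences quoted in the dataset card's overview; the function name `completion_to_label` and the 0/1 encoding are assumptions made for illustration, not part of the dataset.

```python
from typing import Optional

# The two completion sentences quoted in the dataset card.
STUDENT_COMPLETION = "This essay was written by an actual student."
LLM_COMPLETION = "This essay was generated by a Large Language Model."


def completion_to_label(completion: str) -> Optional[int]:
    """Map a completion to 0 (student) or 1 (LLM); None if it matches neither sentence."""
    # Collapse whitespace and ignore case so minor formatting differences still match.
    normalized = " ".join(completion.split()).lower()
    if normalized == STUDENT_COMPLETION.lower():
        return 0
    if normalized == LLM_COMPLETION.lower():
        return 1
    return None


print(completion_to_label("This essay was written by an actual student."))
print(completion_to_label("This essay was generated by a Large Language Model."))
```

Returning `None` for unrecognized text, instead of guessing a class, makes malformed rows easy to spot and filter out before training.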

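Use Cases

This dataset is well suited for:

  • Developing algorithms that can automatically distinguish human-authored text from AI-generated text.
  • Strengthening academic integrity tools by identifying submissions that may be AI-generated.
  • Improving automated essay scoring systems by exposing them to text from a variety of origins.
  • Conducting research in natural language understanding, particularly on the stylistic and content differences between human and AI authors.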

Accessing the Dataset

To access and load the dataset in a Python environment, use the following command from Huggingface's `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("knarasi1/student_and_llm_essays")
```

The content above was collected and summarized by 遇见数据集.