knarasi1/student_and_llm_essays

Name: knarasi1/student_and_llm_essays
Creator: knarasi1
Published: 2024-02-16 19:42:26
License: No description available

Source: Hugging Face. Updated 2024-02-16. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/knarasi1/student_and_llm_essays

Resource overview:

# Dataset Card for Academic Essay Prompt-Completion Pairs

## Dataset Description

This dataset is designed for distinguishing essays written by students from essays generated by Large Language Models (LLMs). It is intended as a resource for researchers and practitioners in natural language processing, educational technology, and academic integrity. Hosted on Hugging Face, it supports the development and evaluation of models that identify the origin of a text, with applications such as improving automated grading systems and detecting AI-generated text in academic submissions.

### Dataset Overview

The dataset consists of prompt-completion pairs that simulate real-world academic writing scenarios. Each entry is wrapped in `<s>` and `</s>` tags. Inside that wrapper, the prompt sits between `[INST]` and `[/INST]` tags and comprises the Source Text, the Essay Instructions, and the Essay. The completion follows the `[/INST]` tag, still inside the `<s>` and `</s>` wrapper, and states the essay's origin as one of two sentences: "This essay was written by an actual student." or "This essay was generated by a Large Language Model." The task is therefore a binary classification: student-written versus machine-generated.

### Structure

Loading the dataset with the `load_dataset` function from the Hugging Face `datasets` library yields two splits:

- **Train:** Approximately 70% of the dataset, intended for training machine learning models.
- **Test:** The remaining 30%, intended for evaluating model performance.

#### Fields

Each entry contains:

- **Prompt:** Located between the `[INST]` and `[/INST]` tags, inside the `<s>` and `</s>` wrapper. It includes the Source Text, the Essay Instructions, and the Essay.
- **Completion:** Located after the `[/INST]` tag, still inside the `<s>` and `</s>` wrapper. It states the essay's authorship using one of the two sentences quoted in the overview, marking the essay as student-written or machine-generated.

### Use Cases

This dataset is suited for:

- Building algorithms that distinguish human-authored from AI-generated text.
- Strengthening academic integrity tools by flagging submissions that may be AI-generated.
- Improving automated essay scoring systems by exposing them to text of both origins.
- Researching natural language understanding, particularly the stylistic and content-based differences between human and AI authors.

### Accessing the Dataset

To load the dataset in Python, use the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Downloads the dataset and returns its train and test splits.
dataset = load_dataset("knarasi1/student_and_llm_essays")
```

### Acknowledgments

This dataset represents a collective effort to foster innovation and uphold integrity in academic writing and research. It reflects the community's commitment to improving interactions between humans and AI in educational settings.

### Disclaimer

Users are urged to employ this resource ethically and responsibly, given its potential impact on educational and research settings. The creators of the dataset and Hugging Face explicitly discourage the misuse of AI-generated text for academic dishonesty or any form of deception.

Provider:

knarasi1

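Summary of original information

Dataset card: Academic Essay Prompt-Completion Pairs

Dataset description

This dataset is designed for distinguishing essays written by students from essays generated by Large Language Models (LLMs), and serves as a resource for researchers and practitioners in natural language processing, educational technology, and academic integrity. It supports the development and evaluation of models that identify the origin of a text, with applications such as improving automated grading systems and detecting AI-generated text in academic submissions.

Dataset overview

The dataset consists of prompt-completion pairs that simulate real-world academic writing scenarios. Each entry uses a standardized format wrapped in `<s>` and `</s>` tags. Inside that wrapper, the prompt sits between `[INST]` and `[/INST]` tags and includes the source text, the essay instructions, and the essay. The completion sits outside the `[INST]` and `[/INST]` tags but still inside the `<s>` and `</s>` wrapper, and states the essay's origin as either "This essay was written by an actual student." or "This essay was generated by a Large Language Model." The resulting task is a classification between student-written and machine-generated essays.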
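
The record layout described above can be split back into a prompt and a label with a small parser. The following is a minimal sketch, assuming each record is available as a single string in exactly the `<s>[INST] ... [/INST] ... </s>` layout. The function name `parse_record` and the label values `student`, `llm`, and `unknown` are illustrative choices, not part of the dataset, and the card does not state which dataset column holds the text, so the function takes a raw string.

```python
import re

# The two completion sentences quoted in the dataset card.
STUDENT = "This essay was written by an actual student."
LLM = "This essay was generated by a Large Language Model."

# Prompt: text between [INST] and [/INST]; completion: text up to </s>.
_RECORD = re.compile(
    r"<s>\s*\[INST\](?P<prompt>.*?)\[/INST\](?P<completion>.*?)</s>",
    re.DOTALL,
)


def parse_record(text):
    """Split one record into prompt, completion, and an illustrative label."""
    match = _RECORD.search(text)
    if match is None:
        raise ValueError("text does not match the <s>[INST] ... [/INST] ... </s> layout")
    prompt = match.group("prompt").strip()
    completion = match.group("completion").strip()
    if completion == STUDENT:
        label = "student"
    elif completion == LLM:
        label = "llm"
    else:
        # Hypothetical fallback for completions that match neither sentence.
        label = "unknown"
    return {"prompt": prompt, "completion": completion, "label": label}
```

Returning `unknown` instead of raising on an unexpected completion is a design choice for exploratory use; a stricter pipeline might prefer to raise so that malformed records are noticed early.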