neelgupta2112/Wildchat-1M-English-Fiction-Labels

Name: neelgupta2112/Wildchat-1M-English-Fiction-Labels
Creator: neelgupta2112
Published: 2026-04-03 20:18:36
License: No description available

Hugging Face, updated 2026-04-03, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/neelgupta2112/Wildchat-1M-English-Fiction-Labels

Resource summary:

# Dataset Card for Wildchat-1M-English-Fiction-Labels

This dataset card describes the dataset that accompanies the AI Fiction in the Wild(chat) paper. The dataset presents over 500K English Wildchat conversations that have been labelled by an LLM along three axes: fiction, fanfiction, and sexually explicit content. The goal of this project is to understand the types of fiction being generated by real LLM users.

We start from the original [**Wildchat dataset**](https://huggingface.co/datasets/allenai/WildChat-1M-Full?not-for-all-audiences=true), filter it down to English-language conversations, and then run our language-model labelling over the result. We use the following prompt to generate our three additional columns.

## Prompt Used for Classification Task

```
You are tasked with labeling a conversation between a user and chatbot based on three axes: fiction, fanfiction, and sexually explicit content.

### Definitions and Clarifications:

**Fiction**: The user asks the chatbot to produce content that is imaginative, speculative, or not grounded in real-world facts. This includes creating original stories, speculative scenarios, or alternate histories. Purely descriptive or encyclopedic prompts about existing fictional properties *do not count as fiction*.

**Fanfiction**: A type of fiction that uses characters, settings, or plot elements from *existing intellectual properties* (IPs), including but not limited to TV shows, movies, video games, anime, or books.

Examples of fanfiction:
- “Sonic and Pikachu explore Hogwarts.”
- “What if Darth Vader fought Goku?”

**Sexually explicit**: The user attempting to prompt the chatbot into erotic content.
This includes:
- Direct sexual acts or dialogue
- Heavy innuendo or clearly suggestive situations
- Conversations where the chatbot does not carry out the erotic prompting, but the intent in the prompt is still there.

### Label each conversation accurately. If unsure, err on the side of inclusiveness.

---

Example Input:
Conversation: "USER: What if Arya was a Lady?
CHATBOT: If Arya was a Lady, it would change her story quite drastically. As a Lady, she would be expected to follow traditional gender roles and marry into a powerful family to secure political alliances. She would also be expected to focus on learning practical skills such as sewing, dancing, and music, rather than swordplay and archery. However, given Arya's rebellious nature and her desire to break free from societal norms, it is possible that she would resist these expectations and continue to pursue her interest in combat and adventure. She might face more pushback and resistance from her family and society, but she could still find a way to achieve her goals and remain true to herself. Overall, while being a Lady would certainly change Arya's story, it might not drastically alter her character or resolve."

Example Output:
{
  "is fiction": true,
  "is fanfiction": true,
  "is sexually explicit": false
}

---

Instructions:
Now, here is the real input:
```

We used GPT-4o-mini to label all conversations. Against a sample of 300 conversations that we labelled by hand, the model achieved the following metrics.

| Category   | Precision | Recall |   F1 |
|:-----------|----------:|-------:|-----:|
| Fiction    |      0.97 |   0.94 | 0.95 |
| Fanfiction |      0.94 |   0.85 | 0.89 |
| Explicit   |      0.84 |   0.69 | 0.76 |

## Dataset Details

### Dataset Description

- **Paper:**
- **Languages**:
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

### Personal and Sensitive Information

The data has been de-identified with Microsoft Presidio and hand-written rules by the original Wildchat authors.

### Data Fields

We inherit all fields from the Wildchat-1M-Full dataset and add three of our own.

- fiction_gpt: Whether GPT-4o-mini labelled the first three turns of the conversation as fiction.
- fanfiction_gpt: Whether GPT-4o-mini labelled the first three turns of the conversation as fanfiction.
- explicit_gpt: Whether GPT-4o-mini labelled the first three turns of the conversation as explicit.

### Citation Information

Please cite our paper if you use this dataset or found it useful.
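
As a quick consistency check on the metrics reported in this card, the F1 column can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below does this for the three categories. It also shows one possible way to map the JSON keys from the prompt's example output onto the three added columns. The helper names (`f1_score`, `parse_labels`) and the key-to-column mapping are assumptions made for illustration; the card does not describe the authors' post-processing code.

```python
import json

# Precision, recall, and F1 as reported in the card's metrics table.
reported = {
    "Fiction": (0.97, 0.94, 0.95),
    "Fanfiction": (0.94, 0.85, 0.89),
    "Explicit": (0.84, 0.69, 0.76),
}


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed mapping from the prompt's output keys to the dataset columns.
# The card lists both sets of names but does not state how they are linked.
KEY_TO_COLUMN = {
    "is fiction": "fiction_gpt",
    "is fanfiction": "fanfiction_gpt",
    "is sexually explicit": "explicit_gpt",
}


def parse_labels(raw: str) -> dict:
    """Parse a reply shaped like the prompt's example output into column values."""
    parsed = json.loads(raw)
    return {column: bool(parsed[key]) for key, column in KEY_TO_COLUMN.items()}


# Recompute F1 for each category, rounded to the table's two decimal places.
recomputed = {
    name: round(f1_score(p, r), 2) for name, (p, r, _) in reported.items()
}

# The example output from the prompt, written on one line.
example_reply = (
    '{"is fiction": true, "is fanfiction": true, "is sexually explicit": false}'
)
labels = parse_labels(example_reply)

if __name__ == "__main__":
    for name, (_, _, f1) in reported.items():
        print(f"{name}: reported F1 = {f1:.2f}, recomputed = {recomputed[name]:.2f}")
    print(labels)
```

The recomputed values agree with the reported F1 column to two decimal places for all three categories.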

Provided by:

neelgupta2112
