five

DerekChai/lmsys-chat-privacy-20k

收藏
Hugging Face2026-04-17 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/DerekChai/lmsys-chat-privacy-20k
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 configs: - config_name: default data_files: - split: train path: data/train-* dataset_info: features: - name: id dtype: string - name: conversation dtype: string - name: conversation_model dtype: string - name: language dtype: string - name: redacted dtype: bool - name: turn dtype: int64 - name: fine_detection struct: - name: is_sensitive dtype: int64 - name: model dtype: string - name: privacy_labels dtype: string - name: reasons dtype: string - name: coarse_detection struct: - name: is_sensitive dtype: int64 - name: model dtype: string - name: reasons dtype: string - name: noise_detection struct: - name: avg_logprob dtype: float64 - name: is_noise dtype: bool - name: model dtype: string - name: reasons dtype: string - name: qwen3.5-plus dtype: string - name: openai/gpt-oss-safeguard-120b dtype: string - name: MiniMaxAI/MiniMax-M2.5 dtype: string - name: glm-4.7-fp8 dtype: string - name: ground_truth dtype: string splits: - name: train num_bytes: 1248502073 num_examples: 20931 download_size: 482416784 dataset_size: 1248502073 --- # Lethe: LMSYS-Chat-1M Privacy Dataset ## Dataset Summary **Lethe** is a curated subset of the [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset, specifically designed for privacy information detection research in open-domain conversational data. It contains **20K high-quality conversations** where privacy-sensitive information was systematically identified, categorized, and validated through a multi-stage filtering pipeline and multi-model majority voting. The dataset provides fine-grained privacy labels across **15 privacy categories** and **5 subject categories**, with exact substring annotations (`leaked_text`) and contextual snippets to support entity localization tasks. ## Dataset Creation Pipeline The dataset was constructed through a rigorous multi-stage pipeline: | Step | Samples | Retention Rate | Description | | -------------------------------- | --------- | -------------- | ------------------------------------------------------------ | | **0. Source Data** | 1,000,000 | 100% | Full `lmsys-chat-1m` training split. | | **1. Coarse Detection** | 74,090 | 7.4% | Coarse-grained binary privacy classification. Samples with any forms of potential privacy leakages are kept for a high recall. (using get-oss-safeguard 120B) | | **2. Noise Filter** | 61,913 | 83.6% | Removed harmful/inappropriate content (violence, sexual content, hate speech, etc.) — **12,177 samples filtered out**, using get-oss-safeguard 20B. | | **3. Fine Detection & Curation** | 20,931 | 33.8% | Fine-grained identification using Qwen3-max and only samples with real privacy leakage are preserved. | | **4. Aggregation (`voted`)** | 20,931 | 100% | Multi-model majority voting by 3 LLMs (Qwen3.5 Plus, GPT-oss-safeguard 120B, and GLM-4.7-FP8) to produce high-confidence `ground_truth` labels. | ## Supported Tasks - **Privacy Leak Detection**: Binary and multi-class classification of privacy-sensitive content in dialogues. - **Named Entity Recognition (PII-NER)**: Exact substring extraction of leaked personal information. - **Privacy Category Classification**: Fine-grained categorization into 15 privacy domains. - **Subject Attribution**: Identifying who the leaked information belongs to (self, close contacts, professionals, etc.). - **Conversation Safety & Moderation**: Benchmarking noise and harmful content detection in real user-LLM interactions. ## Dataset Structure ### Data Fields #### Core Fields | Field | Type | Description | | -------------------- | --------------- | ------------------------------------------------------------ | | `id` | `string` | Unique conversation identifier (aligned with `conversation_id` in the source dataset). | | `conversation` | `string` (JSON) | The full conversation as a JSON-stringified list of `{role, content}` turns. | | `conversation_model` | `string` | The LLM model that generated the assistant responses in the original conversation. | | `language` | `string` | Detected language code (e.g., `English`). | | `redacted` | `bool` | Whether the original conversation contained redacted placeholders (e.g., `NAME_1`). | | `turn` | `int` | Number of turns in the conversation. | | `fine_detection` | `struct` | Structured output from fine-grained privacy detection. Contains `is_sensitive`, `privacy_labels` (JSON string), and `reasons`. | | `coarse_detection` | `struct` | Coarse binary detection result (`is_sensitive`, `reasons`, `model`, `raw_output`). | | `noise_detection` | `struct` | Noise filter result (`is_noise`, `reasons`, `avg_logprob`). | #### Annotation Fields | Field | Type | Description | | ------------------------------- | -------- | ------------------------------------------------------------ | | `qwen3.5-plus` | `struct` | Labels generated by Qwen 3.5 Plus. | | `openai/gpt-oss-safeguard-120b` | `struct` | Labels generated by GPT-OSS Safeguard 120B. | | `glm-4.7-fp8` | `struct` | Labels generated by GLM-4.7-FP8. | | `ground_truth` | `struct` | **Majority-vote consensus** across the three annotator models. | ### `ground_truth` Schema ```json { "leaks": [ { "leaked_text": "Alice Johnson", "context_snippet": "my name is Alice Johnson and I live in", "privacy_category": "C01", "subject_category": "S01" } ] } ``` ### Category Taxonomy #### Privacy Categories (`privacy_category`) | Code | Name | | ----- | --------------------------------------- | | `C01` | Core Identity & Biographics | | `C02` | Contact Information | | `C03` | Physical Location Data | | `C04` | Health & Medical Records | | `C05` | Financial & Transactional Data | | `C06` | Legal & Government Records | | `C07` | Professional & Educational History | | `C08` | Digital Authentication Secrets | | `C09` | Personal Beliefs & Affiliations | | `C10` | Lifestyle Choices & Preferences | | `C11` | Social Relationships & Household | | `C12` | Behavioral Patterns & Digital Footprint | | `C13` | System & Device Environment | | `C14` | Private Content & Communications | | `C15` | Others | #### Subject Categories (`subject_category`) | Code | Name | Description | | ----- | -------------------- | ------------------------------------------------------------ | | `S01` | Self | The speaker/author of the text. | | `S02` | Close Contact | Family members, partners, close friends. | | `S03` | Professional Contact | Colleagues, bosses, clients, doctors. | | `S04` | Casual Contact | Acquaintances, service staff, strangers mentioned in passing. | | `S05` | Third Party | Organizations, companies, or public figures. | ## Data Splits This dataset contains a single split: | Split | Num Examples | | ------- | ------------ | | `train` | 20,931 | All samples passed through the full curation and aggregation pipeline. There is no separate validation or test split; researchers are encouraged to create their own stratified splits based on privacy categories or subject categories. ## Data Collection and Curation ### Source Data - **Original Dataset**: [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) - **License**: The original dataset is released under CC BY-NC 4.0. This curated derivative respects the original license terms. - **Collection Method**: Real user-LLM conversations collected through the LMSYS Chatbot Arena platform. ### Annotation Process 1. **Coarse Detection**: A smaller local model (via vLLM) performed high-recall binary classification on all 1M conversations. 2. **Noise Filtering**: A vLLM-based classifier removed conversations containing violence, sexual content, hate speech, illegal activities, or manipulative behavior. 3. **Fine-Grained Detection & Curation**: A more capable model generated structured JSON annotations for each remaining conversation, identifying subjects, privacy categories, and specific leaked attributes. Only conversations with `is_sensitive=1` and attributes that cleanly map to 15 core privacy categories were retained. 4. **Multi-Model Labeling**: Three distinct LLMs (Qwen 3.5 Plus, GPT-OSS 120B, glm-4.7) independently annotated the curated `v1` set using the same `PRIVACY_LABEL_PROMPT_JSON` schema. 5. **Majority Voting**: The `ground_truth` field was derived via fuzzy string matching (LCS) on `leaked_text` combined with exact category matching, requiring agreement from **≥2 models**. ### Evaluation Metrics The dataset supports four levels of evaluation: 1. **Binary Detection**: Sample-level accuracy / precision / recall / F1 (`has leak` vs `no leak`). 2. **Text Identification**: Character-level LCS overlap on `leaked_text`. 3. **Privacy Categories**: Text overlap + exact `privacy_category` match. 4. **Subject Categories**: Text overlap + exact `privacy_category` + `subject_category` match. ## Personal and Sensitive Information **This dataset explicitly contains personal and sensitive information.** By design, it includes conversations where users disclosed real or realistic PII (names, locations, health conditions, financial details, etc.) to LLMs. - All data is derived from **already-public** research datasets (`lmsys-chat-1m`). - No new data was collected from individuals for this curation effort. - The original source dataset performed automated redaction of explicit names (e.g., `NAME_1` placeholders), but users may still disclose identifying information in free-form text that was not fully redacted. - **Use with caution**: This dataset is intended for research into privacy detection, data loss prevention, and conversation safety. It should **not** be used to train models for extracting PII from private conversations without ethical oversight. ## Considerations for Using the Data ### Social Impact This dataset enables researchers to: - Develop better **privacy guardrails** for LLMs. - Benchmark **PII detection** models on realistic user-LLM interactions. - Study how users inadvertently disclose sensitive information in open-domain conversations. ### Limitations - **English-centric**: The majority of labeled samples are in English. - **Annotation Noise**: Despite multi-model voting, some labels may still contain hallucinations or imprecise boundaries. The `ground_truth` represents consensus, not absolute fact. - **Temporal Drift**: The source data was collected up to a certain point in time; newer conversation patterns or privacy norms may not be fully represented. ### Licensing This derivative dataset is released under the same terms as the source data: **CC BY-NC 4.0** (or as specified by the original `lmsys-chat-1m` license). The code used to generate it is licensed under **Apache-2.0**.
提供机构:
DerekChai
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作