DerekChai/lmsys-chat-privacy-20k
收藏Hugging Face2026-04-17 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/DerekChai/lmsys-chat-privacy-20k
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
dataset_info:
features:
- name: id
dtype: string
- name: conversation
dtype: string
- name: conversation_model
dtype: string
- name: language
dtype: string
- name: redacted
dtype: bool
- name: turn
dtype: int64
- name: fine_detection
struct:
- name: is_sensitive
dtype: int64
- name: model
dtype: string
- name: privacy_labels
dtype: string
- name: reasons
dtype: string
- name: coarse_detection
struct:
- name: is_sensitive
dtype: int64
- name: model
dtype: string
- name: reasons
dtype: string
- name: noise_detection
struct:
- name: avg_logprob
dtype: float64
- name: is_noise
dtype: bool
- name: model
dtype: string
- name: reasons
dtype: string
- name: qwen3.5-plus
dtype: string
- name: openai/gpt-oss-safeguard-120b
dtype: string
- name: MiniMaxAI/MiniMax-M2.5
dtype: string
- name: glm-4.7-fp8
dtype: string
- name: ground_truth
dtype: string
splits:
- name: train
num_bytes: 1248502073
num_examples: 20931
download_size: 482416784
dataset_size: 1248502073
---
# Lethe: LMSYS-Chat-1M Privacy Dataset
## Dataset Summary
**Lethe** is a curated subset of the [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset, specifically designed for privacy information detection research in open-domain conversational data. It contains **20K high-quality conversations** where privacy-sensitive information was systematically identified, categorized, and validated through a multi-stage filtering pipeline and multi-model majority voting.
The dataset provides fine-grained privacy labels across **15 privacy categories** and **5 subject categories**, with exact substring annotations (`leaked_text`) and contextual snippets to support entity localization tasks.
## Dataset Creation Pipeline
The dataset was constructed through a rigorous multi-stage pipeline:
| Step | Samples | Retention Rate | Description |
| -------------------------------- | --------- | -------------- | ------------------------------------------------------------ |
| **0. Source Data** | 1,000,000 | 100% | Full `lmsys-chat-1m` training split. |
| **1. Coarse Detection** | 74,090 | 7.4% | Coarse-grained binary privacy classification. Samples with any forms of potential privacy leakages are kept for a high recall. (using get-oss-safeguard 120B) |
| **2. Noise Filter** | 61,913 | 83.6% | Removed harmful/inappropriate content (violence, sexual content, hate speech, etc.) — **12,177 samples filtered out**, using get-oss-safeguard 20B. |
| **3. Fine Detection & Curation** | 20,931 | 33.8% | Fine-grained identification using Qwen3-max and only samples with real privacy leakage are preserved. |
| **4. Aggregation (`voted`)** | 20,931 | 100% | Multi-model majority voting by 3 LLMs (Qwen3.5 Plus, GPT-oss-safeguard 120B, and GLM-4.7-FP8) to produce high-confidence `ground_truth` labels. |
## Supported Tasks
- **Privacy Leak Detection**: Binary and multi-class classification of privacy-sensitive content in dialogues.
- **Named Entity Recognition (PII-NER)**: Exact substring extraction of leaked personal information.
- **Privacy Category Classification**: Fine-grained categorization into 15 privacy domains.
- **Subject Attribution**: Identifying who the leaked information belongs to (self, close contacts, professionals, etc.).
- **Conversation Safety & Moderation**: Benchmarking noise and harmful content detection in real user-LLM interactions.
## Dataset Structure
### Data Fields
#### Core Fields
| Field | Type | Description |
| -------------------- | --------------- | ------------------------------------------------------------ |
| `id` | `string` | Unique conversation identifier (aligned with `conversation_id` in the source dataset). |
| `conversation` | `string` (JSON) | The full conversation as a JSON-stringified list of `{role, content}` turns. |
| `conversation_model` | `string` | The LLM model that generated the assistant responses in the original conversation. |
| `language` | `string` | Detected language code (e.g., `English`). |
| `redacted` | `bool` | Whether the original conversation contained redacted placeholders (e.g., `NAME_1`). |
| `turn` | `int` | Number of turns in the conversation. |
| `fine_detection` | `struct` | Structured output from fine-grained privacy detection. Contains `is_sensitive`, `privacy_labels` (JSON string), and `reasons`. |
| `coarse_detection` | `struct` | Coarse binary detection result (`is_sensitive`, `reasons`, `model`, `raw_output`). |
| `noise_detection` | `struct` | Noise filter result (`is_noise`, `reasons`, `avg_logprob`). |
#### Annotation Fields
| Field | Type | Description |
| ------------------------------- | -------- | ------------------------------------------------------------ |
| `qwen3.5-plus` | `struct` | Labels generated by Qwen 3.5 Plus. |
| `openai/gpt-oss-safeguard-120b` | `struct` | Labels generated by GPT-OSS Safeguard 120B. |
| `glm-4.7-fp8` | `struct` | Labels generated by GLM-4.7-FP8. |
| `ground_truth` | `struct` | **Majority-vote consensus** across the three annotator models. |
### `ground_truth` Schema
```json
{
"leaks": [
{
"leaked_text": "Alice Johnson",
"context_snippet": "my name is Alice Johnson and I live in",
"privacy_category": "C01",
"subject_category": "S01"
}
]
}
```
### Category Taxonomy
#### Privacy Categories (`privacy_category`)
| Code | Name |
| ----- | --------------------------------------- |
| `C01` | Core Identity & Biographics |
| `C02` | Contact Information |
| `C03` | Physical Location Data |
| `C04` | Health & Medical Records |
| `C05` | Financial & Transactional Data |
| `C06` | Legal & Government Records |
| `C07` | Professional & Educational History |
| `C08` | Digital Authentication Secrets |
| `C09` | Personal Beliefs & Affiliations |
| `C10` | Lifestyle Choices & Preferences |
| `C11` | Social Relationships & Household |
| `C12` | Behavioral Patterns & Digital Footprint |
| `C13` | System & Device Environment |
| `C14` | Private Content & Communications |
| `C15` | Others |
#### Subject Categories (`subject_category`)
| Code | Name | Description |
| ----- | -------------------- | ------------------------------------------------------------ |
| `S01` | Self | The speaker/author of the text. |
| `S02` | Close Contact | Family members, partners, close friends. |
| `S03` | Professional Contact | Colleagues, bosses, clients, doctors. |
| `S04` | Casual Contact | Acquaintances, service staff, strangers mentioned in passing. |
| `S05` | Third Party | Organizations, companies, or public figures. |
## Data Splits
This dataset contains a single split:
| Split | Num Examples |
| ------- | ------------ |
| `train` | 20,931 |
All samples passed through the full curation and aggregation pipeline. There is no separate validation or test split; researchers are encouraged to create their own stratified splits based on privacy categories or subject categories.
## Data Collection and Curation
### Source Data
- **Original Dataset**: [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
- **License**: The original dataset is released under CC BY-NC 4.0. This curated derivative respects the original license terms.
- **Collection Method**: Real user-LLM conversations collected through the LMSYS Chatbot Arena platform.
### Annotation Process
1. **Coarse Detection**: A smaller local model (via vLLM) performed high-recall binary classification on all 1M conversations.
2. **Noise Filtering**: A vLLM-based classifier removed conversations containing violence, sexual content, hate speech, illegal activities, or manipulative behavior.
3. **Fine-Grained Detection & Curation**: A more capable model generated structured JSON annotations for each remaining conversation, identifying subjects, privacy categories, and specific leaked attributes. Only conversations with `is_sensitive=1` and attributes that cleanly map to 15 core privacy categories were retained.
4. **Multi-Model Labeling**: Three distinct LLMs (Qwen 3.5 Plus, GPT-OSS 120B, glm-4.7) independently annotated the curated `v1` set using the same `PRIVACY_LABEL_PROMPT_JSON` schema.
5. **Majority Voting**: The `ground_truth` field was derived via fuzzy string matching (LCS) on `leaked_text` combined with exact category matching, requiring agreement from **≥2 models**.
### Evaluation Metrics
The dataset supports four levels of evaluation:
1. **Binary Detection**: Sample-level accuracy / precision / recall / F1 (`has leak` vs `no leak`).
2. **Text Identification**: Character-level LCS overlap on `leaked_text`.
3. **Privacy Categories**: Text overlap + exact `privacy_category` match.
4. **Subject Categories**: Text overlap + exact `privacy_category` + `subject_category` match.
## Personal and Sensitive Information
**This dataset explicitly contains personal and sensitive information.** By design, it includes conversations where users disclosed real or realistic PII (names, locations, health conditions, financial details, etc.) to LLMs.
- All data is derived from **already-public** research datasets (`lmsys-chat-1m`).
- No new data was collected from individuals for this curation effort.
- The original source dataset performed automated redaction of explicit names (e.g., `NAME_1` placeholders), but users may still disclose identifying information in free-form text that was not fully redacted.
- **Use with caution**: This dataset is intended for research into privacy detection, data loss prevention, and conversation safety. It should **not** be used to train models for extracting PII from private conversations without ethical oversight.
## Considerations for Using the Data
### Social Impact
This dataset enables researchers to:
- Develop better **privacy guardrails** for LLMs.
- Benchmark **PII detection** models on realistic user-LLM interactions.
- Study how users inadvertently disclose sensitive information in open-domain conversations.
### Limitations
- **English-centric**: The majority of labeled samples are in English.
- **Annotation Noise**: Despite multi-model voting, some labels may still contain hallucinations or imprecise boundaries. The `ground_truth` represents consensus, not absolute fact.
- **Temporal Drift**: The source data was collected up to a certain point in time; newer conversation patterns or privacy norms may not be fully represented.
### Licensing
This derivative dataset is released under the same terms as the source data: **CC BY-NC 4.0** (or as specified by the original `lmsys-chat-1m` license). The code used to generate it is licensed under **Apache-2.0**.
提供机构:
DerekChai



