LouisMreau/ina-french-dedup-synthetic
Source: Hugging Face. Updated 2026-01-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LouisMreau/ina-french-dedup-synthetic
Resource description:
---
language:
- fr
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-classification
tags:
- deduplication
- french
- synthetic
- transcription
pretty_name: French Deduplication Synthetic Dataset
---
# French Deduplication Synthetic Dataset
A synthetic French dataset for training and evaluating text deduplication pipelines, simulating TV/radio transcription archives.
## Dataset Description
- **Rows**: 8,997,319
- **Tokens**: ~357M (estimated)
- **Language**: French
- **Domain**: Simulated TV/radio transcriptions
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `s1` | string | Original sentence |
| `s2` | string/null | Transformed sentence (or null for DISTINCT) |
| `error_type` | int | 0=DISTINCT, 1=EXACT, 2=SIMPLE, 3=HARD |
| `error_details` | string | Specific transformation type |
## Error Types
| Type | Code | Count | % | Description |
|------|------|-------|---|-------------|
| DISTINCT | 0 | 4,486,989 | 49.9% | Unique sentence, s2=null |
| EXACT | 1 | 717,797 | 8.0% | Exact duplicate, s2=s1 |
| SIMPLE | 2 | 1,972,917 | 21.9% | Surface-level errors |
| HARD | 3 | 1,819,616 | 20.2% | Transcription/ASR errors |
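
The counts in the table can be cross-checked against the row total given in the dataset description. The snippet below is a small sanity check that uses only the figures printed in this card; the variable names are illustrative.

```python
# Per-class row counts, copied from the Error Types table.
counts = {
    "DISTINCT": 4_486_989,
    "EXACT": 717_797,
    "SIMPLE": 1_972_917,
    "HARD": 1_819_616,
}

total = sum(counts.values())

# Share of each class, rounded to one decimal as in the table.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Rows that carry a non-null s2, i.e. every class except DISTINCT.
duplicate_rows = total - counts["DISTINCT"]

print(total)
print(shares)
print(duplicate_rows)
```

The four classes sum to the 8,997,319 rows reported above, so roughly half of the rows carry a duplicate pair and half are standalone sentences.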
### SIMPLE Error Subtypes (deterministic Python transforms)
- `casse`: lowercase conversion
- `accents`: accent removal (é→e)
- `chiffres_mots`: digits to words (20→vingt)
- `mots_chiffres`: words to digits (vingt→20)
- `ponctuation`: punctuation removal
- `apostrophes`: apostrophe removal
- `espaces`: spacing errors
- `coquilles`: typos
- `bruit`: [bruit]/[musique] insertion
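
To make the subtypes above concrete, here is a minimal sketch of how four of them could be written. This is not the code used to build the dataset: the function bodies, the punctuation set, and the choice to delete apostrophes outright (rather than replace them with a space) are all assumptions made for illustration.

```python
import unicodedata

# Assumption: illustrative choices, not the dataset's actual rules.
PUNCTUATION = set(".,;:!?")
APOSTROPHES = ("'", "\u2019")  # straight and typographic apostrophes


def casse(text: str) -> str:
    """Lowercase the whole sentence."""
    return text.lower()


def accents(text: str) -> str:
    """Strip combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def ponctuation(text: str) -> str:
    """Drop common punctuation marks, then collapse leftover whitespace."""
    kept = "".join(ch for ch in text if ch not in PUNCTUATION)
    return " ".join(kept.split())


def apostrophes(text: str) -> str:
    """Delete apostrophes (assumed behaviour: no replacement character)."""
    for mark in APOSTROPHES:
        text = text.replace(mark, "")
    return text


sample = "L'été dernier, il a gagné 20 euros !"
for transform in (casse, accents, ponctuation, apostrophes):
    print(f"{transform.__name__}: {transform(sample)}")
```

Because each function is a pure string transform, applying the same ones to both sides of a candidate pair would be one cheap way to normalize text before comparison.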
### HARD Error Subtypes
- `inaudible`: word replaced with [inaudible]
- `troncature`: word truncated with ...
- `omission`: word removed
- `repetition`: word repeated (stuttering)
- `partiel`: sentence cut with ...
- `paraphrase`: semantic paraphrase (LLM-generated)
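
The scripted HARD subtypes all act on a single word position. The sketch below shows one plausible way to write them; it is an assumption-laden illustration, not the dataset's generator. In particular, the three-character prefix kept by `troncature`, the cut point used by `partiel`, and the helper `apply_hard` are hypothetical. `paraphrase` is left out because it is LLM-generated.

```python
import random


def inaudible(sentence: str, index: int) -> str:
    """Replace the word at `index` with the [inaudible] marker."""
    words = sentence.split()
    words[index] = "[inaudible]"
    return " ".join(words)


def troncature(sentence: str, index: int, keep: int = 3) -> str:
    """Keep the first `keep` characters of one word and append '...'."""
    words = sentence.split()
    words[index] = words[index][:keep] + "..."
    return " ".join(words)


def omission(sentence: str, index: int) -> str:
    """Remove the word at `index`."""
    words = sentence.split()
    return " ".join(words[:index] + words[index + 1:])


def repetition(sentence: str, index: int) -> str:
    """Repeat the word at `index`, as in a stutter."""
    words = sentence.split()
    return " ".join(words[:index + 1] + words[index:])


def partiel(sentence: str, index: int) -> str:
    """Cut the sentence after the word at `index` and append '...'."""
    words = sentence.split()
    return " ".join(words[:index + 1]) + "..."


HARD_TRANSFORMS = {
    "inaudible": inaudible,
    "troncature": troncature,
    "omission": omission,
    "repetition": repetition,
    "partiel": partiel,
}


def apply_hard(sentence: str, subtype: str, rng: random.Random) -> str:
    """Hypothetical helper: apply one subtype at a random word position."""
    index = rng.randrange(len(sentence.split()))
    return HARD_TRANSFORMS[subtype](sentence, index)


print(apply_hard("le ministre arrive demain matin", "inaudible", random.Random(0)))
```

Passing a seeded `random.Random` keeps the output reproducible, which matches the "deterministic" framing of the scripted transforms.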
## Usage
```python
from datasets import load_dataset

# Without a `split` argument, load_dataset returns a DatasetDict;
# filter() is then applied to every split it contains.
ds = load_dataset("LouisMreau/ina-french-dedup-synthetic")

# Filter by error type (0=DISTINCT, 1=EXACT, 2=SIMPLE, 3=HARD)
distinct = ds.filter(lambda x: x["error_type"] == 0)
exact_dupes = ds.filter(lambda x: x["error_type"] == 1)
simple_errors = ds.filter(lambda x: x["error_type"] == 2)
hard_errors = ds.filter(lambda x: x["error_type"] == 3)

# Keep only the LLM-generated paraphrases (a HARD subtype)
paraphrases = ds.filter(lambda x: x["error_details"] == "paraphrase")
```
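
Because `s2` is null for DISTINCT rows, code that evaluates a deduplication pipeline has to guard against missing values. One possible approach, sketched below on toy in-memory rows rather than the real dataset, is to flatten each row into one or two texts tagged with the row index, run a deduplicator, and count how many duplicates survive. The helpers `flatten_rows` and `exact_dedup` are hypothetical, and the sketch assumes that `s1` values from different rows are not duplicates of each other, which this card does not state explicitly.

```python
def flatten_rows(rows):
    """Turn rows into (text, group_id) items; same group_id means duplicates."""
    items = []
    for group_id, row in enumerate(rows):
        items.append((row["s1"], group_id))
        if row["s2"] is not None:  # DISTINCT rows have no second sentence
            items.append((row["s2"], group_id))
    return items


def exact_dedup(items):
    """Baseline deduplicator: keep only the first occurrence of each string."""
    seen = set()
    kept = []
    for text, group_id in items:
        if text not in seen:
            seen.add(text)
            kept.append((text, group_id))
    return kept


# Toy rows that follow the schema above (not taken from the dataset).
rows = [
    {"s1": "bonsoir a tous", "s2": None, "error_type": 0},
    {"s1": "il fait beau", "s2": "il fait beau", "error_type": 1},
    {"s1": "Il pleut a Paris", "s2": "il pleut a paris", "error_type": 2},
]

items = flatten_rows(rows)
kept = exact_dedup(items)

# A perfect deduplicator would keep exactly one text per row.
missed = len(kept) - len(rows)
print(f"{len(items)} texts, {len(kept)} kept, {missed} duplicate(s) missed")
```

Exact matching removes the EXACT pair but misses the SIMPLE one, which is the gap the SIMPLE and HARD classes are meant to expose.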
## Generation Method
1. **s1 generation**: LLM (Ministral-3B) with procedural context injection (67.5M combinations from 5 dictionaries)
2. **s2 for SIMPLE and scripted HARD subtypes**: Deterministic Python transforms
3. **s2 for paraphrase**: Two-pass LLM with few-shot prompting (99.6% success rate)
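
Step 1 can be pictured as drawing one entry from each of several dictionaries and injecting the result into the LLM prompt. The number of possible contexts is then the product of the dictionary sizes. The sketch below illustrates that idea only: the dictionary names, their entries, and the prompt template are hypothetical placeholders, since the card does not list the five dictionaries used to reach 67.5M combinations.

```python
import math
import random

# Hypothetical dictionaries; the real ones are not described in this card.
DICTIONARIES = {
    "programme": ["journal televise", "emission de radio"],
    "sujet": ["meteo", "sport", "politique"],
    "lieu": ["Paris", "Lyon"],
    "locuteur": ["presentateur", "invite"],
    "ton": ["neutre", "enthousiaste"],
}


def count_combinations(dictionaries) -> int:
    """Number of distinct contexts: product of the dictionary sizes."""
    return math.prod(len(values) for values in dictionaries.values())


def sample_context(dictionaries, rng: random.Random) -> dict:
    """Draw one entry from each dictionary."""
    return {name: rng.choice(values) for name, values in dictionaries.items()}


def build_prompt(context: dict) -> str:
    """Hypothetical prompt template that injects the sampled context."""
    details = ", ".join(f"{name}: {value}" for name, value in context.items())
    return f"Generate one French transcription sentence. Context: {details}"


rng = random.Random(42)
context = sample_context(DICTIONARIES, rng)
print(count_combinations(DICTIONARIES))
print(build_prompt(context))
```

With these toy sizes the product is 2 × 3 × 2 × 2 × 2 = 48 contexts; larger dictionaries scale the same way.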
## Citation
```
@dataset{french_dedup_synthetic,
title={French Deduplication Synthetic Dataset},
year={2026},
url={https://huggingface.co/datasets/LouisMreau/ina-french-dedup-synthetic}
}
```
Provider:
LouisMreau



