LouisMreau/ina-french-dedup-synthetic

Name: LouisMreau/ina-french-dedup-synthetic
Creator: LouisMreau
Published: 2026-01-17 10:35:36
License: MIT (as declared in the dataset card metadata)

Source: Hugging Face, updated 2026-01-17, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/LouisMreau/ina-french-dedup-synthetic

Description:

---
language:
- fr
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-classification
tags:
- deduplication
- french
- synthetic
- transcription
pretty_name: French Deduplication Synthetic Dataset
---

# French Deduplication Synthetic Dataset

A synthetic French dataset for training and evaluating text deduplication pipelines, simulating TV/radio transcription archives.

## Dataset Description

- **Rows**: 8,997,319
- **Tokens**: ~357M (estimated)
- **Language**: French
- **Domain**: Simulated TV/radio transcriptions

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `s1` | string | Original sentence |
| `s2` | string/null | Transformed sentence (or null for DISTINCT) |
| `error_type` | int | 0=DISTINCT, 1=EXACT, 2=SIMPLE, 3=HARD |
| `error_details` | string | Specific transformation type |

## Error Types

| Type | Code | Count | % | Description |
|------|------|-------|---|-------------|
| DISTINCT | 0 | 4,486,989 | 49.9% | Unique sentence, s2=null |
| EXACT | 1 | 717,797 | 8.0% | Exact duplicate, s2=s1 |
| SIMPLE | 2 | 1,972,917 | 21.9% | Surface-level errors |
| HARD | 3 | 1,819,616 | 20.2% | Transcription/ASR errors |

### SIMPLE Error Subtypes (deterministic Python transforms)

- `casse`: lowercase conversion
- `accents`: accent removal (é→e)
- `chiffres_mots`: digits to words (20→vingt)
- `mots_chiffres`: words to digits (vingt→20)
- `ponctuation`: punctuation removal
- `apostrophes`: apostrophe removal
- `espaces`: spacing errors
- `coquilles`: typos
- `bruit`: [bruit]/[musique] insertion

### HARD Error Subtypes

- `inaudible`: word replaced with [inaudible]
- `troncature`: word truncated with ...
- `omission`: word removed
- `repetition`: word repeated (stuttering)
- `partiel`: sentence cut with ...
- `paraphrase`: semantic paraphrase (LLM-generated)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("LouisMreau/ina-french-dedup-synthetic")

# Filter by error type
distinct = ds.filter(lambda x: x["error_type"] == 0)
exact_dupes = ds.filter(lambda x: x["error_type"] == 1)
simple_errors = ds.filter(lambda x: x["error_type"] == 2)
hard_errors = ds.filter(lambda x: x["error_type"] == 3)

# Get paraphrases only
paraphrases = ds.filter(lambda x: x["error_details"] == "paraphrase")
```

## Generation Method

1. **s1 generation**: LLM (Ministral-3B) with procedural context injection (67.5M combinations from 5 dictionaries)
2. **s2 for SIMPLE/HARD scripted**: Deterministic Python transforms
3. **s2 for paraphrase**: Two-pass LLM with few-shot prompting (99.6% success rate)

## Citation

```
@dataset{french_dedup_synthetic,
  title={French Deduplication Synthetic Dataset},
  year={2026},
  url={https://huggingface.co/datasets/LouisMreau/ina-french-dedup-synthetic}
}
```
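## Example: Sketch of SIMPLE Transforms

The SIMPLE subtypes listed above are described as deterministic Python transforms, but the card does not include their code. The sketch below shows one plausible way to write three of them (`casse`, `accents`, `ponctuation`) with the standard library only. The function bodies, the `normalize` helper, and the sample sentence are assumptions made for illustration and are not the dataset's actual generation code.

```python
import string
import unicodedata


def casse(s: str) -> str:
    """Lowercase conversion (assumed behaviour of the `casse` subtype)."""
    return s.lower()


def accents(s: str) -> str:
    """Accent removal (é -> e): decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ponctuation(s: str) -> str:
    """Punctuation removal, limited here to ASCII punctuation characters."""
    return s.translate(str.maketrans("", "", string.punctuation))


def normalize(s: str) -> str:
    """Hypothetical dedup key: apply all three transforms, collapse spaces."""
    return " ".join(ponctuation(accents(casse(s))).split())


# Illustrative sentence, not taken from the dataset.
s1 = "Bonjour, il fait très beau à Paris !"
s2 = accents(s1)  # simulated SIMPLE pair of subtype `accents`

print(s2)
print(normalize(s1) == normalize(s2))
```

Comparing normalized keys in this way would be expected to catch pairs that differ only by case, accents, or ASCII punctuation. It does not address the other SIMPLE subtypes (number conversions, typos, noise tags) or any HARD subtype, which need fuzzier matching.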

Provided by:

LouisMreau
