
LouisMreau/ina-french-dedup-synthetic

Source: Hugging Face. Updated 2026-01-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LouisMreau/ina-french-dedup-synthetic
Resource description:
---
language:
- fr
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-classification
tags:
- deduplication
- french
- synthetic
- transcription
pretty_name: French Deduplication Synthetic Dataset
---

# French Deduplication Synthetic Dataset

A synthetic French dataset for training and evaluating text deduplication pipelines, simulating TV/radio transcription archives.

## Dataset Description

- **Rows**: 8,997,319
- **Tokens**: ~357M (estimated)
- **Language**: French
- **Domain**: Simulated TV/radio transcriptions

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `s1` | string | Original sentence |
| `s2` | string/null | Transformed sentence (or null for DISTINCT) |
| `error_type` | int | 0=DISTINCT, 1=EXACT, 2=SIMPLE, 3=HARD |
| `error_details` | string | Specific transformation type |

## Error Types

| Type | Code | Count | % | Description |
|------|------|-------|---|-------------|
| DISTINCT | 0 | 4,486,989 | 49.9% | Unique sentence, s2=null |
| EXACT | 1 | 717,797 | 8.0% | Exact duplicate, s2=s1 |
| SIMPLE | 2 | 1,972,917 | 21.9% | Surface-level errors |
| HARD | 3 | 1,819,616 | 20.2% | Transcription/ASR errors |

### SIMPLE Error Subtypes (deterministic Python transforms)

- `casse`: lowercase conversion
- `accents`: accent removal (é→e)
- `chiffres_mots`: digits to words (20→vingt)
- `mots_chiffres`: words to digits (vingt→20)
- `ponctuation`: punctuation removal
- `apostrophes`: apostrophe removal
- `espaces`: spacing errors
- `coquilles`: typos
- `bruit`: [bruit]/[musique] insertion

### HARD Error Subtypes

- `inaudible`: word replaced with [inaudible]
- `troncature`: word truncated with ...
- `omission`: word removed
- `repetition`: word repeated (stuttering)
- `partiel`: sentence cut with ...
- `paraphrase`: semantic paraphrase (LLM-generated)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("LouisMreau/ina-french-dedup-synthetic")

# Filter by error type
distinct = ds.filter(lambda x: x["error_type"] == 0)
exact_dupes = ds.filter(lambda x: x["error_type"] == 1)
simple_errors = ds.filter(lambda x: x["error_type"] == 2)
hard_errors = ds.filter(lambda x: x["error_type"] == 3)

# Get paraphrases only
paraphrases = ds.filter(lambda x: x["error_details"] == "paraphrase")
```

## Generation Method

1. **s1 generation**: LLM (Ministral-3B) with procedural context injection (67.5M combinations from 5 dictionaries)
2. **s2 for SIMPLE/HARD scripted**: Deterministic Python transforms
3. **s2 for paraphrase**: Two-pass LLM with few-shot prompting (99.6% success rate)

## Citation

```
@dataset{french_dedup_synthetic,
  title={French Deduplication Synthetic Dataset},
  year={2026},
  url={https://huggingface.co/datasets/LouisMreau/ina-french-dedup-synthetic}
}
```
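
The card describes the SIMPLE subtypes as deterministic Python transforms but does not include their code. The sketch below shows what three of them (`casse`, `accents`, `ponctuation`) might look like, using only the standard library. The function bodies and the `normalize_key` helper are assumptions made for illustration, not the dataset author's implementation.

```python
import string
import unicodedata


def casse(text: str) -> str:
    """Lowercase conversion (assumed behaviour of the `casse` subtype)."""
    return text.lower()


def accents(text: str) -> str:
    """Accent removal (é -> e): decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ponctuation(text: str) -> str:
    """Punctuation removal. Simplification: ASCII punctuation only, which
    also strips the straight apostrophe covered separately by `apostrophes`."""
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize_key(text: str) -> str:
    """Hypothetical dedup key that cancels case, accent, punctuation and
    spacing noise, so SIMPLE variants of a sentence may collide."""
    return " ".join(ponctuation(accents(casse(text))).split())


s1 = "Le président a parlé à 20 heures, hier."
s2 = accents(s1)  # simulated SIMPLE error: "Le president a parle a 20 heures, hier."

print(s2)
print(normalize_key(s1) == normalize_key(s2))  # both reduce to the same key
```

A key like this would only be expected to catch some SIMPLE subtypes. Number rewrites (`chiffres_mots`, `mots_chiffres`), typos (`coquilles`) and all HARD subtypes change the token sequence itself and would need fuzzy or semantic matching instead.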
Provider:
LouisMreau