five

ChaoticEconomist/EnglishtoFrench-Translation-Dataset

收藏
Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/ChaoticEconomist/EnglishtoFrench-Translation-Dataset
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en - fr license: cc-by-4.0 task_categories: - translation - text-generation tags: - translation - french - english - sft - lora - instruction-tuning - alpaca-format - synthetic size_categories: - 10K<n<100K configs: - config_name: default data_files: - split: train path: data/train.csv - split: validation path: data/validation.csv - split: test path: data/test.csv pretty_name: Eng2Fren --- # English–French Translation Dataset (SFT / LoRA Ready) A clean, structured dataset of **50,000 English–French sentence pairs** designed for supervised fine-tuning (SFT) of large language models, LoRA adapters, and general machine translation tasks. --- ## Overview | Property | Value | |------------------|---------------------------------| | Language pair | English → French | | Total rows | 50,000 | | Train split | 45,000 (90%) | | Validation split | 2,500 (5%) | | Test split | 2,500 (5%) | | Format | CSV (Alpaca-style prompt format) | | License | CC BY 4.0 | --- ## Dataset Description This dataset covers a wide range of everyday topics and linguistic structures, from simple greetings and common phrases to more complex sentences involving work, travel, health, and culture. Sentences are grammatically correct in both English and French, with proper handling of: - French gender agreement (masculine/feminine adjectives and articles) - Elision rules (`Je` → `J'` before vowels, `de` → `d'` before vowels) - Correct article usage (`Le`, `La`, `L'`, `Les`) - BANGS adjective placement (e.g. *beau/belle* placed before the noun) - Proper verb conjugation across first and third person - Singular/plural agreement --- ## Columns | Column | Type | Description | |---------------|--------|-------------| | `english` | string | Source sentence in English | | `french` | string | Target translation in French | | `instruction` | string | Task instruction: *"Translate the following English sentence into French."* | | `prompt` | string | Alpaca-style prompt (`### Instruction / ### Input / ### Response:`) | | `completion` | string | The French translation — the expected model output | | `text` | string | Full `prompt + completion` string for SFT trainers | | `source_lang` | string | Always `en` | | `target_lang` | string | Always `fr` | | `category` | string | Semantic category of the sentence (see below) | | `difficulty` | string | `beginner`, `intermediate`, or `advanced` | | `word_count` | int | Number of words in the English sentence | | `char_count` | int | Number of characters in the English sentence | | `split` | string | `train`, `validation`, or `test` | --- ## Categories Sentences are labelled across 13 semantic categories: | Category | Description | |----------------------------|-------------| | `greetings` | Hellos, goodbyes, introductions | | `food_and_dining` | Restaurants, ordering, dietary preferences | | `travel` | Transport, hotels, airports, navigation | | `health` | Medical, pharmacy, symptoms | | `work` | Office, jobs, meetings, contracts | | `finance` | Banking, payments, currency | | `technology` | Phones, internet, computers, AI | | `education` | Study, school, books, language learning | | `arts_and_culture` | Music, painting, photography, cinema | | `family_and_relationships` | Family members, friendships, emotions | | `celebrations` | Birthdays, holidays, congratulations | | `nature_and_weather` | Seasons, environment, outdoors | | `general` | Everyday expressions and common phrases | --- ## Difficulty Levels | Level | Criteria | |----------------|----------| | `beginner` | Short sentences (≤5 words), simple vocabulary | | `intermediate` | Medium sentences (6–10 words), everyday vocabulary | | `advanced` | Longer sentences with complex structure | --- ## Usage ### Load with 🤗 Datasets ```python from datasets import load_dataset ds = load_dataset("csv", data_files="french_dataset_ml.csv") ``` ### Filter by split ```python train = ds.filter(lambda x: x["split"] == "train") val = ds.filter(lambda x: x["split"] == "validation") test = ds.filter(lambda x: x["split"] == "test") ``` ### Fine-tune with TRL SFTTrainer ```python from trl import SFTTrainer trainer = SFTTrainer( model=model, train_dataset=train, dataset_text_field="text", # uses the full prompt+completion field ... ) ``` ### Fine-tune with Axolotl ```yaml datasets: - path: french_dataset_ml.csv type: alpaca data_files: french_dataset_ml.csv ``` ### Fine-tune with LLaMA-Factory ```json { "dataset": "french_dataset_ml", "dataset_format": "alpaca", "text_field": "text" } ``` --- ## Prompt Format All rows follow the **Alpaca instruction format**: ``` ### Instruction: Translate the following English sentence into French. ### Input: The city is beautiful. ### Response: La ville est belle. ``` This format is compatible with LLaMA, Mistral, Phi, Qwen, and most instruction-tuned base models. --- ## Data Construction Sentences were constructed and curated to ensure grammatical correctness in both languages. The dataset prioritises **linguistic accuracy** — every sentence has been validated for correct French grammar rules including gender agreement, elision, verb conjugation, and adjective placement. --- ## Intended Uses - **Translation fine-tuning** — teach a model to translate English → French - **LoRA adapter training** — lightweight language-specific adapters - **Instruction following** — the Alpaca format makes it suitable for general SFT - **Benchmarking** — use the pre-split `test` set for evaluation - **Educational tools** — language learning applications ## Out-of-Scope Uses This dataset covers common everyday language. It is not suitable for domain-specific translation (legal, medical, technical) without augmentation. --- ## License [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
提供机构:
ChaoticEconomist
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作