JulianKrgd/Wikipedia_FR_2M

Name: JulianKrgd/Wikipedia_FR_2M
Creator: JulianKrgd
Published: 2025-12-08 20:27:58
License: No description available

Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/JulianKrgd/Wikipedia_FR_2M

Resource overview:

---
language:
- fr
license: cc-by-sa-4.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- feature-extraction
tags:
- wikipedia
- french
- pretrain
- llm
pretty_name: Wikipedia FR
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: char_count
    dtype: int64
  - name: word_count
    dtype: int64
  - name: scraped_at
    dtype: string
---

# Wikipedia FR Dataset

French Wikipedia articles parsed from the official Wikimedia dump, cleaned and formatted for LLM pre-training.

## Dataset Description

| Property | Value |
|----------|-------|
| **Articles** | 2,368,933 |
| **Language** | French |
| **Size** | 7.7 GB |
| **Format** | JSONL |
| **Source** | Wikipedia (December 2025) |
| **License** | CC BY-SA 4.0 |

## Data Format

Each line is a JSON object with the following fields:

```json
{
  "id": "wikipedia_fr_12345",
  "title": "Intelligence artificielle",
  "text": "L'intelligence artificielle est un domaine...",
  "url": "https://fr.wikipedia.org/wiki/Intelligence_artificielle",
  "source": "wikipedia_fr",
  "language": "fr",
  "char_count": 15234,
  "word_count": 2341,
  "scraped_at": "2024-12-08T19:30:00"
}
```

## Usage

### With Hugging Face Datasets

```python
from datasets import load_dataset

# Repository id as written in the dataset card; the download link on this
# page uses "JulianKrgd/Wikipedia_FR_2M" instead.
dataset = load_dataset("JulianKrgd/Wikipedia-Fr")
```

### Direct JSONL loading

```python
import json

# Read the file as UTF-8 so accented French characters decode correctly.
with open("wikipedia_fr_dump.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)  # one JSON object per line
        print(article["title"])
```

## Processing Details

- Parsed from the official Wikimedia XML dump
- Removed redirects, stubs, and non-article pages
- Cleaned wikitext markup (templates, references, HTML)
- Filtered out articles with fewer than 200 characters or fewer than 50 words
- Preserved article structure and plain text content

## Intended Use

- LLM pre-training
- French NLP research
- Text generation fine-tuning
- Knowledge extraction

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{wikipedia_fr_2024,
  title={Wikipedia FR Dataset},
  author={JulianKrgd},
  year={2025},
  url={https://huggingface.co/datasets/JulianKrgd/Wikipedia-Fr}
}
```

## Acknowledgments

Data sourced from [Wikimedia Foundation](https://dumps.wikimedia.org/).
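The length filter listed under Processing Details (drop articles with fewer than 200 characters or fewer than 50 words) can be illustrated with a minimal sketch. This is not the author's pipeline: the helper names `passes_length_filter` and `iter_kept_articles` are hypothetical, and the counting rules (characters via `len()`, words via whitespace splitting) are assumptions, since the dataset card does not say how `char_count` and `word_count` were computed.

```python
import json
from typing import Iterable, Iterator

MIN_CHARS = 200  # threshold stated in Processing Details
MIN_WORDS = 50   # threshold stated in Processing Details


def passes_length_filter(text: str) -> bool:
    """Return True if the text meets both minimum-length thresholds.

    Assumption: characters are counted with len() and words by
    whitespace splitting; the actual counting rules are not documented.
    """
    return len(text) >= MIN_CHARS and len(text.split()) >= MIN_WORDS


def iter_kept_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSONL records whose "text" passes the length filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        article = json.loads(line)
        if passes_length_filter(article.get("text", "")):
            yield article
```

Because `iter_kept_articles` accepts any iterable of lines, it could be applied directly to an open file handle, for example `open("wikipedia_fr_dump.jsonl", encoding="utf-8")`, without loading the 7.7 GB file into memory.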

Provider:

JulianKrgd
