JulianKrgd/Wikipedia_FR_2M
Source: Hugging Face; updated 2025-12-08; indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/JulianKrgd/Wikipedia_FR_2M
Resource description:
---
language:
- fr
license: cc-by-sa-4.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- feature-extraction
tags:
- wikipedia
- french
- pretrain
- llm
pretty_name: Wikipedia FR
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: char_count
    dtype: int64
  - name: word_count
    dtype: int64
  - name: scraped_at
    dtype: string
---
# Wikipedia FR Dataset
French Wikipedia articles parsed from the official Wikimedia dump, cleaned and formatted for LLM pre-training.
## Dataset Description
| Property | Value |
|----------|-------|
| **Articles** | 2,368,933 |
| **Language** | French |
| **Size** | 7.7 GB |
| **Format** | JSONL |
| **Source** | Wikipedia (December 2025) |
| **License** | CC BY-SA 4.0 |
## Data Format
Each line is a JSON object with the following fields:
```json
{
  "id": "wikipedia_fr_12345",
  "title": "Intelligence artificielle",
  "text": "L'intelligence artificielle est un domaine...",
  "url": "https://fr.wikipedia.org/wiki/Intelligence_artificielle",
  "source": "wikipedia_fr",
  "language": "fr",
  "char_count": 15234,
  "word_count": 2341,
  "scraped_at": "2024-12-08T19:30:00"
}
```
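The field list above can be checked mechanically when reading records. The following is a minimal sketch, not part of the dataset's own tooling: the `WikiArticle` class, the `EXPECTED_TYPES` table, and the `parse_article` helper are hypothetical names introduced here, and the types are taken from the `dataset_info` block (`string` mapped to `str`, `int64` to `int`).

```python
import json
from dataclasses import dataclass

# Field names and types as listed in the dataset card (string -> str, int64 -> int).
EXPECTED_TYPES = {
    "id": str,
    "title": str,
    "text": str,
    "url": str,
    "source": str,
    "language": str,
    "char_count": int,
    "word_count": int,
    "scraped_at": str,
}


@dataclass(frozen=True)
class WikiArticle:
    """One record of the JSONL file (hypothetical helper type)."""
    id: str
    title: str
    text: str
    url: str
    source: str
    language: str
    char_count: int
    word_count: int
    scraped_at: str


def parse_article(line: str) -> WikiArticle:
    """Parse one JSONL line; raise if a field is missing or has the wrong type."""
    raw = json.loads(line)
    for name, expected in EXPECTED_TYPES.items():
        if name not in raw:
            raise KeyError(f"missing field: {name}")
        if not isinstance(raw[name], expected):
            raise TypeError(f"field {name!r} should be {expected.__name__}")
    # Extra keys, if any, are ignored rather than rejected.
    return WikiArticle(**{name: raw[name] for name in EXPECTED_TYPES})
```

Calling `parse_article(line)` on each line of the file gives attribute access (`article.title`, `article.word_count`) and fails early on malformed records.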
## Usage
### With Hugging Face Datasets
```python
from datasets import load_dataset

# Repository ID as given in the page title and download link.
dataset = load_dataset("JulianKrgd/Wikipedia_FR_2M")
```
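At 7.7 GB, it may be preferable to stream records rather than download everything first. The sketch below is one way to do that under some assumptions: the repository ID is taken from the page title and download link, the split name `"train"` is assumed rather than confirmed by the card, and `long_articles` and `preview_titles` are hypothetical helper names. `streaming=True` is a standard `load_dataset` option.

```python
from itertools import islice
from typing import Iterable, Iterator


def long_articles(records: Iterable[dict], min_words: int = 1000) -> Iterator[dict]:
    """Yield records whose `word_count` field is at least `min_words`."""
    for record in records:
        if record["word_count"] >= min_words:
            yield record


def preview_titles(repo_id: str = "JulianKrgd/Wikipedia_FR_2M", limit: int = 5) -> list:
    """Stream the dataset and return the titles of the first few long articles."""
    # Imported lazily so the filter above works without the library installed.
    from datasets import load_dataset

    # Assumption: the data is exposed as a single "train" split.
    stream = load_dataset(repo_id, split="train", streaming=True)
    return [record["title"] for record in islice(long_articles(stream), limit)]
```

`long_articles` accepts any iterable of dictionaries, so the same filter also works on records read from the local JSONL file.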
### Direct JSONL loading
```python
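import json

# French text: read as UTF-8 explicitly rather than relying on the platform default.
with open("wikipedia_fr_dump.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)  # one article per line
        print(article["title"])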
```
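Because every record carries precomputed `char_count` and `word_count` fields, corpus-level statistics can be gathered in a single pass without re-tokenizing the text. This is a minimal sketch; `corpus_stats` is a hypothetical helper, and the file path is whatever local JSONL file you have.

```python
import json


def corpus_stats(path: str) -> dict:
    """Sum article, character, and word counts over a JSONL file in one pass."""
    articles = chars = words = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            articles += 1
            chars += record["char_count"]
            words += record["word_count"]
    return {
        "articles": articles,
        "chars": chars,
        "words": words,
        # Mean article length in words; 0.0 for an empty file.
        "mean_words": words / articles if articles else 0.0,
    }
```

For example, `corpus_stats("wikipedia_fr_dump.jsonl")["articles"]` should match the article count reported in the table above if the file is complete.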
## Processing Details
- Parsed from official Wikimedia XML dump
- Removed redirects, stubs, and non-article pages
- Cleaned wikitext markup (templates, references, HTML)
- Dropped articles with fewer than 200 characters or fewer than 50 words
- Preserved article structure and plain text content
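
The length filter in the list above can be sketched as follows. This is an illustration only, not the author's pipeline: the card does not say how characters and words were counted, so the sketch assumes `len(text)` for characters and whitespace splitting for words. The names `MIN_CHARS`, `MIN_WORDS`, `count_words`, and `keep_article` are hypothetical.

```python
MIN_CHARS = 200  # threshold stated in the processing details
MIN_WORDS = 50   # threshold stated in the processing details


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a word)."""
    return len(text.split())


def keep_article(text: str) -> bool:
    """Return True if the article meets both minimum-length thresholds."""
    return len(text) >= MIN_CHARS and count_words(text) >= MIN_WORDS


def with_counts(record: dict) -> dict:
    """Return a copy of the record with `char_count` and `word_count` filled in."""
    text = record["text"]
    return {**record, "char_count": len(text), "word_count": count_words(text)}
```

An article is dropped if it fails either threshold, so a long string with very few words and a 60-word text under 200 characters are both rejected.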
## Intended Use
- LLM pre-training
- French NLP research
- Text generation fine-tuning
- Knowledge extraction
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{wikipedia_fr_2024,
  title={Wikipedia FR Dataset},
  author={JulianKrgd},
  year={2025},
  url={https://huggingface.co/datasets/JulianKrgd/Wikipedia_FR_2M}
}
```
## Acknowledgments
Data sourced from [Wikimedia Foundation](https://dumps.wikimedia.org/).
Provider:
JulianKrgd



