wiki_en
ModelScope community, updated 2025-12-05, indexed 2025-06-28
Download link:
https://modelscope.cn/datasets/OrdalieTech/wiki_en
Description:
# French Wikipedia Corpus - Snapshot of April 20, 2025
## Dataset Description
This dataset contains a complete snapshot of the French-language Wikipedia encyclopedia, as it existed on April 20, 2025. It includes the latest version of each page, with its raw text content, the titles of linked pages, as well as a unique identifier.
The text of each article retains the MediaWiki formatting structure for titles (`== Section Title ==`), subtitles (`=== Subtitle ===`), and so on. This makes it particularly useful for tasks that can benefit from the document's hierarchical structure.
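
As an illustration of how the preserved heading markup might be used, here is a minimal sketch that extracts section headings and their nesting level from an article's text. The function name `extract_headings` and the sample string are hypothetical, and the regular expression assumes headings sit on their own line with matching runs of `=` on both sides.

```python
import re

# Matches a heading line such as "== History ==" or "=== Early years ===".
# The backreference \1 requires the same number of "=" on both sides.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def extract_headings(text):
    """Return (level, title) pairs in document order; level 2 is a top-level section."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(text)]


# Made-up sample text, not taken from the dataset.
sample = (
    "Intro paragraph.\n"
    "== Histoire ==\n"
    "Some text.\n"
    "=== Origines ===\n"
    "More text.\n"
    "== Géographie ==\n"
)

headings = extract_headings(sample)
for level, title in headings:
    # Indent subsections under their parent section.
    print("  " * (level - 2) + title)
```

The resulting `(level, title)` pairs could then drive section-aware chunking or outline reconstruction, depending on the task.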
This corpus is ideal for training language models, information retrieval, question-answering, and any other Natural Language Processing (NLP) research requiring a large amount of structured, encyclopedic text.
## Dataset Structure
### Data Fields
The dataset is composed of the following columns:
* **`id`** (string): A unique identifier for each article (e.g., the Wikipedia page ID).
* **`title`** (string): The title of the Wikipedia article.
* **`text`** (string): The full text content of the article. The section structure is preserved with the `==`, `===`, `====`, etc. syntax.
* **`linked_titles`** (list of strings): A list containing the titles of other Wikipedia articles that are linked from the `text` field.
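
To make the role of `linked_titles` more concrete, the sketch below counts how many articles link to each title, which is one simple way to approximate a link graph. The rows are hypothetical stand-ins shaped like the fields listed above (their values are made up), and `count_inbound_links` is an illustrative helper, not part of the dataset or of any library.

```python
from collections import Counter

# Hypothetical rows with the same fields as the dataset; the values are made up.
articles = [
    {"id": "1", "title": "Paris", "text": "...", "linked_titles": ["France", "Seine"]},
    {"id": "2", "title": "Lyon", "text": "...", "linked_titles": ["France", "Rhône"]},
    {"id": "3", "title": "France", "text": "...", "linked_titles": ["Paris"]},
]


def count_inbound_links(rows):
    """Count how many distinct articles link to each title."""
    counts = Counter()
    for row in rows:
        # Use a set so repeated links inside one article count only once.
        counts.update(set(row["linked_titles"]))
    return counts


inbound = count_inbound_links(articles)
print("Articles linking to 'France':", inbound["France"])
```

The same loop should work on rows read from the loaded dataset, assuming each row exposes `linked_titles` as a list of strings as described above.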
### Data Splits
The dataset contains only one split: `train`, which includes all the articles from the dump.
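
Because only a `train` split is provided, a held-out set has to be carved out by the user if one is needed. One possible approach, sketched below under the assumption that `id` values are stable strings, is to hash each `id` and assign the article to a split based on the hash. The function `assign_split` and the 10% default are illustrative choices, not something the dataset prescribes.

```python
import hashlib


def assign_split(article_id, holdout_fraction=0.1):
    """Deterministically map an article id to 'train' or 'validation'."""
    digest = hashlib.sha256(article_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < holdout_fraction else "train"


# Made-up ids for demonstration only.
for article_id in ["3", "7", "11"]:
    print(article_id, "->", assign_split(article_id))
```

Hash-based assignment keeps each article in the same split across runs and across dataset reloads. The `datasets` library's `Dataset.train_test_split` method is an alternative when a seeded random split is sufficient.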
## Usage
You can easily load and use this dataset with the Hugging Face `datasets` library.
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("OrdalieTech/wiki_fr")
# Display information about the dataset
print(dataset)
# >>> DatasetDict({
# >>> train: Dataset({
# >>> features: ['id', 'title', 'text', 'linked_titles'],
# >>> num_rows: 2700000 # Example
# >>> })
# >>> })
# Access an example
first_article = dataset['train'][0]
print("Title:", first_article['title'])
print("\nText excerpt:", first_article['text'][:500])
print("\nLinked titles:", first_article['linked_titles'][:5])
```
Provider: maas
Created: 2025-06-26



