ParthMandaliya/hotpotqa-wiki
Source: Hugging Face | Updated: 2026-01-29 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/ParthMandaliya/hotpotqa-wiki
Resource description:
---
license: cc-by-sa-4.0
language:
- en
tags:
- wikipedia
- hotpotqa
- multi-hop
- human-readable
---
# HotpotQA Wikipedia Corpus
This dataset is a processed Wikipedia corpus derived from the **HotpotQA Wikipedia dump**.
The original dump was downloaded from:
- https://nlp.stanford.edu/projects/hotpotqa/enwiki-20171001-pages-meta-current-withlinks-processed.tar.bz2
Additional details about the structure and preprocessing of the original HotpotQA Wikipedia dump are available here:
- https://hotpotqa.github.io/wiki-readme.html
---
## Dataset Description
This dataset is built on top of the original HotpotQA Wikipedia dump without modifying the raw source content.
The original Wikipedia text is preserved as-is, and additional features are derived from it to make the dataset more convenient for downstream NLP and knowledge graph tasks.
In particular, two new features are created **by processing the original `text` field**, while keeping the original field intact.
---
## Features
Each row corresponds to a single Wikipedia article and contains the following fields:
- **`id`**
A unique identifier for the article.
- **`url`**
The original Wikipedia URL of the article.
- **`title`**
The article title.
- **`charoffset`**
Character-level offset information associated with the original `text` field.
- **`text`**
The original article content from the HotpotQA Wikipedia dump.
- **NOTE:** The features `id`, `url`, `title`, `text`, and `charoffset` are carried over unmodified from the original dump.
- **`article`**
A cleaned, plain-text version of the article derived from the `text` field.
HTML markup has been removed, and anchor tags have been replaced by their visible text to preserve semantic content and readability.
- **`links`**
A structured list of hyperlinks extracted from the original `text` field.
Each entry contains the visible anchor text and the corresponding Wikipedia link target (`href`).
These human-curated entity references are useful for entity linking and knowledge graph construction.
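
To make the relationship between `text`, `article`, and `links` concrete, here is a minimal sketch of how anchor tags in a raw string could be turned into plain text plus a link list. It is an illustration only, not the preprocessing actually used to build this dataset: the helper names `extract_links` and `to_plain_text`, the `"text"` key for the anchor text, and the sample string are all assumptions, and the regular expressions only handle simple, non-nested `<a href="...">` tags.

```python
import html
import re

# Assumption: anchors look like <a href="...">visible text</a> with a
# double-quoted href and no nested anchors.
ANCHOR_RE = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(raw):
    """Return one {"text", "href"} dict per anchor tag found in raw."""
    return [
        {"text": html.unescape(TAG_RE.sub("", m.group(2))), "href": m.group(1)}
        for m in ANCHOR_RE.finditer(raw)
    ]


def to_plain_text(raw):
    """Replace anchors with their visible text, then drop any remaining tags."""
    without_anchors = ANCHOR_RE.sub(lambda m: m.group(2), raw)
    return html.unescape(TAG_RE.sub("", without_anchors))


# Hypothetical sample sentence in the style of the raw `text` field.
sample = (
    'Anarchism is a <a href="political%20philosophy">political philosophy</a> '
    'that advocates <a href="self-governance">self-governed</a> societies.'
)
print(to_plain_text(sample))
print(extract_links(sample))
```

The `href` values are kept exactly as they appear in the markup; whether the dataset's `links` field stores them encoded or decoded should be checked against actual rows.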
---
## Motivation and Use Cases
This dataset is designed to support a wide range of NLP and graph-based applications, including:
- Multi-hop question answering
- Retrieval-augmented generation (RAG)
- Knowledge graph construction
- Entity linking
- Relation extraction
- Large-scale analysis of Wikipedia content
By preserving the original raw text while also providing cleaned text and structured link metadata, the dataset enables flexible experimentation across different pipelines.
---
## Loading the Dataset
```python
from datasets import load_dataset

# Stream the train split so rows are fetched lazily instead of downloading the full corpus.
ds = load_dataset("ParthMandaliya/hotpotqa-wiki", streaming=True, split="train")
```
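
Once rows are available, the `links` field can feed a simple article-to-article edge list for knowledge graph work. The sketch below is a hedged illustration: `collect_link_edges` is a hypothetical helper, and it assumes each row's `links` value is a list of dicts carrying an `href` key, which should be verified against real rows before use. It runs on a small in-memory stand-in so it works without network access.

```python
from collections import defaultdict
from itertools import islice


def collect_link_edges(rows, limit=None):
    """Map each article title to the link targets listed in its `links` field.

    Assumption: row["links"] is a list of dicts with an "href" key.
    """
    edges = defaultdict(list)
    for row in islice(rows, limit):  # limit=None consumes every row
        for link in row.get("links") or []:
            href = link.get("href")
            if href:
                edges[row["title"]].append(href)
    return dict(edges)


# Tiny in-memory stand-in for streamed rows (hypothetical values).
sample_rows = [
    {"title": "A", "links": [{"text": "b", "href": "B"}, {"text": "c", "href": "C"}]},
    {"title": "B", "links": []},
    {"title": "C", "links": [{"text": "a", "href": "A"}]},
]
print(collect_link_edges(sample_rows, limit=2))
```

If the assumption about the `links` layout holds, the same function could be applied to the streamed dataset, for example `collect_link_edges(ds, limit=1000)`, to sample a subgraph without reading the whole corpus.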
Provider:
ParthMandaliya



