ParthMandaliya/hotpotqa-wiki

Name: ParthMandaliya/hotpotqa-wiki
Creator: ParthMandaliya
Published: 2026-01-29 11:56:24
License: cc-by-sa-4.0

Hugging Face · Updated 2026-01-29 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/ParthMandaliya/hotpotqa-wiki

Resource description:

---
license: cc-by-sa-4.0
language:
- en
tags:
- wikipedia
- hotpotqa
- multi-hop
- human-readable
---

# HotpotQA Wikipedia Corpus

This dataset is a processed Wikipedia corpus derived from the **HotpotQA Wikipedia dump**.

The original dump was downloaded from:

- https://nlp.stanford.edu/projects/hotpotqa/enwiki-20171001-pages-meta-current-withlinks-processed.tar.bz2

Additional details about the structure and preprocessing of the original HotpotQA Wikipedia dump are available here:

- https://hotpotqa.github.io/wiki-readme.html

---

## Dataset Description

This dataset is built on top of the original HotpotQA Wikipedia dump without modifying the raw source content. The original Wikipedia text is preserved as-is, and additional features are derived from it to make the dataset more convenient for downstream NLP and knowledge graph tasks.

In particular, two new features are created **by processing the original `text` field**, while keeping the original field intact.

---

## Features

Each row corresponds to a single Wikipedia article and contains the following fields:

- **`id`**
  A unique identifier for the article.

- **`url`**
  The original Wikipedia URL of the article.

- **`title`**
  The article title.

- **`charoffset`**
  Character-level offset information associated with the original `text` field.

- **`text`**
  The original article content from the HotpotQA Wikipedia dump.

- **NOTE: The features `id`, `url`, `title`, `text`, and `charoffset` are unmodified.**

- **`article`**
  A cleaned, plain-text version of the article derived from the `text` field. HTML markup has been removed, and anchor tags have been replaced by their visible text to preserve semantic content and readability.

- **`links`**
  A structured list of hyperlinks extracted from the original `text` field. Each entry contains:
  - the visible anchor text
  - the corresponding Wikipedia link target (`href`)

  This feature provides human-curated entity references that are useful for entity linking and knowledge graph construction.

---

## Motivation and Use Cases

This dataset is designed to support a wide range of NLP and graph-based applications, including:

- Multi-hop question answering
- Retrieval-augmented generation (RAG)
- Knowledge graph construction
- Entity linking
- Relation extraction
- Large-scale analysis of Wikipedia content

By preserving the original raw text while also providing cleaned text and structured link metadata, the dataset enables flexible experimentation across different pipelines.

---

## Loading the Dataset

```python
from datasets import load_dataset

ds = load_dataset("ParthMandaliya/hotpotqa-wiki", streaming=True, split="train")
```
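
The relationship between `text`, `article`, and `links` described under Features can be illustrated with a minimal sketch. This is not the author's preprocessing code. It assumes the raw content is a string containing `<a href="...">` anchors, and the class `AnchorExtractor`, the function `clean_and_extract`, and the dictionary keys `"text"` and `"href"` are hypothetical names chosen for illustration; the actual schema of the `links` field may differ.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect visible text and (anchor text, href) pairs from an HTML fragment.

    Hypothetical helper: an illustration of the idea, not the dataset's own code.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []      # all visible text, in document order
        self.links = []           # one dict per <a href="..."> element
        self._href = None         # href of the anchor currently open, if any
        self._anchor_parts = []   # visible text inside the open anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_parts = []

    def handle_data(self, data):
        # Keep every piece of visible text, so anchors are replaced by their text.
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(
                {"text": "".join(self._anchor_parts), "href": self._href}
            )
            self._href = None


def clean_and_extract(raw):
    """Return (plain_text, links) for a raw HTML-like string."""
    parser = AnchorExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.text_parts), parser.links


# Made-up sample string, not taken from the dataset.
sample = 'The <a href="Eiffel%20Tower">Eiffel Tower</a> is in <a href="Paris">Paris</a>.'
article, links = clean_and_extract(sample)
print(article)
print(links)
```

The sketch leaves `href` values exactly as they appear in the markup (for example, still percent-encoded); whether the published `links` feature decodes them is not stated in the dataset card.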

Provided by:

ParthMandaliya
