
ParthMandaliya/hotpotqa-wiki

Source: Hugging Face. Updated 2026-01-29. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ParthMandaliya/hotpotqa-wiki
Resource description:
---
license: cc-by-sa-4.0
language:
- en
tags:
- wikipedia
- hotpotqa
- multi-hop
- human-readable
---

# HotpotQA Wikipedia Corpus

This dataset is a processed Wikipedia corpus derived from the **HotpotQA Wikipedia dump**.

The original dump was downloaded from:

- https://nlp.stanford.edu/projects/hotpotqa/enwiki-20171001-pages-meta-current-withlinks-processed.tar.bz2

Additional details about the structure and preprocessing of the original HotpotQA Wikipedia dump are available here:

- https://hotpotqa.github.io/wiki-readme.html

---

## Dataset Description

This dataset is built on top of the original HotpotQA Wikipedia dump without modifying the raw source content. The original Wikipedia text is preserved as-is, and additional features are derived from it to make the dataset more convenient for downstream NLP and knowledge graph tasks.

In particular, two new features are created **by processing the original `text` field**, while keeping the original field intact.

---

## Features

Each row corresponds to a single Wikipedia article and contains the following fields:

- **`id`**
  A unique identifier for the article.
- **`url`**
  The original Wikipedia URL of the article.
- **`title`**
  The article title.
- **`charoffset`**
  Character-level offset information associated with the original `text` field.
- **`text`**
  The original article content from the HotpotQA Wikipedia dump.
- **NOTE: The features `id`, `url`, `title`, `text`, and `charoffset` are unmodified.**
- **`article`**
  A cleaned, plain-text version of the article derived from the `text` field. HTML markup has been removed, and anchor tags have been replaced by their visible text to preserve semantic content and readability.
- **`links`**
  A structured list of hyperlinks extracted from the original `text` field. Each entry contains:
  - the visible anchor text
  - the corresponding Wikipedia link target (`href`)

  This feature provides human-curated entity references that are useful for entity linking and knowledge graph construction.

---

## Motivation and Use Cases

This dataset is designed to support a wide range of NLP and graph-based applications, including:

- Multi-hop question answering
- Retrieval-augmented generation (RAG)
- Knowledge graph construction
- Entity linking
- Relation extraction
- Large-scale analysis of Wikipedia content

By preserving the original raw text while also providing cleaned text and structured link metadata, the dataset enables flexible experimentation across different pipelines.

---

## Loading the Dataset

```python
from datasets import load_dataset

ds = load_dataset("ParthMandaliya/hotpotqa-wiki", streaming=True, split="train")
```
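
As an illustration of the knowledge graph use case, the sketch below turns the `title` and `links` fields into directed article-to-article edges. It is a minimal sketch under stated assumptions, not part of the dataset's own tooling: the key names `"text"` and `"href"` inside each `links` entry are assumed (the card only says each entry holds the anchor text and the link target), so check `ds.features` for the actual schema first. The helper names `iter_link_edges` and `build_adjacency` are hypothetical, and `sample_rows` is made-up placeholder data, not real rows from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Mapping, Set, Tuple

# ASSUMPTION: each entry of `links` is a mapping with these two keys.
# The dataset card does not name them; verify against `ds.features`.
ANCHOR_KEY = "text"
HREF_KEY = "href"


def iter_link_edges(rows: Iterable[Mapping]) -> Iterator[Tuple[str, str, str]]:
    """Yield (source_title, target_href, anchor_text) for every link in every row."""
    for row in rows:
        for link in row.get("links") or []:
            href = link.get(HREF_KEY)
            if href:  # skip entries that have no link target
                yield row["title"], href, link.get(ANCHOR_KEY, "")


def build_adjacency(rows: Iterable[Mapping]) -> Dict[str, Set[str]]:
    """Collect the outgoing link targets of each article into a set."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, target, _anchor in iter_link_edges(rows):
        graph[source].add(target)
    return dict(graph)


# Hypothetical placeholder rows that mimic the assumed row shape.
sample_rows = [
    {
        "title": "Alpha",
        "links": [
            {"text": "Beta", "href": "Beta"},
            {"text": "the gamma page", "href": "Gamma"},
        ],
    },
    {"title": "Beta", "links": [{"text": "Alpha", "href": "Alpha"}]},
    {"title": "Gamma", "links": []},
]

edges = list(iter_link_edges(sample_rows))
graph = build_adjacency(sample_rows)
print(graph)
```

To try this on the real corpus, pass a bounded slice of the streaming dataset instead of `sample_rows`, for example `itertools.islice(ds, 1000)`. Both helpers consume rows lazily, so they do not require the full dump in memory, although the adjacency dictionary itself grows with the number of articles processed.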
Provider: ParthMandaliya