Mo7art/Stack2Graph_KG

Name: Mo7art/Stack2Graph_KG
Creator: Mo7art
Published: 2026-05-19 19:55:51
License: CC-BY-SA-4.0 (as stated in the dataset card)

Source: Hugging Face. Updated: 2026-05-19. Indexed: 2026-05-31.

Download link:

https://hf-mirror.com/datasets/Mo7art/Stack2Graph_KG

Resource description:

---
tags:
- knowledge-graph
- rdf
- n-triples
- sparql
- semantic-web
- stackoverflow
- question-answering
pretty_name: StackOverflow Knowledge Graph
license: cc-by-sa-4.0
size_categories:
- 100M<n<1B
---

# Dataset Card for StackOverflow Knowledge Graph

## Summary

This dataset contains the knowledge-graph component of Stack2Graph as language-specific N-Triples shards. It is the structured counterpart to the Stack2Graph vector dataset and is intended for QLever/SPARQL workflows rather than row-wise tabular use.

## Repository Layout

```text
kg_rdf/
  schema.nt
  python/
    chunk0.nt
    chunk1.nt
  java/
    chunk0.nt
```

The repository stores N-Triples RDF files under one folder per programming language. Upload tooling can publish one dataset per language or all configured languages.

Generated source RDF artifacts are partitioned by programming language before upload:

```text
kg_rdf/
  schema.nt
  python/
    part0.nt
    part1.nt
  java/
    part0.nt
```

## Content

The graph is generated from Stack Overflow data after SQL ingestion and keeps structural relations between questions, answers, comments, tags, vote aggregates, and question-to-question links.

The root `schema.nt` contains schema triples loaded into the default graph. Language instance triples are mapped to named graphs keyed by the original supported programming-language tag during QLever indexing, for example `http://stackoverflow.com/python`.

QLever named graphs are the canonical language rooms. The default graph is a distinct-SPO convenience view over named graph data plus `schema.nt`, so total language KG size should be measured with `GRAPH ?g { ?s ?p ?o }`; use `GRAPH <http://stackoverflow.com/{language}> { ... }` for language-scoped queries.

Questions are only retained when they match the supported language-tag set used by the project.

## Intended Use

This dataset is meant to be downloaded, extracted, and imported into an RDF-capable graph store for retrieval and analysis workflows. It is primarily intended for system reconstruction and retrieval-based experiments.

## Source

The dataset is built from the Stack Overflow dump through the Stack2Graph pipeline, including SQL import, graph construction, N-Triples serialization, and optional archive packaging.

## Limitations

- The graph only covers questions matching the supported programming-language tags.
- A question may appear in more than one named graph when it has multiple language tags. Repeated SPO triples across different language graphs are intentional language-membership duplication; only exact duplicate lines inside a single language file are safe to remove.
- Vote information is stored as aggregates rather than individual vote events.
- The dataset inherits the licensing constraints, biases, and temporal drift of Stack Overflow content.

## License

This dataset is distributed under `CC-BY-SA-4.0`.

## Citation

If you use this dataset, please cite the Stack2Graph paper:

- Stack2Graph: A Structured Knowledge Representation of Stack Overflow Data for Retrieval-based Question Answering
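The named-graph conventions and the deduplication rule described in the card can be sketched in a few lines of Python. This is a minimal illustration, not part of the Stack2Graph tooling: the helper names (`graph_iri`, `language_count_query`, `total_count_query`, `dedupe_shard_lines`) are hypothetical, and the IRI prefix is taken from the card's `http://stackoverflow.com/python` example. The queries use only standard SPARQL 1.1 `GRAPH` patterns and are returned as strings; how they are sent to a QLever endpoint is left out.

```python
# Hypothetical helpers illustrating the card's conventions; none of these
# names come from the Stack2Graph code base.

# Prefix inferred from the card's example IRI http://stackoverflow.com/python.
GRAPH_BASE = "http://stackoverflow.com/"


def graph_iri(language: str) -> str:
    """Return the named-graph IRI for one supported language tag."""
    return f"{GRAPH_BASE}{language}"


def language_count_query(language: str) -> str:
    """Build a SPARQL query counting triples inside one language graph."""
    return (
        "SELECT (COUNT(*) AS ?n) WHERE { "
        f"GRAPH <{graph_iri(language)}> {{ ?s ?p ?o }} "
        "}"
    )


def total_count_query() -> str:
    """Build a SPARQL query counting triples across all named graphs.

    A triple present in two language graphs is matched once per graph,
    which is the behaviour the card asks for when measuring total size.
    """
    return "SELECT (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } }"


def dedupe_shard_lines(lines):
    """Drop exact duplicate lines within ONE language file, keeping order.

    Do not apply this across files of different languages: the card states
    that cross-language repeats are intentional.
    """
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique
```

For example, `language_count_query("python")` yields a query scoped to `<http://stackoverflow.com/python>`, while `dedupe_shard_lines` would be applied to the lines of a single shard such as one file under `kg_rdf/python/`.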

Provided by:

Mo7art
