Mo7art/Stack2Graph_KG
Hugging Face; updated 2026-05-19; indexed 2026-05-31
Download link:
https://hf-mirror.com/datasets/Mo7art/Stack2Graph_KG
Resource description:
---
tags:
- knowledge-graph
- rdf
- n-triples
- sparql
- semantic-web
- stackoverflow
- question-answering
pretty_name: StackOverflow Knowledge Graph
license: cc-by-sa-4.0
size_categories:
- 100M<n<1B
---
# Dataset Card for StackOverflow Knowledge Graph
## Summary
This dataset contains the knowledge-graph component of Stack2Graph as language-specific N-Triples shards.
It is the structured counterpart to the Stack2Graph vector dataset and is intended for QLever/SPARQL workflows rather than row-wise tabular use.
## Repository Layout
```text
kg_rdf/
  schema.nt
  python/
    chunk0.nt
    chunk1.nt
  java/
    chunk0.nt
```
The repository stores N-Triples RDF files under one folder per programming language. Upload tooling can publish one dataset per language or all configured languages.
Generated source RDF artifacts are partitioned by programming language before upload:
```text
kg_rdf/
  schema.nt
  python/
    part0.nt
    part1.nt
  java/
    part0.nt
```
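The per-language layout above can be walked with a few lines of standard-library Python. This is a minimal sketch, assuming the tree has already been downloaded and extracted locally; `list_shards` and `_shard_key` are hypothetical helper names, not part of the Stack2Graph tooling.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple


def _shard_key(path: Path) -> Tuple[int, str]:
    # Sort on the first number in the file stem so that part10.nt
    # comes after part2.nt; files without a number sort first.
    match = re.search(r"\d+", path.stem)
    return (int(match.group()) if match else -1, path.name)


def list_shards(root: Path) -> Dict[str, List[Path]]:
    """Map each language folder under `root` to its ordered .nt shards.

    Files directly under `root` (such as schema.nt) are not included,
    because only sub-directories are treated as language folders.
    """
    shards: Dict[str, List[Path]] = {}
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(lang_dir.glob("*.nt"), key=_shard_key)
        if files:
            shards[lang_dir.name] = files
    return shards
```

Calling `list_shards(Path("kg_rdf"))` returns a dictionary keyed by folder name, which is convenient when shards are fed to an indexer one language at a time.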
## Content
The graph is generated from Stack Overflow data after SQL ingestion and keeps structural relations between questions, answers, comments, tags, vote aggregates, and question-to-question links.
The root `schema.nt` contains the schema triples, which are loaded into the default graph. During QLever indexing, each language's instance triples are mapped to a named graph keyed by the original supported programming-language tag, for example `http://stackoverflow.com/python`. These named graphs are the canonical per-language partitions ("language rooms"). The default graph is only a convenience view: it holds the distinct SPO triples of all named graphs plus `schema.nt`, so a triple that occurs in several language graphs appears there once. Measure total language KG size with `GRAPH ?g { ?s ?p ?o }` rather than against the default graph, and use `GRAPH <http://stackoverflow.com/{language}> { ... }` for language-scoped queries.
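The two query shapes described here can be assembled as plain strings. The sketch below assumes the language tag is appended verbatim to `http://stackoverflow.com/`, as in the `python` example; tags containing characters that are not valid in an IRI may be encoded differently by the pipeline. The function names are hypothetical.

```python
GRAPH_BASE = "http://stackoverflow.com/"  # prefix taken from the example IRI


def language_graph_iri(language: str) -> str:
    # Assumption: the tag is used verbatim as the last IRI segment.
    return f"{GRAPH_BASE}{language}"


def count_all_language_triples_query() -> str:
    # Counts triples per named-graph occurrence, so a triple shared by
    # two language graphs is counted twice (unlike the default graph).
    return "SELECT (COUNT(*) AS ?triples) WHERE { GRAPH ?g { ?s ?p ?o } }"


def language_scoped_query(language: str, limit: int = 10) -> str:
    # Restricts matching to one language's named graph.
    return (
        "SELECT ?s ?p ?o WHERE { "
        f"GRAPH <{language_graph_iri(language)}> {{ ?s ?p ?o }} "
        f"}} LIMIT {limit}"
    )
```

The resulting strings can be sent to whatever SPARQL endpoint the QLever instance exposes; the endpoint URL depends on the local deployment and is not specified here.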
Questions are retained only when they carry at least one tag from the supported language-tag set used by the project.
## Intended Use
This dataset is meant to be downloaded, extracted, and imported into an RDF-capable graph store for retrieval and analysis workflows.
It is primarily intended for system reconstruction and retrieval-based experiments.
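For graph stores that ingest N-Quads rather than assigning graphs at index time, one option is to attach the language graph IRI to each statement before import. The sketch below is not the project's method (the card describes the mapping as happening during QLever indexing); it assumes one statement per line with no trailing comment after the final dot, and `to_nquads` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def to_nquads(lines: Iterable[str], graph_iri: str) -> Iterator[str]:
    """Turn N-Triples statements into N-Quads statements in `graph_iri`."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        if not line.endswith("."):
            raise ValueError(f"not a complete N-Triples statement: {line!r}")
        # Drop the terminating dot, then append the graph label and a new dot.
        yield f"{line[:-1].rstrip()} <{graph_iri}> ."
```

Because only the final dot is touched, literals that themselves contain ` .` are left unchanged.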
## Source
The dataset is built from the Stack Overflow dump through the Stack2Graph pipeline, including SQL import, graph construction, N-Triples serialization, and optional archive packaging.
## Limitations
- The graph only covers questions matching the supported programming-language tags.
- A question may appear in more than one named graph when it has multiple language tags. Repeated SPO triples across different language graphs are intentional language-membership duplication; only exact duplicate lines inside a single language file are safe to remove.
- Vote information is stored as aggregates rather than individual vote events.
- The dataset inherits the licensing constraints, biases, and temporal drift of Stack Overflow content.
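The duplicate-removal rule in the list above (exact duplicate lines within a single language file only) can be sketched as a streaming filter. This is a minimal illustration with a hypothetical function name; it keeps every distinct line in memory, so large shards may need a disk-backed or sort-based approach instead.

```python
from typing import Iterable, Iterator, Set


def drop_exact_duplicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line of ONE language file once, keeping first occurrences.

    Do not run this across files from different language folders:
    repeated triples across languages are intentional.
    """
    seen: Set[str] = set()
    for line in lines:
        key = line.rstrip("\r\n")  # compare content, ignoring line endings
        if key in seen:
            continue
        seen.add(key)
        yield line
```

Comparison is byte-for-byte on the line text, so two serializations of the same triple that differ in whitespace are treated as different lines and both kept.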
## License
This dataset is distributed under `CC-BY-SA-4.0`.
## Citation
If you use this dataset, please cite the Stack2Graph paper:
- Stack2Graph: A Structured Knowledge Representation of Stack Overflow Data for Retrieval-based Question Answering
Provider:
Mo7art



