
Mo7art/Stack2Graph_KG

Source: Hugging Face · Updated 2026-05-19 · Indexed 2026-05-31
Download link:
https://hf-mirror.com/datasets/Mo7art/Stack2Graph_KG
Description:
---
tags:
- knowledge-graph
- rdf
- n-triples
- sparql
- semantic-web
- stackoverflow
- question-answering
pretty_name: StackOverflow Knowledge Graph
license: cc-by-sa-4.0
size_categories:
- 100M<n<1B
---

# Dataset Card for StackOverflow Knowledge Graph

## Summary

This dataset contains the knowledge-graph component of Stack2Graph as language-specific N-Triples shards. It is the structured counterpart to the Stack2Graph vector dataset and is intended for QLever/SPARQL workflows rather than row-wise tabular use.

## Repository Layout

```text
kg_rdf/
  schema.nt
  python/
    chunk0.nt
    chunk1.nt
  java/
    chunk0.nt
```

The repository stores N-Triples RDF files under one folder per programming language. Upload tooling can publish one dataset per language or all configured languages.

Generated source RDF artifacts are partitioned by programming language before upload:

```text
kg_rdf/
  schema.nt
  python/
    part0.nt
    part1.nt
  java/
    part0.nt
```

## Content

The graph is generated from Stack Overflow data after SQL ingestion and keeps structural relations between questions, answers, comments, tags, vote aggregates, and question-to-question links.

The root `schema.nt` contains schema triples loaded into the default graph. Language instance triples are mapped to named graphs keyed by the original supported programming-language tag during QLever indexing, for example `http://stackoverflow.com/python`.

QLever named graphs are the canonical language rooms. The default graph is a distinct-SPO convenience view over named graph data plus `schema.nt`, so total language KG size should be measured with `GRAPH ?g { ?s ?p ?o }`; use `GRAPH <http://stackoverflow.com/{language}> { ... }` for language-scoped queries.

Questions are only retained when they match the supported language-tag set used by the project.

## Intended Use

This dataset is meant to be downloaded, extracted, and imported into an RDF-capable graph store for retrieval and analysis workflows. It is primarily intended for system reconstruction and retrieval-based experiments.

## Source

The dataset is built from the Stack Overflow dump through the Stack2Graph pipeline, including SQL import, graph construction, N-Triples serialization, and optional archive packaging.

## Limitations

- The graph only covers questions matching the supported programming-language tags.
- A question may appear in more than one named graph when it has multiple language tags. Repeated SPO triples across different language graphs are intentional language-membership duplication; only exact duplicate lines inside a single language file are safe to remove.
- Vote information is stored as aggregates rather than individual vote events.
- The dataset inherits the licensing constraints, biases, and temporal drift of Stack Overflow content.

## License

This dataset is distributed under `CC-BY-SA-4.0`.

## Citation

If you use this dataset, please cite the Stack2Graph paper:

- Stack2Graph: A Structured Knowledge Representation of Stack Overflow Data for Retrieval-based Question Answering
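
## Example: Named-Graph Queries and Per-File Deduplication

The named-graph conventions from the Content section and the deduplication rule from the Limitations section can be sketched in a few lines of Python. This is a minimal illustration, not part of the Stack2Graph tooling: the helper names (`language_graph_iri`, `count_language_triples_query`, `count_all_graph_triples_query`, `drop_exact_duplicate_lines`) are hypothetical, and the base IRI is inferred from the card's single example, `http://stackoverflow.com/python`. The card does not say how tags containing special characters are encoded, so the sketch rejects them rather than guessing.

```python
from typing import Iterable, Iterator

# Assumption: base IRI inferred from the card's example graph
# `http://stackoverflow.com/python`.
GRAPH_BASE = "http://stackoverflow.com/"

# Characters that are not allowed inside a SPARQL IRI reference, plus '#',
# whose encoding in graph names the card does not describe.
_UNSAFE = set('<>"{}|^`\\ #')


def language_graph_iri(language: str) -> str:
    """Return the named-graph IRI for one language tag (hypothetical helper)."""
    if not language or any(ch in _UNSAFE for ch in language):
        raise ValueError(f"unsupported language tag: {language!r}")
    return GRAPH_BASE + language


def count_language_triples_query(language: str) -> str:
    """Build a SPARQL query counting triples in one language graph."""
    iri = language_graph_iri(language)
    return f"SELECT (COUNT(*) AS ?n) WHERE {{ GRAPH <{iri}> {{ ?s ?p ?o }} }}"


def count_all_graph_triples_query() -> str:
    """Build a SPARQL query counting triples over all named graphs.

    Triples repeated in several language graphs are counted once per graph,
    which matches the card's advice to measure size with `GRAPH ?g`.
    """
    return "SELECT (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } }"


def drop_exact_duplicate_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines of ONE language file, skipping exact repeats.

    Apply this per file only: repeats across different language graphs are
    intentional according to the card.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    print(count_language_triples_query("python"))
    print(count_all_graph_triples_query())
```

The query strings can be sent to whatever SPARQL endpoint serves the indexed data. `drop_exact_duplicate_lines` keeps every distinct line in memory, so for large shards an external sort-and-unique pass may be more practical.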
Provider:
Mo7art