jo-s-eph/gow-qa
Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/jo-s-eph/gow-qa
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
language:
- en
tags:
- graph-of-words
- qa
- benchmark
- wikipedia
- gemma
pretty_name: GoW-QA
size_categories:
- 1K<n<10K
annotations_creators:
- LLM-generated
source_datasets:
- wikipedia
---
# GoW-QA: Graph-of-Words Question Answering Benchmark
A benchmark dataset for evaluating Graph-of-Words (GoW) representations in Question Answering tasks. The dataset contains Wikipedia paragraphs converted into graph structures, with question-answer pairs generated by Gemma-4-31B for evaluating how well graph-based representations preserve textual information for QA.
## Dataset Summary
| Metric | Value |
|--------|-------|
| **Total Paragraphs** | 1,621 |
| **Total Articles** | 497 |
| **Total Questions** | 8,105 |
| **Questions per Paragraph** | 5 |
| **Question Types** | FACTUAL, RELATIONAL, SUMMARIZATION |
| **Graph Configuration** | naive_w5 (window=5, all tokens) |
## Dataset Structure
Each record in the dataset contains:
- `id`: Unique paragraph identifier (format: `{doc_id}_p{index}`)
- `doc_id`: Wikipedia article identifier
- `title`: Title of the source Wikipedia article
- `para_index`: Paragraph position within the article (0-indexed)
- `context`: The raw paragraph text from Wikipedia
- `graph_config`: Graph construction configuration (default: `naive_w5`)
- `adjacency_matrix`: GoW adjacency matrix (N×N, where N = number of tokens)
- `node_labels`: Vocabulary/list of tokens in the graph
- `qa_pairs`: List of 5 question-answer pairs
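
As a minimal sketch of how the `adjacency_matrix` and `node_labels` fields fit together, the snippet below turns a matrix into a weighted edge list. It assumes the matrix is stored as nested lists, that row `i` is the source node and column `j` the target, and that `node_labels[i]` names node `i`; none of this is confirmed by the card. The helper `matrix_to_edges` and the toy record are hypothetical, with values borrowed from the adjacency-list example in the Graph Construction section.

```python
from typing import List, Tuple


def matrix_to_edges(
    adjacency_matrix: List[List[int]], node_labels: List[str]
) -> List[Tuple[str, str, int]]:
    """Return (source, target, weight) for every non-zero cell.

    Assumption: rows are source nodes and columns are target nodes.
    """
    edges = []
    for i, row in enumerate(adjacency_matrix):
        for j, weight in enumerate(row):
            if weight:
                edges.append((node_labels[i], node_labels[j], weight))
    return edges


# Toy record that mirrors two of the documented fields; not real dataset content.
record = {
    "node_labels": ["lamkhaga", "pass", "trek", "connect"],
    "adjacency_matrix": [
        [0, 3, 2, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ],
}

edges = matrix_to_edges(record["adjacency_matrix"], record["node_labels"])
for source, target, weight in edges:
    print(f"{source} → {target} (w={weight})")
```

If the Parquet file stores the matrix in another layout (for example flattened or as NumPy arrays), the loop would need to be adapted accordingly.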
### Question Types
| Type | Description | Example |
|------|-------------|---------|
| `FACTUAL` | Specific facts, dates, names, locations | "When was X born?" |
| `RELATIONAL` | Relationships between entities | "Who is X's brother?" |
| `SUMMARIZATION` | Main topic or overall meaning | "What is the main topic?" |
## Graph Construction (GoW)
The Graph-of-Words representation is built using:
- **Type**: `naive` — All tokens (no filtering)
- **Window Size**: 5 (sliding window, W=5)
- **Weighting**: Co-occurrence count within window
- **Directed**: Yes
- **Lemmatization**: Enabled (using spaCy `en_core_web_sm`)
Example adjacency list representation:
```
lamkhaga → pass (w=3)
lamkhaga → trek (w=2)
pass → connect (w=1)
```
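
The construction described above can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the authors' pipeline: it takes already-lemmatized tokens as input (the spaCy step is omitted), it assumes a window of size W covers a token plus the W-1 tokens that follow it, and it points edges from the earlier token to the later one. The function name `build_gow` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_gow(tokens: List[str], window: int = 5) -> Dict[Tuple[str, str], int]:
    """Count directed co-occurrences of tokens inside a sliding window."""
    counts: Counter = Counter()
    for i, source in enumerate(tokens):
        # Assumption: the window holds the source token and the next window-1 tokens.
        for target in tokens[i + 1 : i + window]:
            counts[(source, target)] += 1
    return dict(counts)


# Toy input with a small window so that the counts are easy to check by hand.
tokens = ["lamkhaga", "pass", "trek", "connect", "lamkhaga", "pass"]
graph = build_gow(tokens, window=3)

for (source, target), weight in graph.items():
    print(f"{source} → {target} (w={weight})")
```

With the dataset's documented setting, the call would be `build_gow(tokens, window=5)`. Whether repeated tokens inside one window produce self-loops in the released graphs is not stated in the card.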
## Source
The dataset is built from **Wikipedia** articles (dumped: `2022-03-01`).
> Wikipedia is a multilingual online encyclopedia. Wikipedia's content is published under the Creative Commons Attribution-Share-Alike License. Articles were sampled randomly from the full Wikipedia dump for diversity.
**Sampling:** 500 articles were randomly sampled (seed=42) from a 10,000-article subset of the dump; 497 articles are represented in the released dataset (see Dataset Summary).
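
A seeded sample of this kind could be reproduced along the following lines. The card does not say which random number generator or library was used, so Python's `random` module, the helper `sample_articles`, and the placeholder article IDs are all assumptions made for illustration.

```python
import random
from typing import List


def sample_articles(article_ids: List[str], k: int = 500, seed: int = 42) -> List[str]:
    """Draw k distinct articles with a fixed seed so the draw is repeatable."""
    rng = random.Random(seed)
    return rng.sample(article_ids, k)


# Hypothetical stand-in for the 10,000-article subset.
pool = [f"article_{i}" for i in range(10_000)]
sampled = sample_articles(pool)
print(len(sampled))
```

Using a dedicated `random.Random` instance keeps the draw independent of any other code that touches the global random state.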
## Motivation
This benchmark addresses a fundamental question in graph-based NLP:
> **Can graph structural representations preserve sufficient information for Question Answering?**
The dataset enables comparison between:
- **Raw text QA** (upper bound - what transformer models are trained on)
- **Graph-serialized QA** (probing what information is preserved in GoW)
## Use Cases
1. **Information Preservation Analysis**: Measure how much information is lost when converting text → graph
2. **Graph Representation Learning**: Train/evaluate GNN encoders on QA tasks
3. **Benchmarking**: Compare different GoW configurations (window size, node types, etc.)
4. **Future Work**: Enable research on graph-augmented LLM architectures
## Baseline Results
| Configuration | Match Rate | Notes |
|--------------|------------|-------|
| Control (raw text) | ~87% | Upper bound |
| GoW (naive_w5) | ~58% | Information preserved in graph |
| **Information Gap** | ~29 percentage points | Information lost in serialization (control minus GoW) |
> Note: These are preliminary results from Gemma-4-31B evaluation. Full benchmark evaluation pending.
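
The card does not define how the match rate is computed. As a rough sketch, the snippet below assumes a case-insensitive exact match between predicted and reference answers and reports the gap between two configurations in percentage points. The function `match_rate` and all answers shown are invented for illustration and are not taken from the dataset or from the evaluation above.

```python
from typing import List


def match_rate(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to the reference after trimming and lowercasing."""
    if not references:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Made-up answers for four questions under two input conditions.
references = ["1987", "Paris", "His brother", "A mountain pass"]
control_predictions = ["1987", "paris", "his brother", "a river"]
gow_predictions = ["1987", "London", "his brother", "a river"]

control_rate = match_rate(control_predictions, references)
gow_rate = match_rate(gow_predictions, references)
gap = control_rate - gow_rate  # difference in percentage points

print(f"control={control_rate:.1f}% gow={gow_rate:.1f}% gap={gap:.1f} pp")
```

A looser criterion, such as substring or LLM-judged matching, would yield different numbers, so the metric should be fixed before comparing GoW configurations.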
## Dataset Versions
| File | Description |
|------|-------------|
| `gow_qa.parquet` | Flat table (1 row per QA pair) - for easy loading |
| `gow_qa_full.parquet` | Full data with adjacency matrices - for graph research |
## Loading the Dataset
```python
# Basic loading from the Hugging Face Hub
from datasets import load_dataset
ds = load_dataset("jo-s-eph/gow-qa")

# Or load directly from a local Parquet file
import pandas as pd
df = pd.read_parquet("gow_qa.parquet")
```
## Citation
If you use this dataset, please cite:
```
@article{gow-qa-2026,
title={GoW-QA: A Graph-of-Words Question Answering Benchmark},
author={},
year={2026}
}
```
## License
This dataset is based on Wikipedia content, which is licensed under the **Creative Commons Attribution-ShareAlike 3.0 License (CC BY-SA 3.0)**.
The dataset itself (graph structures, QA pairs, annotations) is made available under the same license.
## Limitations
- **Domain**: Primarily biographical/encyclopedic Wikipedia (limited to 497 articles)
- **Language**: English only
- **Graph Config**: Only `naive_w5` evaluated in current version
- **QA Pairs**: Generated by Gemma-4-31B (may contain minor errors)
## Future Work
- [ ] Expand to more Wikipedia domains (scientific, historical, technical)
- [ ] Evaluate additional GoW configurations (noun, nounChunks, different window sizes)
- [ ] Train GNN encoders on the dataset
- [ ] Multi-lingual extension
- [ ] Human-verified gold answers
## Contact
For questions, issues, or collaboration inquiries, please open a GitHub issue.
---
**Dataset Card created:** April 2026
Provider: jo-s-eph