
jo-s-eph/gow-qa

Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/jo-s-eph/gow-qa
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
language:
- en
tags:
- graph-of-words
- qa
- benchmark
- wikipedia
- gemma
pretty_name: GoW-QA
size_categories:
- 1K<n<10K
annotations_creators:
- LLM-generated
source_datasets:
- wikipedia
---

# GoW-QA: Graph-of-Words Question Answering Benchmark

A benchmark dataset for evaluating Graph-of-Words (GoW) representations in Question Answering tasks. The dataset contains Wikipedia paragraphs converted into graph structures, with question-answer pairs generated by Gemma-4-31B for evaluating how well graph-based representations preserve textual information for QA.

## Dataset Summary

| Metric | Value |
|--------|-------|
| **Total Paragraphs** | 1,621 |
| **Total Articles** | 497 |
| **Total Questions** | 8,105 |
| **Questions per Paragraph** | 5 |
| **Question Types** | FACTUAL, RELATIONAL, SUMMARIZATION |
| **Graph Configuration** | naive_w5 (window=5, all tokens) |

## Dataset Structure

Each record in the dataset contains:

- `id`: Unique paragraph identifier (format: `{doc_id}_p{index}`)
- `doc_id`: Wikipedia article identifier
- `title`: Title of the source Wikipedia article
- `para_index`: Paragraph position within the article (0-indexed)
- `context`: The raw paragraph text from Wikipedia
- `graph_config`: Graph construction configuration (default: `naive_w5`)
- `adjacency_matrix`: GoW adjacency matrix (N×N, where N = number of tokens)
- `node_labels`: Vocabulary/list of tokens in the graph
- `qa_pairs`: List of 5 question-answer pairs

### Question Types

| Type | Description | Example |
|------|-------------|---------|
| `FACTUAL` | Specific facts, dates, names, locations | "When was X born?" |
| `RELATIONAL` | Relationships between entities | "Who is X's brother?" |
| `SUMMARIZATION` | Main topic or overall meaning | "What is the main topic?" |

## Graph Construction (GoW)

The Graph-of-Words representation is built using:

- **Type**: `naive` (all tokens, no filtering)
- **Window Size**: 5 (sliding window, W=5)
- **Weighting**: Co-occurrence count within window
- **Directed**: Yes
- **Lemmatization**: Enabled (using spaCy `en_core_web_sm`)

Example adjacency list representation:

```
lamkhaga → pass (w=3)
lamkhaga → trek (w=2)
pass → connect (w=1)
```

## Source

The dataset is built from **Wikipedia** articles (dumped: `2022-03-01`).

> Wikipedia is a multilingual online encyclopedia. Wikipedia's content is published under the Creative Commons Attribution-Share-Alike License.

Articles were sampled randomly from the full Wikipedia dump for diversity.

**Sampling:** 500 articles were randomly sampled (seed=42) from the 10,000-article subset.

## Motivation

This benchmark addresses a fundamental question in graph-based NLP:

> **Can graph structural representations preserve sufficient information for Question Answering?**

The dataset enables comparison between:

- **Raw text QA** (upper bound - what transformer models are trained on)
- **Graph-serialized QA** (probing what information is preserved in GoW)

## Use Cases

1. **Information Preservation Analysis**: Measure how much information is lost when converting text → graph
2. **Graph Representation Learning**: Train/evaluate GNN encoders on QA tasks
3. **Benchmarking**: Compare different GoW configurations (window size, node types, etc.)
4. **Future Work**: Enable research on graph-augmented LLM architectures

## Baseline Results

| Configuration | Match Rate | Notes |
|--------------|------------|-------|
| Control (raw text) | ~87% | Upper bound |
| GoW (naive_w5) | ~58% | Information preserved in graph |
| **Information Gap** | ~29% | Information lost in serialization |

> Note: These are preliminary results from Gemma-4-31B evaluation. Full benchmark evaluation pending.

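The settings listed under Graph Construction (directed edges, sliding window, co-occurrence counts) and the adjacency-list format used for graph-serialized QA can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names `build_gow`, `to_matrix`, and `serialize` are hypothetical, whitespace tokens stand in for spaCy lemmas, and the sketch assumes that a window of size W links each token to the W-1 tokens that follow it, that self-loops are skipped, and that nodes are the unique tokens in sorted order.

```python
from collections import defaultdict


def build_gow(tokens, window=5):
    """Count directed co-occurrences inside a sliding window.

    Assumed convention: a window of size `window` spans a token plus the
    `window - 1` tokens that follow it, and edges point forward in the text.
    """
    edges = defaultdict(int)
    for i, source in enumerate(tokens):
        for target in tokens[i + 1 : i + window]:
            if target != source:  # assumption: no self-loops
                edges[(source, target)] += 1
    return dict(edges)


def to_matrix(edges, tokens):
    """Return (node_labels, adjacency_matrix) with rows as edge sources."""
    node_labels = sorted(set(tokens))  # assumption: one node per unique token
    index = {label: k for k, label in enumerate(node_labels)}
    matrix = [[0] * len(node_labels) for _ in node_labels]
    for (source, target), weight in edges.items():
        matrix[index[source]][index[target]] = weight
    return node_labels, matrix


def serialize(node_labels, matrix):
    """Render non-zero entries as 'source → target (w=weight)' lines."""
    lines = []
    for i, source in enumerate(node_labels):
        for j, target in enumerate(node_labels):
            if matrix[i][j]:
                lines.append(f"{source} → {target} (w={matrix[i][j]})")
    return "\n".join(lines)


# Toy input: a small window keeps the example easy to check by hand.
tokens = "the pass connect the valley and the pass".split()
edges = build_gow(tokens, window=3)
node_labels, matrix = to_matrix(edges, tokens)
text = serialize(node_labels, matrix)
print(text)
```

In a setup like this, the serialized `text` would replace the raw `context` in the prompt, so any drop in match rate reflects what the graph form fails to preserve (for example, exact word order beyond the window).
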
## Dataset Versions

| File | Description |
|------|-------------|
| `gow_qa.parquet` | Flat table (1 row per QA pair) - for easy loading |
| `gow_qa_full.parquet` | Full data with adjacency matrices - for graph research |

## Loading the Dataset

```python
# Basic loading from the Hugging Face Hub
# (repository ID taken from this page's download link)
from datasets import load_dataset

ds = load_dataset("jo-s-eph/gow-qa")

# Or load directly from a local parquet file
import pandas as pd

df = pd.read_parquet("gow_qa.parquet")
```

## Citation

If you use this dataset, please cite:

```
@article{gow-qa-2026,
  title={GoW-QA: A Graph-of-Words Question Answering Benchmark},
  author={},
  year={2026}
}
```

## License

This dataset is based on Wikipedia content, which is licensed under the **Creative Commons Attribution-Share-Alike License 3.0**. The dataset itself (graph structures, QA pairs, annotations) is made available under the same license.

## Limitations

- **Domain**: Primarily biographical/encyclopedic Wikipedia (limited to 497 articles)
- **Language**: English only
- **Graph Config**: Only `naive_w5` evaluated in current version
- **QA Pairs**: Generated by Gemma-4-31B (may contain minor errors)

## Future Work

- [ ] Expand to more Wikipedia domains (scientific, historical, technical)
- [ ] Evaluate additional GoW configurations (noun, nounChunks, different window sizes)
- [ ] Train GNN encoders on the dataset
- [ ] Multi-lingual extension
- [ ] Human-verified gold answers

## Contact

For questions, issues, or collaboration inquiries, please open a GitHub issue.

---

**Dataset Card created:** April 2026
Provider:
jo-s-eph