Wollaston/gelato
Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Wollaston/gelato
Resource description:
---
dataset_info:
- config_name: level1
  features:
  - name: id
    dtype: int64
  - name: tokens
    list: string
  - name: labels
    list: string
  splits:
  - name: train
    num_bytes: 1129704
    num_examples: 80
  - name: dev
    num_bytes: 345440
    num_examples: 21
  - name: test
    num_bytes: 460779
    num_examples: 30
  download_size: 1941799
  dataset_size: 1935923
- config_name: level2
  features:
  - name: id
    dtype: int64
  - name: tokens
    list: string
  - name: labels
    list: string
  splits:
  - name: train
    num_bytes: 1256647
    num_examples: 80
  - name: dev
    num_bytes: 381692
    num_examples: 21
  - name: test
    num_bytes: 521090
    num_examples: 30
  download_size: 2165351
  dataset_size: 2159429
configs:
- config_name: level1
  data_files:
  - split: train
    path: level1/train-*
  - split: dev
    path: level1/dev-*
  - split: test
    path: level1/test-*
- config_name: level2
  data_files:
  - split: train
    path: level2/train-*
  - split: dev
    path: level2/dev-*
  - split: test
    path: level2/test-*
pretty_name: The GELATO Dataset for Legislative NER
license: mit
task_categories:
- token-classification
language: en
---
# GELATO
This repository contains the data from "The GELATO Dataset for Legislative NER" (LREC 2026).
### Dataset Description
GELATO (Government, Executive, Legislative, and Treaty Ontology) is a dataset of U.S. House and Senate bills
from the 118th Congress annotated using a novel two-level named entity recognition ontology designed
for U.S. legislative texts.
- **Language:** English
- **License:** MIT
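The split sizes declared in the card metadata can be cross-checked with a little arithmetic. The sketch below copies the `num_examples`, `num_bytes`, and `dataset_size` values from the metadata into a plain dictionary and reports each split's share of examples. The names `CARD` and `summarize` are illustrative and not part of the dataset.

```python
# Split metadata copied from the dataset card: (num_examples, num_bytes).
# The variable and function names are illustrative only.
CARD = {
    "level1": {
        "splits": {"train": (80, 1129704), "dev": (21, 345440), "test": (30, 460779)},
        "dataset_size": 1935923,
    },
    "level2": {
        "splits": {"train": (80, 1256647), "dev": (21, 381692), "test": (30, 521090)},
        "dataset_size": 2159429,
    },
}


def summarize(config: str) -> dict:
    """Return total examples, total bytes, and per-split example shares."""
    splits = CARD[config]["splits"]
    total_examples = sum(n for n, _ in splits.values())
    total_bytes = sum(b for _, b in splits.values())
    shares = {name: n / total_examples for name, (n, _) in splits.items()}
    return {"examples": total_examples, "bytes": total_bytes, "shares": shares}


for config in CARD:
    summary = summarize(config)
    # The per-split byte counts should add up to the declared dataset_size.
    assert summary["bytes"] == CARD[config]["dataset_size"]
    rounded = {name: round(share, 3) for name, share in summary["shares"].items()}
    print(config, summary["examples"], rounded)
```

Both configurations declare 131 examples split 80/21/30 (roughly 61% / 16% / 23%), and the per-split byte counts sum to the declared `dataset_size` in each case.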
### Dataset Sources
- **Repository:** [GitHub](https://github.com/Wollaston/gelato)
- **Paper:** [The GELATO Dataset for Legislative NER](https://arxiv.org/abs/2603.14130)
## Uses
This dataset provides annotations under a two-level ontology to support NLP research on U.S. legislative data.
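A minimal sketch of loading one configuration with the Hugging Face `datasets` library follows. It assumes the repository id `Wollaston/gelato` (taken from the card's URL) and uses the config and split names from the card metadata. The helpers `load_gelato` and `token_label_pairs` are hypothetical names, and the sketch assumes that `tokens` and `labels` are aligned one-to-one, as is usual for token classification. The label strings in the example row are placeholders, not real GELATO tags.

```python
from typing import Iterator, Tuple

REPO_ID = "Wollaston/gelato"        # assumed from the card's URL
CONFIGS = ("level1", "level2")      # config names from the card metadata
SPLITS = ("train", "dev", "test")   # split names from the card metadata


def load_gelato(config: str = "level1", split: str = "train"):
    """Load one split; requires the `datasets` package and network access."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the offline helper below works without `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)


def token_label_pairs(row: dict) -> Iterator[Tuple[str, str]]:
    """Return an iterator of (token, label) pairs, assuming aligned lists."""
    tokens, labels = row["tokens"], row["labels"]
    if len(tokens) != len(labels):
        raise ValueError(
            f"row {row.get('id')}: {len(tokens)} tokens vs {len(labels)} labels"
        )
    return zip(tokens, labels)


# Hypothetical row for illustration; "O" is a placeholder label.
example_row = {"id": 0, "tokens": ["Be", "it", "enacted"], "labels": ["O", "O", "O"]}
print(list(token_label_pairs(example_row)))
```

With network access, `load_gelato("level2", "dev")` would fetch the development split of the second configuration, and each returned row can be passed to `token_label_pairs`.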
### Dataset Structure and Ontology
1. Person
   1. Individual
   2. Member
   3. Title
2. Organization
   1. Agency
   2. Association
   3. Committee
   4. International Institution
   5. Legislative Body
   6. Locality
   7. Nation
   8. State
3. Document
   1. Bill
   2. Code
   3. Parenthetical
   4. Reference
   5. Report
   6. Treaty
4. Abstraction
   1. Case
   2. Doctrine
   3. Fund
   4. Infrastructure
   5. Misc
   6. Program
   7. Session
   8. Specification
   9. System
5. Act
   1. Amendment
   2. Public Act
6. Class
   1. Non-Protected Class
   2. Protected Class
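The ontology above can be written down as a small lookup table, which is handy for mapping a second-level subtype back to its first-level type. This is only a sketch: the strings are the display names from this card, and the label strings actually stored in the `labels` column may be spelled or prefixed differently. The names `ONTOLOGY`, `PARENT`, and `coarsen` are illustrative.

```python
# Two-level ontology transcribed from the card. These are display names;
# the label strings stored in the dataset itself may differ.
ONTOLOGY = {
    "Person": ["Individual", "Member", "Title"],
    "Organization": [
        "Agency", "Association", "Committee", "International Institution",
        "Legislative Body", "Locality", "Nation", "State",
    ],
    "Document": ["Bill", "Code", "Parenthetical", "Reference", "Report", "Treaty"],
    "Abstraction": [
        "Case", "Doctrine", "Fund", "Infrastructure", "Misc", "Program",
        "Session", "Specification", "System",
    ],
    "Act": ["Amendment", "Public Act"],
    "Class": ["Non-Protected Class", "Protected Class"],
}

# Reverse index: second-level subtype -> first-level type.
PARENT = {sub: top for top, subs in ONTOLOGY.items() for sub in subs}


def coarsen(subtype: str) -> str:
    """Map a second-level subtype to its first-level type."""
    return PARENT[subtype]


print(len(ONTOLOGY), "types,", len(PARENT), "subtypes")
print(coarsen("Committee"))
```

The table holds 6 first-level types and 30 second-level subtypes, and every subtype name is unique, so the reverse index is unambiguous.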
### Source Data, Data Collection, and Processing
All bills in the GELATO dataset are publicly available U.S. government documents obtained via the
[congress.gov API](https://api.congress.gov/#/) and are therefore in the public
domain and not subject to copyright restrictions.
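For readers who want to retrieve comparable bills themselves, the sketch below builds a request URL for the congress.gov API's bill listing. It is not the authors' collection script. It assumes the v3 endpoint layout `/bill/{congress}/{billType}` and the `api_key`, `format`, `limit`, and `offset` query parameters; `bill_list_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.congress.gov/v3"  # assumed v3 base URL


def bill_list_url(congress: int, bill_type: str, api_key: str,
                  limit: int = 20, offset: int = 0) -> str:
    """Build a URL listing bills of one type (e.g. "hr", "s") for one Congress."""
    query = urlencode(
        {"api_key": api_key, "format": "json", "limit": limit, "offset": offset}
    )
    return f"{BASE_URL}/bill/{congress}/{bill_type}?{query}"


# Placeholder key; a real API key is required to issue the request.
print(bill_list_url(118, "hr", api_key="YOUR_API_KEY"))
```

The function only constructs the URL, so it can be inspected or logged before any request is sent with an HTTP client of choice.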
### Annotation process and annotators
The three graduate student authors annotated the data following best practices. See our paper for more details.
### Bias, Risks, and Limitations
Three graduate student annotators (the authors) with training in linguistics and NLP collaboratively created GELATO through a two-stage process with full adjudication of any disagreements. The annotation is descriptive: for example, the ontology includes Protected Class and Non-Protected Class subclasses that are consistent with U.S. anti-discrimination law definitions.
GELATO can support beneficial applications, including legislative tracking, policy analysis, and government transparency initiatives. However, automated entity extraction could also enable potentially harmful uses, such as targeted analysis of how specific groups are referenced in legislation or identification of individual legislators for inappropriate purposes.
## Citation
**BibTeX:**
```
@misc{flynn2026gelatodatasetlegislativener,
title={The GELATO Dataset for Legislative NER},
author={Matthew Flynn and Timothy Obiso and Sam Newman},
year={2026},
eprint={2603.14130},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.14130},
}
```
## Dataset Card Authors
Matthew Flynn ([Wollaston](https://huggingface.co/Wollaston))
## Dataset Card Contact
Matthew Flynn ([Wollaston](https://huggingface.co/Wollaston))
Provider:
Wollaston



