
Wollaston/gelato

Hugging Face | Updated 2026-03-27 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Wollaston/gelato
Resource description:
---
dataset_info:
- config_name: level1
  features:
  - name: id
    dtype: int64
  - name: tokens
    list: string
  - name: labels
    list: string
  splits:
  - name: train
    num_bytes: 1129704
    num_examples: 80
  - name: dev
    num_bytes: 345440
    num_examples: 21
  - name: test
    num_bytes: 460779
    num_examples: 30
  download_size: 1941799
  dataset_size: 1935923
- config_name: level2
  features:
  - name: id
    dtype: int64
  - name: tokens
    list: string
  - name: labels
    list: string
  splits:
  - name: train
    num_bytes: 1256647
    num_examples: 80
  - name: dev
    num_bytes: 381692
    num_examples: 21
  - name: test
    num_bytes: 521090
    num_examples: 30
  download_size: 2165351
  dataset_size: 2159429
configs:
- config_name: level1
  data_files:
  - split: train
    path: level1/train-*
  - split: dev
    path: level1/dev-*
  - split: test
    path: level1/test-*
- config_name: level2
  data_files:
  - split: train
    path: level2/train-*
  - split: dev
    path: level2/dev-*
  - split: test
    path: level2/test-*
pretty_name: The GELATO Dataset for Legislative NER
license: mit
task_categories:
- token-classification
language: en
---

# GELATO

This repo contains the data from "The GELATO Dataset for Legislative NER" (LREC 2026).

### Dataset Description

GELATO (Government, Executive, Legislative, and Treaty Ontology) is a dataset of U.S. House and Senate bills from the 118th Congress, annotated using a novel two-level named entity recognition ontology designed for U.S. legislative texts.

- **Language:** English
- **License:** MIT

### Dataset Sources

- **Repository:** [GitHub](https://github.com/Wollaston/gelato)
- **Paper:** [The GELATO Dataset for Legislative NER](https://arxiv.org/abs/2603.14130)

## Uses

This dataset contains a two-level ontology to support NLP research on U.S. legislative data.

### Dataset Structure and Ontology

1. Person
   1. Individual
   2. Member
   3. Title
2. Organization
   1. Agency
   2. Association
   3. Committee
   4. International Institution
   5. Legislative Body
   6. Locality
   7. Nation
   8. State
3. Document
   1. Bill
   2. Code
   3. Parenthetical
   4. Reference
   5. Report
   6. Treaty
4. Abstraction
   1. Case
   2. Doctrine
   3. Fund
   4. Infrastructure
   5. Misc
   6. Program
   7. Session
   8. Specification
   9. System
5. Act
   1. Amendment
   2. Public Act
6. Class
   1. Non-Protected Class
   2. Protected Class

### Source Data, Data Collection, and Processing

All bills in the GELATO dataset are publicly available U.S. government documents obtained via the [congress.gov API](https://api.congress.gov/#/) and are therefore in the public domain and not subject to copyright restrictions.

### Annotation process and annotators

The three graduate student authors annotated the data following best practices. See our paper for more details.

### Bias, Risks, and Limitations

Three graduate student annotators (the authors) with training in linguistics and NLP collaboratively created GELATO through a two-stage process with full adjudication of any disagreements. This is a descriptive annotation; for example, this ontology includes Protected Class and Non-Protected Class subclasses that are consistent with U.S. anti-discrimination law definitions.

GELATO can support beneficial applications including legislative tracking, policy analysis, and government transparency initiatives. However, automated entity extraction could also enable potentially harmful uses, such as targeted analysis of how specific groups are referenced in legislation or identification of individual legislators for inappropriate purposes.

## Citation

**BibTeX:**

```
@misc{flynn2026gelatodatasetlegislativener,
  title={The GELATO Dataset for Legislative NER},
  author={Matthew Flynn and Timothy Obiso and Sam Newman},
  year={2026},
  eprint={2603.14130},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.14130},
}
```

## Dataset Card Authors

Matthew Flynn ([Wollaston](https://huggingface.co/Wollaston))

## Dataset Card Contact

Matthew Flynn ([Wollaston](https://huggingface.co/Wollaston))
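
## Example: loading the data and navigating the ontology

The sketch below is not part of the original card. It shows one way to load a config and split listed in the card metadata and to map a level-2 subclass back to its level-1 class. The repository id, config names, and split names come from the card. The helper names `load_gelato` and `parent_of` are hypothetical, and the label strings stored in the dataset may not match the ontology names verbatim (for example, they may carry tagging-scheme prefixes), so inspect a few examples before relying on the mapping.

```python
from typing import Dict, List

# Repository id, config names, and split names as listed in the card metadata.
REPO_ID = "Wollaston/gelato"
CONFIGS = ("level1", "level2")
SPLITS = ("train", "dev", "test")

# Two-level ontology transcribed from the card:
# level-1 class -> list of level-2 subclasses.
ONTOLOGY: Dict[str, List[str]] = {
    "Person": ["Individual", "Member", "Title"],
    "Organization": [
        "Agency", "Association", "Committee", "International Institution",
        "Legislative Body", "Locality", "Nation", "State",
    ],
    "Document": ["Bill", "Code", "Parenthetical", "Reference", "Report", "Treaty"],
    "Abstraction": [
        "Case", "Doctrine", "Fund", "Infrastructure", "Misc",
        "Program", "Session", "Specification", "System",
    ],
    "Act": ["Amendment", "Public Act"],
    "Class": ["Non-Protected Class", "Protected Class"],
}


def parent_of(subclass: str) -> str:
    """Return the level-1 class containing the given level-2 subclass.

    Hypothetical helper: it assumes the subclass name is spelled exactly
    as in the ontology list of the card.
    """
    for parent, children in ONTOLOGY.items():
        if subclass in children:
            return parent
    raise KeyError(f"unknown level-2 subclass: {subclass}")


def load_gelato(config: str = "level1", split: str = "train"):
    """Load one split of one config from the Hugging Face Hub.

    Hypothetical wrapper around `datasets.load_dataset`; it requires the
    `datasets` package and network access.
    """
    if config not in CONFIGS:
        raise ValueError(f"config must be one of {CONFIGS}, got {config!r}")
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    # Imported lazily so the ontology helpers above work without `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)
```

Calling `load_gelato("level2", "dev")` should return the 21 dev examples listed in the metadata, each with `id`, `tokens`, and `labels` fields. Keeping the import inside the function lets the ontology lookup run offline.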
Provider:
Wollaston