
jiayi-li23/proxann_data

Hugging Face | Updated 2026-04-03 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/jiayi-li23/proxann_data
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: bills_train
    path: bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet
  - split: bills_test
    path: bills_test.metadata.parquet
  - split: wiki_train
    path: wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet
  - split: wiki_test
    path: wiki_test.metadata.parquet
datasets:
- lcalvobartolome/proxann_data
language:
- en
license: mit
pretty_name: PROXANN Data
size_categories:
- 10K<n<100K
tags:
- parquet
- text
- topic-modeling
- bills
- proxann
- english
---

# PROXANN Data

**PROXANN Data** provides the corpora used for training and evaluating topic models in **[PROXANN: Use-Oriented Evaluations of Topic Models and Document Clustering](https://aclanthology.org/2025.acl-long.772/)** (Hoyle *et al.*, ACL 2025).

This repository contains two datasets, **Bills** and **Wiki**, each with a **training** split (with contextualized embeddings) and a **test** split (metadata only).

---

## Structure

| Split | File | Rows | Description |
| ------|------|------:|-------------|
| `bills_train` | `bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet` | 32,661 | Congressional bills with summaries, topics, and 384-dim embeddings. |
| `bills_test` | `bills_test.metadata.parquet` | 15,242 | Bills test split without embeddings (metadata only). |
| `wiki_train` | `wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet` | 14,290 | Wikipedia articles with categories and 384-dim embeddings. |
| `wiki_test` | `wiki_test.metadata.parquet` | 8,024 | Wikipedia test split without embeddings (metadata only). |

---

## Columns

### Bills (`bills_train` / `bills_test`)

| Column | Type | Description |
| ------- | ---- | ----------- |
| `id` | string | Unique identifier. |
| `summary` | string | Short summary of the bill. |
| `topic` | string | Primary topic label. |
| `subtopic` | string | Secondary topic label. |
| `subjects_top_term` | string | Top subject term for the bill. |
| `date` | string | Document date (ISO-8601 format). |
| `tokenized_text` | list[string] | Preprocessed tokens from Hoyle et al. (2022), 15k vocabulary. |
| `embeddings` | list[float] (384) | Sentence embedding (MiniLM-L6-v2). *Absent in test split.* |

### Wiki (`wiki_train` / `wiki_test`)

| Column | Type | Description |
| ------- | ---- | ----------- |
| `id` | string | Unique identifier. |
| `text` | string | Article text (raw or normalized). |
| `supercategory` | string | High-level category. |
| `category` | string | Primary category. |
| `subcategory` | string | Secondary category. |
| `page_name` | string | Wikipedia page title. |
| `tokenized_text` | list[string] | Preprocessed tokens from Hoyle et al. (2022), 15k vocabulary. |
| `embeddings` | list[float] (384) | Sentence embedding (MiniLM-L6-v2). *Absent in test split.* |

## Vocabularies

The dataset includes the **15k-token vocabularies** used during preprocessing and model training. Each file is a JSON mapping of **token -> integer index** (0–14,999).

| File | Description |
|------|-------------|
| `data_with_embeddings/vocabs/bills_vocab.json` | Vocabulary for the Bills corpus. Keys are tokens, values are integer indices. |
| `data_with_embeddings/vocabs/wiki_vocab.json` | Vocabulary for the Wiki corpus. Keys are tokens, values are integer indices. |

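
To illustrate how a token -> index vocabulary relates to the `tokenized_text` column, the sketch below turns one document's tokens into sparse bag-of-words counts. It is a minimal example under stated assumptions, not the preprocessing used by the authors: `load_vocab`, `bag_of_words`, `toy_vocab`, and `toy_tokens` are hypothetical names and values introduced here, and dropping tokens that are missing from the vocabulary is an assumption.

```python
import json
from collections import Counter


def load_vocab(path):
    """Read a token -> integer index mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def bag_of_words(tokens, vocab):
    """Count vocabulary indices for one document.

    Tokens that are not in the vocabulary are skipped (an assumption
    made for this sketch).
    """
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))


# Hypothetical toy vocabulary standing in for bills_vocab.json or wiki_vocab.json.
toy_vocab = {"health": 0, "care": 1, "tax": 2, "energy": 3}

# Hypothetical `tokenized_text` value for a single document.
toy_tokens = ["health", "care", "tax", "health", "unseen_token"]

bow = bag_of_words(toy_tokens, toy_vocab)
print(bow)  # {0: 2, 1: 1, 2: 1}
```

With a local copy of one of the vocabulary files, `load_vocab("bills_vocab.json")` would return a mapping that can be passed in place of `toy_vocab`.
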
## Usage Example

The dataset contains four Parquet files:

- `bills_train`
- `bills_test`
- `wiki_train`
- `wiki_test`

Because the Bills and Wiki splits use different schemas, you should load each split directly from its Parquet file using the generic `parquet` loader from 🤗 Datasets:

```python
from datasets import load_dataset

# ------------------------------
# Bills Dataset
# ------------------------------
bills_train = load_dataset(
    "parquet",
    data_files={
        "train": "hf://datasets/lcalvobartolome/proxann_data@main/"
                 "bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet"
    },
    split="train",
)
print("Bills train size:", len(bills_train))  # 32661

bills_test = load_dataset(
    "parquet",
    data_files={
        "test": "hf://datasets/lcalvobartolome/proxann_data@main/"
                "bills_test.metadata.parquet"
    },
    split="test",
)
print("Bills test size:", len(bills_test))  # 15242

# ------------------------------
# Wiki Dataset
# ------------------------------
wiki_train = load_dataset(
    "parquet",
    data_files={
        "train": "hf://datasets/lcalvobartolome/proxann_data@main/"
                 "wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet"
    },
    split="train",
)
print("Wiki train size:", len(wiki_train))  # 14290

wiki_test = load_dataset(
    "parquet",
    data_files={
        "test": "hf://datasets/lcalvobartolome/proxann_data@main/"
                "wiki_test.metadata.parquet"
    },
    split="test",
)
print("Wiki test size:", len(wiki_test))  # 8024
```

## Related Resources

* [PROXANN GitHub Repository](https://github.com/ahoho/proxann)
* [Are Neural Topic Models Broken? (Hoyle et al., 2022)](https://aclanthology.org/2022.findings-emnlp.390/)
* [Bills Dataset (Adler & Wilkerson, 2008)](http://www.congressionalbills.org)
* [WikiText Dataset (Merity et al., 2017)](https://arxiv.org/abs/1609.07843)

---

## License & Attribution

Released under the **MIT License**. Text content derives from **Wikipedia** (*Merity et al., 2017*) and the **Congressional Bills Project** (*Adler & Wilkerson, 2008*).
Please provide attribution when reusing these materials.

---

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{hoyle-etal-2025-proxann,
    title = "{P}rox{A}nn: Use-Oriented Evaluations of Topic Models and Document Clustering",
    author = "Hoyle, Alexander Miserlis and Calvo-Bartolom{\'e}, Lorena and Boyd-Graber, Jordan Lee and Resnik, Philip",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.772/",
    doi = "10.18653/v1/2025.acl-long.772",
    pages = "15872--15897",
    ISBN = "979-8-89176-251-0",
    abstract = "Topic models and document-clustering evaluations either use automated metrics that align poorly with human preferences, or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators{---}or an LLM-based proxy{---}review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxy is statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations."
}
```
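
As a small illustration of what the `embeddings` column of the training splits can be used for, the sketch below finds the candidate vector closest to a query vector by cosine similarity. It is only a sketch under stated assumptions: the function names are hypothetical, the vectors are toy 3-dimensional stand-ins for the 384-dimensional embeddings, and treating a zero vector as having similarity 0 is a choice made here, not something the dataset prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for rows of the `embeddings` column.
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

best_index, best_score = most_similar(query, candidates)
print(best_index, round(best_score, 3))  # 1 1.0
```

For the full training splits, a vectorized library would be a more practical choice than these pure-Python loops.
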
Provider:
jiayi-li23