jiayi-li23/proxann_data
Hugging Face | Updated 2026-04-03 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/jiayi-li23/proxann_data
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: bills_train
    path: bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet
  - split: bills_test
    path: bills_test.metadata.parquet
  - split: wiki_train
    path: wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet
  - split: wiki_test
    path: wiki_test.metadata.parquet
datasets:
- lcalvobartolome/proxann_data
language:
- en
license: mit
pretty_name: PROXANN Data
size_categories:
- 10K<n<100K
tags:
- parquet
- text
- topic-modeling
- bills
- proxann
- english
---
# PROXANN Data
**PROXANN Data** provides the corpora used for training and evaluating topic models in
**[PROXANN: Use-Oriented Evaluations of Topic Models and Document Clustering](https://aclanthology.org/2025.acl-long.772/)**
(Hoyle *et al.*, ACL 2025).
This repository contains two datasets, **Bills** and **Wiki**, each with a **training** split (with contextualized embeddings) and a **test** split (metadata only).
---
## Structure
| Split | File | Rows | Description |
| ------|------|------:|-------------|
| `bills_train` | `bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet` | 32,661 | Congressional bills with summaries, topics, and 384-dim embeddings. |
| `bills_test` | `bills_test.metadata.parquet` | 15,242 | Bills test split without embeddings (metadata only). |
| `wiki_train` | `wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet` | 14,290 | Wikipedia articles with categories and 384-dim embeddings. |
| `wiki_test` | `wiki_test.metadata.parquet` | 8,024 | Wikipedia test split without embeddings (metadata only). |
---
## Columns
### Bills (`bills_train` / `bills_test`)
| Column | Type | Description |
| ------- | ---- | ----------- |
| `id` | string | Unique identifier. |
| `summary` | string | Short summary of the bill. |
| `topic` | string | Primary topic label. |
| `subtopic` | string | Secondary topic label. |
| `subjects_top_term` | string | Top subject term for the bill. |
| `date` | string | Document date (ISO-8601 format). |
| `tokenized_text` | list[string] | Preprocessed tokens from Hoyle et al. (2022), 15k vocabulary. |
| `embeddings` | list[float] (384) | Sentence embedding (all-MiniLM-L6-v2). *Absent in test split.* |
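
Since `date` is stored as a string, it has to be parsed before any time-based grouping. The snippet below is a minimal sketch that counts documents per year; it assumes the values are accepted by Python's `datetime.fromisoformat`, and the `toy_dates` list and the `year_of` helper are hypothetical stand-ins, not values or code from the dataset.

```python
from collections import Counter
from datetime import datetime


def year_of(date_string):
    """Parse an ISO-8601 date or datetime string and return its year."""
    return datetime.fromisoformat(date_string).year


# Hypothetical values standing in for the `date` column of a Bills split.
toy_dates = ["2009-01-06", "2009-03-15", "2010-07-01"]

per_year = Counter(year_of(d) for d in toy_dates)
print(dict(per_year))  # {2009: 2, 2010: 1}
```

With a loaded split, the same counter could be built from `bills_train["date"]` instead of `toy_dates`.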
### Wiki (`wiki_train` / `wiki_test`)
| Column | Type | Description |
| ------- | ---- | ----------- |
| `id` | string | Unique identifier. |
| `text` | string | Article text (raw or normalized). |
| `supercategory` | string | High-level category. |
| `category` | string | Primary category. |
| `subcategory` | string | Secondary category. |
| `page_name` | string | Wikipedia page title. |
| `tokenized_text` | list[string] | Preprocessed tokens from Hoyle et al. (2022), 15k vocabulary. |
| `embeddings` | list[float] (384) | Sentence embedding (all-MiniLM-L6-v2). *Absent in test split.* |
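
The `embeddings` column in either training split can be used to compare documents directly. As a rough illustration, the sketch below computes cosine similarity between two vectors in plain Python. The three-dimensional toy vectors are hypothetical placeholders for the 384-dimensional embeddings, and `cosine_similarity` is a helper written for this example rather than part of the dataset or of PROXANN.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real rows hold 384 floats each.
doc_a = [1.0, 0.0, 0.0]
doc_b = [0.0, 1.0, 0.0]

print(cosine_similarity(doc_a, doc_a))  # 1.0
print(cosine_similarity(doc_a, doc_b))  # 0.0
```

For more than a handful of documents, a vectorized library would be the more practical choice; the pure-Python version is only meant to show the computation.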
## Vocabularies
The dataset includes the **15k-token vocabularies** used during preprocessing and model training.
Each file is a JSON mapping of **token -> integer index** (0–14,999).
| File | Description |
|------|-------------|
| `data_with_embeddings/vocabs/bills_vocab.json` | Vocabulary for the Bills corpus. Keys are tokens, values are integer indices. |
| `data_with_embeddings/vocabs/wiki_vocab.json` | Vocabulary for the Wiki corpus. Keys are tokens, values are integer indices. |
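
Given the token-to-index mapping described above, a document's `tokenized_text` can be turned into a sparse bag-of-words representation. The following is a minimal sketch, not the preprocessing used in the paper: `load_vocab` and `to_bow` are hypothetical helpers, `toy_vocab` is a made-up three-token vocabulary, and dropping out-of-vocabulary tokens is an assumption of this example.

```python
import json
from collections import Counter


def load_vocab(path):
    """Read a token -> integer index mapping from a vocabulary JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_bow(tokens, vocab):
    """Count in-vocabulary tokens, keyed by their integer index."""
    return Counter(vocab[t] for t in tokens if t in vocab)


# Hypothetical stand-in for a real vocabulary file such as bills_vocab.json.
toy_vocab = {"tax": 0, "health": 1, "energy": 2}

bow = to_bow(["tax", "health", "tax", "unseen_token"], toy_vocab)
print(dict(bow))  # {0: 2, 1: 1}
```

In practice, `toy_vocab` would be replaced by `load_vocab(...)` pointed at a local copy of one of the vocabulary files listed in the table.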
## Usage Example
The dataset contains four Parquet files, one per split:
- `bills_train`
- `bills_test`
- `wiki_train`
- `wiki_test`
Because the Bills and Wiki splits use different schemas, you should load each split
directly from its Parquet file using the generic `parquet` loader from 🤗 Datasets:
```python
from datasets import load_dataset

# ------------------------------
# Bills Dataset
# ------------------------------
bills_train = load_dataset(
    "parquet",
    data_files={
        "train": "hf://datasets/lcalvobartolome/proxann_data@main/"
                 "bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet"
    },
    split="train",
)
print("Bills train size:", len(bills_train))  # 32661

bills_test = load_dataset(
    "parquet",
    data_files={
        "test": "hf://datasets/lcalvobartolome/proxann_data@main/"
                "bills_test.metadata.parquet"
    },
    split="test",
)
print("Bills test size:", len(bills_test))  # 15242

# ------------------------------
# Wiki Dataset
# ------------------------------
wiki_train = load_dataset(
    "parquet",
    data_files={
        "train": "hf://datasets/lcalvobartolome/proxann_data@main/"
                 "wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet"
    },
    split="train",
)
print("Wiki train size:", len(wiki_train))  # 14290

wiki_test = load_dataset(
    "parquet",
    data_files={
        "test": "hf://datasets/lcalvobartolome/proxann_data@main/"
                "wiki_test.metadata.parquet"
    },
    split="test",
)
print("Wiki test size:", len(wiki_test))  # 8024
```
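
The four calls above differ only in the file name, so the repetition can be factored out. The sketch below is one possible way to do that: `REPO`, `FILES`, `split_url`, and `load_split` are names introduced for this example, while the repository path and file names are copied from the snippet above. The `datasets` import is deferred so that the URL helper can be used without the library installed.

```python
# Repository prefix and file names as used in the loading example above.
REPO = "hf://datasets/lcalvobartolome/proxann_data@main/"
FILES = {
    "bills_train": "bills_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet",
    "bills_test": "bills_test.metadata.parquet",
    "wiki_train": "wiki_train.metadata.embeddings.jsonl.all-MiniLM-L6-v2.parquet",
    "wiki_test": "wiki_test.metadata.parquet",
}


def split_url(name):
    """Return the hf:// URL of the Parquet file backing the given split."""
    return REPO + FILES[name]


def load_split(name):
    """Load one split with the generic parquet loader (needs network access)."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset("parquet", data_files={name: split_url(name)}, split=name)


print(split_url("wiki_test"))
# hf://datasets/lcalvobartolome/proxann_data@main/wiki_test.metadata.parquet
```

Calling `load_split("bills_train")` should then be equivalent to the first call in the longer example, with the split keyed by its own name instead of `"train"`.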
## Related Resources
* [PROXANN GitHub Repository](https://github.com/ahoho/proxann)
* [Are Neural Topic Models Broken? (Hoyle et al., 2022)](https://aclanthology.org/2022.findings-emnlp.390/)
* [Bills Dataset — Adler & Wilkerson (2008)](http://www.congressionalbills.org)
* [WikiText Dataset — Merity et al. (2017)](https://arxiv.org/abs/1609.07843)
---
## License & Attribution
Released under the **MIT License**.
Text content derives from **Wikipedia** (*Merity et al., 2017*) and the **Congressional Bills Project** (*Adler & Wilkerson, 2008*).
Please provide attribution when reusing these materials.
---
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{hoyle-etal-2025-proxann,
title = "{P}rox{A}nn: Use-Oriented Evaluations of Topic Models and Document Clustering",
author = "Hoyle, Alexander Miserlis and
Calvo-Bartolom{\'e}, Lorena and
Boyd-Graber, Jordan Lee and
Resnik, Philip",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.772/",
doi = "10.18653/v1/2025.acl-long.772",
pages = "15872--15897",
ISBN = "979-8-89176-251-0",
abstract = "Topic models and document-clustering evaluations either use automated metrics that align poorly with human preferences, or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators{---}or an LLM-based proxy{---}review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxy is statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations."
}
```
Provided by:
jiayi-li23



