KRLabsOrg/acl-anthology-md

Name: KRLabsOrg/acl-anthology-md
Creator: KRLabsOrg
Published: 2026-04-16 15:24:04
License: cc-by-4.0

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/KRLabsOrg/acl-anthology-md


Description:

---
license: cc-by-4.0
task_categories:
- text-retrieval
- question-answering
- text-generation
language:
- en
pretty_name: ACL Anthology Markdown Corpus
size_categories:
- 100K<n<1M
modalities:
- text
tags:
- acl-anthology
- scientific-papers
- rag
- retrieval
configs:
- config_name: metadata
  data_files:
  - split: train
    path: metadata/train-*
- config_name: fulltext
  data_files:
  - split: train
    path: fulltext/train-*
---

# ACL Anthology Markdown Corpus

A snapshot of the [ACL Anthology](https://aclanthology.org/) consisting of bibliographic metadata for **120,034** papers and full-text **markdown conversions of 114,484 papers** (≈95% of the catalogue; the remainder are frontmatter, abstract-only entries, or papers without an available PDF).

This corpus is the document collection used by [ACL-Verbatim](https://github.com/krlabsorg/acl-verbatim), a hallucination-free question-answering system for NLP research papers built on top of [VerbatimRAG](https://github.com/KRLabsOrg/verbatim-rag).

## Configurations

The dataset ships as **two configs** that share the join key `anthology_id` (e.g. `2023.acl-long.42`). Load only what you need: the `metadata` config is small, while the `fulltext` config is large (5.10 GB of raw markdown).

```python
from datasets import load_dataset

# Each config is loaded separately; both expose a single "train" split.
meta = load_dataset("KRLabsOrg/acl-anthology-md", "metadata", split="train")
full = load_dataset("KRLabsOrg/acl-anthology-md", "fulltext", split="train")

# Join on anthology_id when you want both.
# Note: this index keeps every fulltext row in memory.
by_id = {row["anthology_id"]: row for row in full}
for paper in meta.filter(lambda r: r["has_markdown"]):
    md = by_id[paper["anthology_id"]]["markdown"]
```

### `metadata` (1 row per paper)

All 120,034 papers from the ACL Anthology, including frontmatter and abstract-only entries.

| field | type | notes |
|-------|------|-------|
| `anthology_id` | string | e.g. `2023.acl-long.42`. Join key. |
| `paper_id` | string | Internal Anthology numeric id. |
| `bibkey`, `bibtype`, `bibtex` | string | BibTeX key, entry type, and full record. |
| `title`, `title_html`, `title_raw` | string | Cleaned, HTML, and raw forms of the title. |
| `author` | list<struct> | `{id, first, last, full}` per author. |
| `author_string`, `editor` | string / list | Author display string; editors for proceedings. |
| `url`, `pdf`, `thumbnail`, `doi` | string | Anthology page, PDF, thumbnail, DOI. |
| `citation`, `citation_acl` | string | Markdown and ACL-style citations. |
| `booktitle`, `parent_volume_id`, `year`, `venue` | string / list | Venue metadata. `venue` is a list of slugs (e.g. `["acl"]`). |
| `pages`, `page_first`, `page_last` | string | Pagination, when reported. |
| `abstract_html`, `abstract_raw` | string | Available for ~72k papers. |
| `language`, `attachment` | string / list | Non-English language flag; supplementary attachments. |
| `ingest_date` | string | ISO date the record entered the Anthology. |
| `has_markdown` | bool | Convenience flag: true iff `fulltext` contains this `anthology_id`. |

### `fulltext` (1 row per converted paper)

| field | type | notes |
|-------|------|-------|
| `anthology_id` | string | Join key against `metadata`. |
| `markdown` | string | Full paper body in markdown, produced by [docling](https://github.com/DS4SD/docling). |

## Corpus statistics

| statistic | value |
|---|---|
| Papers in metadata | 120,034 |
| Papers with full-text markdown | 114,484 (95.4%) |
| Year range | 1952 – 2026 |
| Distinct venues | 500 |
| Papers with abstract | 71,902 |
| Mean authors per paper | 3.7 |
| Total markdown size | 5.10 GB raw |
| Markdown per paper (median / p90 / p99) | 37 KB / 74 KB / 162 KB |

**Top venues by paper count:** `acl` (13,664), `emnlp` (11,525), `ws` (10,714), `findings` (10,519), `lrec` (9,105), `coling` (8,701), `naacl` (5,458), `ijcnlp` (3,871), `semeval` (3,330), `jeptalnrecital` (2,766).

**Recent years:** 2025 (14,577), 2024 (12,098), 2023 (9,032), 2022 (8,649), 2021 (7,148).

## How the corpus was built

1. **Metadata extraction.** Metadata is extracted from a local checkout of [`acl-org/acl-anthology`](https://github.com/acl-org/acl-anthology) using its `Anthology` Python API plus the repository's own `create_hugo_data.paper_to_dict`, which flattens each paper into a JSON record. The output is one JSONL row per paper. See `scripts/get_anthology_metadata.py` in the [acl-verbatim](https://github.com/krlabsorg/acl-verbatim) repository.
2. **PDF download.** PDFs are obtained via the `acl-anthology` repository's standard download tooling. Per the Anthology's request, we do not redistribute PDFs; only the docling-converted markdown is included here.
3. **PDF → markdown conversion.** Each PDF is converted with [docling](https://github.com/DS4SD/docling)'s `DocumentConverter` in batched mode (`doc_batch_size=512`, `page_batch_size=1024`), exporting via `document.export_to_markdown()`. Conversion was run on a single A100 GPU. A small skip list of papers is excluded because they segfault docling or hang during conversion; these papers are still present in `metadata` with `has_markdown=False`. See `scripts/preprocess_acl.py`.
4. **Dataset assembly.** `scripts/build_corpus_dataset.py` walks the markdown directory, joins each file to its metadata record by `anthology_id` (derived from the paper URL), normalizes the metadata schema, and writes both configs.

The conversion is automated and not manually validated, so markdown quality varies with the underlying PDF: older scanned papers, complex tables, and math-heavy papers may convert imperfectly. Treat the output as a strong starting point for chunking and retrieval, not as a verbatim transcription.

## Intended uses

- Retrieval-augmented generation over NLP research literature.
- Training and evaluating extractive QA / citation-grounding systems on scientific text.
- Bibliometric and meta-research studies of the NLP community.

## Citation

A paper describing ACL-Verbatim is in preparation; citation details will be added here once available.
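
## Example: heading-based chunking (sketch)

The build notes describe the markdown as a starting point for chunking and retrieval. The sketch below shows one possible way to split a `markdown` string into per-section chunks using only the standard library. It is not the chunker used by ACL-Verbatim; the function name `chunk_by_heading` and the `max_chars` budget are illustrative assumptions, and the splitter deliberately ignores edge cases such as `#` lines inside fenced code.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "## Introduction" (one to six '#' characters).
# Assumption: docling output marks sections this way; lines starting with '#'
# inside fenced code would be misread as headings by this simple sketch.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> List[Dict[str, str]]:
    """Split markdown into chunks, each tagged with its nearest preceding heading.

    `max_chars` is a hypothetical budget, not a value from the dataset card.
    """
    chunks: List[Dict[str, str]] = []
    heading = ""              # "" until the first heading is seen
    buffer: List[str] = []    # body lines of the current section

    def flush() -> None:
        text = "\n".join(buffer).strip()
        # Emit the section in slices of at most max_chars characters.
        for start in range(0, len(text), max_chars):
            chunks.append({"heading": heading, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()                     # close the previous section
            heading = match.group(2)    # start a new one
        else:
            buffer.append(line)
    flush()                             # close the final section
    return chunks
```

Sections with an empty body produce no chunk, and long sections are cut at fixed character offsets rather than at sentence boundaries; a production chunker would likely want something gentler.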
## Acknowledgements

This work would not have been possible without the [ACL Anthology](https://aclanthology.org/) and the maintainers of the [acl-anthology](https://github.com/acl-org/acl-anthology) repository, whose tooling and permissive policies enabled both the metadata extraction and the PDF-to-markdown conversion at scale.
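
## Example: joining a subset without indexing all of `fulltext` (sketch)

The join snippet under Configurations indexes every `fulltext` row by `anthology_id`, which may keep the whole markdown corpus in memory. A lighter variant, sketched here on toy rows, first selects the wanted ids from `metadata` and then makes a single pass over `fulltext`. The helper names `select_ids` and `join_fulltext` are hypothetical, the sample rows are made up for illustration, and the sketch assumes `year` is stored as a string, as the field table suggests.

```python
from typing import Dict, Iterable, Iterator, Mapping, Tuple


def select_ids(meta_rows: Iterable[Mapping], year: str) -> Dict[str, Mapping]:
    """Index the metadata rows for one year that are flagged as having markdown."""
    return {
        row["anthology_id"]: row
        for row in meta_rows
        if row["has_markdown"] and row["year"] == year
    }


def join_fulltext(
    wanted: Mapping[str, Mapping], full_rows: Iterable[Mapping]
) -> Iterator[Tuple[Mapping, str]]:
    """Yield (metadata_row, markdown) pairs for the wanted ids only."""
    for row in full_rows:
        meta = wanted.get(row["anthology_id"])
        if meta is not None:
            yield meta, row["markdown"]


# Toy rows standing in for the two configs (illustrative, not real records).
meta_rows = [
    {"anthology_id": "2023.acl-long.42", "year": "2023", "has_markdown": True},
    {"anthology_id": "2023.acl-long.0", "year": "2023", "has_markdown": False},
    {"anthology_id": "2022.emnlp-main.7", "year": "2022", "has_markdown": True},
]
full_rows = [
    {"anthology_id": "2023.acl-long.42", "markdown": "## Abstract\n..."},
    {"anthology_id": "2022.emnlp-main.7", "markdown": "## Introduction\n..."},
]

wanted = select_ids(meta_rows, "2023")
for meta, md in join_fulltext(wanted, full_rows):
    print(meta["anthology_id"], len(md))
```

Because both helpers only need an iterable of dict-like rows, they should also accept rows from `load_dataset(..., streaming=True)`, which avoids downloading the full `fulltext` config up front.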

Provider:

KRLabsOrg
