WINGNUS/ACL-OCL
收藏Hugging Face2023-09-21 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/WINGNUS/ACL-OCL
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators: []
language:
- en
language_creators:
- found
license:
- mit
multilinguality:
- monolingual
paperswithcode_id: acronym-identification
pretty_name: acl-ocl-corpus
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- research papers
- acl
task_categories:
- token-classification
task_ids: []
train-eval-index:
- col_mapping:
labels: tags
tokens: tokens
config: default
splits:
eval_split: test
task: token-classification
task_id: entity_extraction
---
# Dataset Card for ACL Anthology Corpus
[](https://creativecommons.org/licenses/by-nc-sa/4.0/)
This repository provides full-text and metadata to the ACL anthology collection (80k articles/posters as of September 2022) also including .pdf files and grobid extractions of the pdfs.
## How is this different from what ACL anthology provides and what already exists?
- We provide pdfs, full-text, references and other details extracted by grobid from the PDFs while [ACL Anthology](https://aclanthology.org/anthology+abstracts.bib.gz) only provides abstracts.
- There exists a similar corpus call [ACL Anthology Network](https://clair.eecs.umich.edu/aan/about.php) but is now showing its age with just 23k papers from Dec 2016.
```python
>>> import pandas as pd
>>> df = pd.read_parquet('acl-publication-info.74k.parquet')
>>> df
acl_id abstract full_text corpus_paper_id pdf_hash ... number volume journal editor isbn
0 O02-2002 There is a need to measure word similarity whe... There is a need to measure word similarity whe... 18022704 0b09178ac8d17a92f16140365363d8df88c757d0 ... None None None None None
1 L02-1310 8220988 8d5e31610bc82c2abc86bc20ceba684c97e66024 ... None None None None None
2 R13-1042 Thread disentanglement is the task of separati... Thread disentanglement is the task of separati... 16703040 3eb736b17a5acb583b9a9bd99837427753632cdb ... None None None None None
3 W05-0819 In this paper, we describe a word alignment al... In this paper, we describe a word alignment al... 1215281 b20450f67116e59d1348fc472cfc09f96e348f55 ... None None None None None
4 L02-1309 18078432 011e943b64a78dadc3440674419821ee080f0de3 ... None None None None None
... ... ... ... ... ... ... ... ... ... ... ...
73280 P99-1002 This paper describes recent progress and the a... This paper describes recent progress and the a... 715160 ab17a01f142124744c6ae425f8a23011366ec3ee ... None None None None None
73281 P00-1009 We present an LFG-DOP parser which uses fragme... We present an LFG-DOP parser which uses fragme... 1356246 ad005b3fd0c867667118482227e31d9378229751 ... None None None None None
73282 P99-1056 The processes through which readers evoke ment... The processes through which readers evoke ment... 7277828 924cf7a4836ebfc20ee094c30e61b949be049fb6 ... None None None None None
73283 P99-1051 This paper examines the extent to which verb d... This paper examines the extent to which verb d... 1829043 6b1f6f28ee36de69e8afac39461ee1158cd4d49a ... None None None None None
73284 P00-1013 Spoken dialogue managers have benefited from u... Spoken dialogue managers have benefited from u... 10903652 483c818c09e39d9da47103fbf2da8aaa7acacf01 ... None None None None None
[73285 rows x 21 columns]
```
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
- [Source Data](#source-data)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Repository:** https://github.com/shauryr/ACL-anthology-corpus
- **Point of Contact:** shauryr@gmail.com
### Dataset Summary
Dataframe with extracted metadata (table below with details) and full text of the collection for analysis : **size 489M**
### Languages
en, zh and others
## Dataset Structure
Dataframe
### Data Instances
Each row is a paper from ACL anthology
### Data Fields
| **Column name** | **Description** |
| :---------------: | :---------------------------: |
| `acl_id` | unique ACL id |
| `abstract` | abstract extracted by GROBID |
| `full_text` | full text extracted by GROBID |
| `corpus_paper_id` | Semantic Scholar ID |
| `pdf_hash` | sha1 hash of the pdf |
| `numcitedby` | number of citations from S2 |
| `url` | link of publication |
| `publisher` | - |
| `address` | Address of conference |
| `year` | - |
| `month` | - |
| `booktitle` | - |
| `author` | list of authors |
| `title` | title of paper |
| `pages` | - |
| `doi` | - |
| `number` | - |
| `volume` | - |
| `journal` | - |
| `editor` | - |
| `isbn` | - |
## Dataset Creation
The corpus has all the papers in ACL anthology - as of September'22
### Source Data
- [ACL Anthology](aclanthology.org)
- [Semantic Scholar](semanticscholar.org)
# Additional Information
### Licensing Information
The ACL OCL corpus is released under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). By using this corpus, you are agreeing to its usage terms.
### Citation Information
If you use this corpus in your research please use the following BibTeX entry:
@Misc{acl-ocl,
author = {Shaurya Rohatgi, Yanxia Qin, Benjamin Aw, Niranjana Unnithan, Min-Yen Kan},
title = {The ACL OCL Corpus: advancing Open science in Computational Linguistics},
howpublished = {arXiv},
year = {2022},
url = {https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus}
}
### Acknowledgements
We thank Semantic Scholar for providing access to the citation-related data in this corpus.
### Contributions
Thanks to [@shauryr](https://github.com/shauryr), [Yanxia Qin](https://github.com/qolina) and [Benjamin Aw](https://github.com/Benjamin-Aw-93) for adding this dataset.
提供机构:
WINGNUS
原始信息汇总
数据集概述
数据集名称
- 名称: ACL Anthology Corpus
- 别名: acl-ocl-corpus
数据集描述
- 摘要: 该数据集包含ACL文集的元数据和全文,用于分析。数据集大小为489M。
- 语言: 主要语言为英语(en),以及其他语言。
数据集结构
- 数据实例: 每行代表ACL文集的一篇论文。
- 数据字段:
acl_id: 唯一ACL标识符abstract: 通过GROBID提取的摘要full_text: 通过GROBID提取的全文corpus_paper_id: Semantic Scholar IDpdf_hash: PDF文件的SHA1哈希值numcitedby: 从S2获得引用次数url: 出版物链接publisher: 出版商address: 会议地址year: 年份month: 月份booktitle: 书名author: 作者列表title: 论文标题pages: 页码doi: DOInumber: 编号volume: 卷号journal: 期刊名称editor: 编辑isbn: ISBN
数据集创建
- 源数据: 数据来源于ACL Anthology和Semantic Scholar。
许可证信息
- 许可证: 该数据集根据CC BY-NC 4.0许可证发布。
引用信息
-
引用格式:
@Misc{acl-ocl, author = {Shaurya Rohatgi, Yanxia Qin, Benjamin Aw, Niranjana Unnithan, Min-Yen Kan}, title = {The ACL OCL Corpus: advancing Open science in Computational Linguistics}, howpublished = {arXiv}, year = {2022}, url = {https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus} }
贡献者
- 主要贡献者: @shauryr, Yanxia Qin, Benjamin Aw
搜集汇总
数据集介绍

以上内容由遇见数据集搜集并总结生成



