WINGNUS/ACL-OCL

Name: WINGNUS/ACL-OCL
Creator: WINGNUS
Published: 2023-09-21 00:57:32
License: No description available

Hugging Face · Updated 2023-09-21 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/WINGNUS/ACL-OCL


Resource description:

---
annotations_creators: []
language:
- en
language_creators:
- found
license:
- mit
multilinguality:
- monolingual
paperswithcode_id: acronym-identification
pretty_name: acl-ocl-corpus
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- research papers
- acl
task_categories:
- token-classification
task_ids: []
train-eval-index:
- col_mapping:
    labels: tags
    tokens: tokens
  config: default
  splits:
    eval_split: test
  task: token-classification
  task_id: entity_extraction
---

# Dataset Card for ACL Anthology Corpus

[![License](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

This repository provides the full text and metadata of the ACL Anthology collection (80k articles/posters as of September 2022), together with the .pdf files and the GROBID extractions of those PDFs.

## How is this different from what ACL anthology provides and what already exists?

- We provide PDFs, full text, references and other details extracted by GROBID from the PDFs, while [ACL Anthology](https://aclanthology.org/anthology+abstracts.bib.gz) only provides abstracts.
- A similar corpus, the [ACL Anthology Network](https://clair.eecs.umich.edu/aan/about.php), exists but is now showing its age, with just 23k papers as of December 2016.

```python
>>> import pandas as pd
>>> df = pd.read_parquet('acl-publication-info.74k.parquet')
>>> df
         acl_id                                           abstract                                          full_text  corpus_paper_id                                  pdf_hash  ...  number  volume  journal  editor  isbn
0      O02-2002  There is a need to measure word similarity whe...  There is a need to measure word similarity whe...         18022704  0b09178ac8d17a92f16140365363d8df88c757d0  ...    None    None     None    None  None
1      L02-1310                                                                                                                8220988  8d5e31610bc82c2abc86bc20ceba684c97e66024  ...    None    None     None    None  None
2      R13-1042  Thread disentanglement is the task of separati...  Thread disentanglement is the task of separati...         16703040  3eb736b17a5acb583b9a9bd99837427753632cdb  ...    None    None     None    None  None
3      W05-0819  In this paper, we describe a word alignment al...  In this paper, we describe a word alignment al...          1215281  b20450f67116e59d1348fc472cfc09f96e348f55  ...    None    None     None    None  None
4      L02-1309                                                                                                               18078432  011e943b64a78dadc3440674419821ee080f0de3  ...    None    None     None    None  None
...         ...                                                ...                                                ...              ...                                       ...  ...     ...     ...      ...     ...   ...
73280  P99-1002  This paper describes recent progress and the a...  This paper describes recent progress and the a...           715160  ab17a01f142124744c6ae425f8a23011366ec3ee  ...    None    None     None    None  None
73281  P00-1009  We present an LFG-DOP parser which uses fragme...  We present an LFG-DOP parser which uses fragme...          1356246  ad005b3fd0c867667118482227e31d9378229751  ...    None    None     None    None  None
73282  P99-1056  The processes through which readers evoke ment...  The processes through which readers evoke ment...          7277828  924cf7a4836ebfc20ee094c30e61b949be049fb6  ...    None    None     None    None  None
73283  P99-1051  This paper examines the extent to which verb d...  This paper examines the extent to which verb d...          1829043  6b1f6f28ee36de69e8afac39461ee1158cd4d49a  ...    None    None     None    None  None
73284  P00-1013  Spoken dialogue managers have benefited from u...  Spoken dialogue managers have benefited from u...         10903652  483c818c09e39d9da47103fbf2da8aaa7acacf01  ...    None    None     None    None  None

[73285 rows x 21 columns]
```

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** https://github.com/shauryr/ACL-anthology-corpus
- **Point of Contact:** shauryr@gmail.com

### Dataset Summary

A dataframe with the extracted metadata (see the Data Fields table below) and the full text of the collection, for analysis. **Size: 489M**

### Languages

en, zh and others

## Dataset Structure

Dataframe

### Data Instances

Each row is a paper from the ACL Anthology.

### Data Fields

| **Column name**   | **Description**               |
| :---------------: | :---------------------------: |
| `acl_id`          | unique ACL id                 |
| `abstract`        | abstract extracted by GROBID  |
| `full_text`       | full text extracted by GROBID |
| `corpus_paper_id` | Semantic Scholar ID           |
| `pdf_hash`        | sha1 hash of the pdf          |
| `numcitedby`      | number of citations from S2   |
| `url`             | link to the publication       |
| `publisher`       | -                             |
| `address`         | address of the conference     |
| `year`            | -                             |
| `month`           | -                             |
| `booktitle`       | -                             |
| `author`          | list of authors               |
| `title`           | title of the paper            |
| `pages`           | -                             |
| `doi`             | -                             |
| `number`          | -                             |
| `volume`          | -                             |
| `journal`         | -                             |
| `editor`          | -                             |
| `isbn`            | -                             |

## Dataset Creation

The corpus contains all the papers in the ACL Anthology as of September 2022.

### Source Data

- [ACL Anthology](https://aclanthology.org)
- [Semantic Scholar](https://semanticscholar.org)

## Additional Information

### Licensing Information

The ACL OCL corpus is released under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license. By using this corpus, you are agreeing to its usage terms.

### Citation Information

If you use this corpus in your research, please use the following BibTeX entry:

    @Misc{acl-ocl,
      author = {Shaurya Rohatgi and Yanxia Qin and Benjamin Aw and Niranjana Unnithan and Min-Yen Kan},
      title = {The ACL OCL Corpus: advancing Open science in Computational Linguistics},
      howpublished = {arXiv},
      year = {2022},
      url = {https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus}
    }

### Acknowledgements

We thank Semantic Scholar for providing access to the citation-related data in this corpus.

### Contributions

Thanks to [@shauryr](https://github.com/shauryr), [Yanxia Qin](https://github.com/qolina) and [Benjamin Aw](https://github.com/Benjamin-Aw-93) for adding this dataset.
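
As a minimal sketch of how the dataframe described in the Data Fields table might be inspected, the helper below counts, per `year`, how many rows carry a non-empty `full_text`. The function name `full_text_coverage` and the toy rows are hypothetical and not part of the corpus. The sketch assumes only the column names from the table, and that missing text appears as `None` or an empty string, as the blank `abstract` and `full_text` cells in the sample output suggest.

```python
import pandas as pd


def full_text_coverage(df: pd.DataFrame) -> pd.DataFrame:
    """Count papers, and papers with non-empty full text, for each year."""
    # Treat None/NaN and whitespace-only strings as missing text (an assumption).
    has_text = df["full_text"].fillna("").astype(str).str.strip().ne("")
    return (
        df.assign(has_full_text=has_text)
        .groupby("year")
        .agg(papers=("acl_id", "count"), with_full_text=("has_full_text", "sum"))
        .reset_index()
    )


# Toy rows for illustration only; they are not taken from the corpus.
toy = pd.DataFrame(
    {
        "acl_id": ["X00-0001", "X00-0002", "X01-0001"],
        "year": ["2000", "2000", "2001"],
        "full_text": ["Some text.", "", None],
    }
)
coverage = full_text_coverage(toy)
print(coverage)
```

On the real file, the same helper could be applied to the frame returned by `pd.read_parquet('acl-publication-info.74k.parquet')`. The type of the `year` column is not documented in the card, so it may need converting before sorting or plotting.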

Provider:

WINGNUS

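Summary of original information

Dataset overview

Dataset name

Name: ACL Anthology Corpus
Alias: acl-ocl-corpus

Dataset description

Summary: The dataset contains the metadata and full text of the ACL Anthology, for analysis. Its size is 489M.
Languages: Primarily English (en), plus other languages.

Dataset structure

Data instances: Each row represents one paper from the ACL Anthology.
Data fields:
- acl_id: unique ACL identifier
- abstract: abstract extracted by GROBID
- full_text: full text extracted by GROBID
- corpus_paper_id: Semantic Scholar ID
- pdf_hash: SHA1 hash of the PDF file
- numcitedby: number of citations, obtained from S2
- url: link to the publication
- publisher: publisher
- address: conference address
- year: year
- month: month
- booktitle: book title
- author: list of authors
- title: paper title
- pages: page numbers
- doi: DOI
- number: issue number
- volume: volume number
- journal: journal name
- editor: editor
- isbn: ISBN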
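
The `pdf_hash` field above is described as a SHA1 hash of the PDF. A minimal sketch of how a locally downloaded PDF might be checked against that value follows. The helper names `sha1_of_bytes` and `pdf_matches_hash` are hypothetical, and the sketch assumes the hash is the hexadecimal SHA-1 digest of the raw file bytes, which the source does not state explicitly.

```python
import hashlib
from pathlib import Path


def sha1_of_bytes(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def pdf_matches_hash(pdf_path: Path, expected_sha1: str) -> bool:
    """Compare a local file's SHA-1 with a pdf_hash value.

    Assumes pdf_hash is the hex digest of the unmodified file bytes.
    """
    return sha1_of_bytes(pdf_path.read_bytes()) == expected_sha1.strip().lower()


# SHA-1 hex digests are always 40 characters long, like the pdf_hash
# values shown in the sample dataframe output.
print(len(sha1_of_bytes(b"example bytes")))
```

A mismatch would not necessarily mean a corrupted download: if the PDF was re-generated upstream after the corpus snapshot, its bytes, and therefore its digest, would differ.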

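Dataset creation

Source data: The data comes from the ACL Anthology and Semantic Scholar.

License information

License: The dataset is released under the CC BY-NC 4.0 license.

Citation information

Citation format: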

@Misc{acl-ocl,
  author = {Shaurya Rohatgi and Yanxia Qin and Benjamin Aw and Niranjana Unnithan and Min-Yen Kan},
  title = {The ACL OCL Corpus: advancing Open science in Computational Linguistics},
  howpublished = {arXiv},
  year = {2022},
  url = {https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus}
}

Contributors

Main contributors: @shauryr, Yanxia Qin, Benjamin Aw

The summary above was collected and generated by 遇见数据集.
