
DEPCC

Source: arXiv | Updated: 2018-03-01 | Indexed: 2024-06-21
Download link:
https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/depcc.html
Resource overview:
DEPCC is described as the largest linguistically analyzed English-language corpus to date. It was created by the Language Technology Group in the Department of Informatics at the University of Hamburg. The corpus comprises 365 million documents totaling 252 billion tokens and 7.5 billion named entities, all drawn from a web-scale crawl of the Common Crawl project. It was built with a scalable software implementation on the MapReduce framework, and every document is dependency-parsed and tagged with named entities. Intended applications include training syntax-based word embeddings, open information extraction, and question answering, addressing the need for large-scale data in natural language processing (NLP).
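The headline figures above (documents, tokens, named entities) are the kind of totals a MapReduce job aggregates. The following is a minimal Python sketch of that pattern, not the DEPCC authors' implementation. The input layout (a dict with a `tokens` list of `(word, ner_tag)` pairs), the `"O"` tag for non-entity tokens, and all function names are assumptions made for illustration. As a simplification, it counts entity-tagged tokens rather than whole entity mentions.

```python
from collections import Counter
from functools import reduce


def map_document(doc):
    """Map step: turn one document into a small Counter of statistics.

    `doc` is a hypothetical record: {"tokens": [(word, ner_tag), ...]},
    where the tag "O" marks a token outside any named entity.
    """
    tokens = doc["tokens"]
    return Counter(
        documents=1,
        tokens=len(tokens),
        entity_tokens=sum(1 for _, tag in tokens if tag != "O"),
    )


def reduce_counts(left, right):
    """Reduce step: merge two partial results by summing their counts."""
    return left + right


def corpus_statistics(docs):
    """Run the map step over every document and fold the partial results."""
    return reduce(reduce_counts, map(map_document, docs), Counter())


if __name__ == "__main__":
    sample = [
        {"tokens": [("Hamburg", "LOCATION"), ("is", "O"), ("large", "O")]},
        {"tokens": [("Hello", "O")]},
    ]
    print(dict(corpus_statistics(sample)))
```

Because the reduce step is associative, partial counters can be merged in any order, which is what lets this kind of aggregation be spread across many workers.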
Provided by:
Language Technology Group, Department of Informatics, University of Hamburg
Created:
2017-10-05