
Webis Gmane Email Corpus 2019

Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3766984
Description:
The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.

The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:

    {"index": {"_id": ""}}
    {"headers": {"header name": "header value", ...}, "text_plain": "plaintext body", "lang": "en", "segments": [{"end": 99, "label": "paragraph", "begin": 0}, ...], "group": "gmane group name"}

The first line is the Elasticsearch index action with a document UUID. The second line is the actual parsed email with a (reduced and anonymized) set of headers, the detected language, the original Gmane group name, and the predicted content segments as character spans. The Gzip files are splittable every 1,000 records (line pairs) for parallel processing in, e.g., Hadoop.

Available email headers are:

    message_id
    date (yyyy-MM-dd HH:mm:ssZZ)
    subject
    from
    to
    cc
    in_reply_to
    references
    list_id

Available segment classes are:

    paragraph
    closing
    inline_headers
    log_data
    mua_signature
    patch
    personal_signature
    quotation
    quotation_marker
    raw_code
    salutation
    section_heading
    tabular
    technical
    visual_separator

Find more information about the dataset and the segmentation model at webis.de.

If you are using this resource in your work, please cite it as:

    @InProceedings{stein:2020o,
      author    = {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
      booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
      month     = jul,
      publisher = {Association for Computational Linguistics},
      site      = {Seattle, USA},
      title     = {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
      year      = 2020
    }
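The two-line record layout described above (an Elasticsearch index action followed by the parsed email) could be read along the following lines. This is a minimal sketch, not the authors' tooling. The function names `iter_records`, `segment_texts`, and `read_gzip`, the `SAMPLE` record, and its ID `demo-id` are hypothetical. It also assumes that the files are UTF-8 encoded and that each segment's `begin` and `end` are offsets into `text_plain` with an exclusive `end`, which the description does not state explicitly.

```python
import gzip
import io
import json
from typing import Iterator, List, Tuple


def iter_records(stream) -> Iterator[Tuple[str, dict]]:
    """Yield (document ID, email) pairs from a text stream of line pairs."""
    while True:
        action_line = stream.readline()
        if not action_line:
            break  # end of stream
        if not action_line.strip():
            continue  # tolerate blank lines between records
        doc_line = stream.readline()
        if not doc_line:
            break  # dangling action line without a document
        action = json.loads(action_line)
        email = json.loads(doc_line)
        yield action["index"]["_id"], email


def segment_texts(email: dict) -> List[Tuple[str, str]]:
    """Return (label, text) per segment.

    Assumption: begin/end index into text_plain and end is exclusive.
    """
    text = email.get("text_plain", "")
    return [
        (seg["label"], text[seg["begin"]:seg["end"]])
        for seg in email.get("segments", [])
    ]


def read_gzip(path_or_file) -> Iterator[Tuple[str, dict]]:
    """Read records from a Gzip file (path or binary file object).

    Assumption: the content is UTF-8 encoded.
    """
    with gzip.open(path_or_file, "rt", encoding="utf-8") as stream:
        yield from iter_records(stream)


# Hypothetical record that follows the documented layout.
SAMPLE = (
    '{"index": {"_id": "demo-id"}}\n'
    '{"headers": {"subject": "Hi"}, "text_plain": "Hello list,\\nthanks.", '
    '"lang": "en", "segments": ['
    '{"begin": 0, "end": 11, "label": "salutation"}, '
    '{"begin": 12, "end": 19, "label": "paragraph"}], '
    '"group": "gmane.demo"}\n'
)

if __name__ == "__main__":
    for doc_id, email in iter_records(io.StringIO(SAMPLE)):
        print(doc_id, email["group"], email["lang"])
        for label, text in segment_texts(email):
            print(f"  {label}: {text!r}")
```

For a real corpus file, `read_gzip` would be called with the file's path in place of the in-memory sample. If the files are split into independent Gzip members every 1,000 records, as the description suggests, Python's `gzip` module should read them sequentially without extra handling, since it decompresses concatenated members transparently.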
Created:
2020-06-04