
SARA - A Collection of Sensitivity-Aware Relevance Assessments

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/8006819
Description:
Presented here is a collection of Sensitivity-Aware Relevance Assessments for the UC Berkeley labelled subset of the Enron Email Collection. The Hearst [1] labelled version of the Enron Email Collection is a subset of the CMU collection containing 1702 emails that were annotated as part of a class project at UC Berkeley. Students in the Natural Language Processing course were tasked with annotating the emails as relevant or not relevant to 53 different categories. The labelled version of the Enron email collection therefore provides a rich taxonomy of labels that can be used for multiple definitions of sensitivity, such as the Purely Personal and the Personal but in a Professional Context categories. The categories that the emails are labelled for can be seen in [Table 1](#table-1). The files for the labelled version of the Enron Email Collection are available from the UC Berkeley website.

We deploy a topic modelling approach to identify topical themes in the labelled Enron collection. These themes serve as the basis for our information needs, which are in turn used to gather queries and relevance assessments. The notebook for this step is made available by the authors. Two separate crowdsourcing tasks are carried out in the development of SARA. Firstly, query formulations are crowdsourced to represent the information needs and, secondly, relevance assessments are crowdsourced for a pooled set of documents from the labelled Enron collection for each of the information needs.

The SARA Collection of Sensitivity-Aware Relevance Assessments is available through the popular ir_datasets library. More information can be found on the ir_datasets GitHub and website.

Information Needs

To create our set of sensitivity-aware relevance assessments for the labelled Enron email collection, we first identify a set of topical subjects that reflect the contents of the emails in the collection. We use a topic modelling approach to identify the information needs.
When identifying topics to be used as information needs, we are interested in general themes that relate to the topics of discussion likely to be covered in the contents (i.e., the body) of the emails in the collection. The topics are chosen to be broad enough that relevant documents can reasonably be expected in the collection, and not so specific that specialist knowledge would be required to judge relevance. Subsequently, we manually construct short passages of text to serve as descriptions of the information needs that the crowdworkers are to search for in the collection. The information needs shown to the crowdworkers are available in the information_needs.tsv file.

Queries

In order to collect relevance assessments for pairs of emails and information needs, different query formulations are first needed to generate pools of documents. Query formulations for each topic are collected from crowdworkers on the Prolific crowdwork platform. Ten information needs are shown to each crowdworker, who is asked to provide, for each one, a query formulation that they would use to retrieve relevant documents satisfying that information need. Three queries for each of the fifty information needs are released. The resulting queries are available in the repeated_queries.tsv file.

Relevance Assessments

Crowdworkers are shown an information need and an email and asked to rate the document as being either Highly Relevant, Partially Relevant, or Not Relevant to the information need. Each information need/email pair is judged by three crowdworkers and a majority vote is used to generate a ground truth label. Since there are three possible labels and three judgements per pair, it is possible for each label to be selected by exactly one crowdworker, leaving no majority. In practice, this only happened for 134 pairs. In such cases, ties are broken by having one of the authors read the document and make an additional judgement. To ensure that sensitive documents definitely have relevance labels, they were also judged by one of the authors for each of the information needs.

The relevance assessments are available in the repeated_qrels.txt file, in the format 'query iteration document relevancy'. The iteration column is used for ir_datasets and can be safely ignored. The document name is the filename used in the labelled Enron collection.

Table 1

| 1) Coarse genre | 2) Included/forwarded information | 3) Primary topics (if coarse genre 1.1 is selected) | 4) Emotional tone (if not neutral) |
| --- | --- | --- | --- |
| 1.1 Company Business, Strategy, etc. (See 3) | 2.1 Includes new text in addition to forwarded material | 3.1 Regulations and regulators (includes price caps) | 4.1 Jubilation |
| 1.2 Purely Personal | 2.2 Forwarded email(s) including replies | 3.2 Internal projects -- progress and strategy | 4.2 Hope / anticipation |
| 1.3 Personal but in professional context (e.g., it was good working with you) | 2.3 Business letter(s) / document(s) | 3.3 Company image -- current | 4.3 Humor |
| 1.4 Logistic Arrangements (meeting scheduling, technical support, etc.) | 2.4 News article(s) | 3.4 Company image -- changing / influencing | 4.4 Camaraderie |
| 1.5 Employment arrangements (job seeking, hiring, recommendations, etc.) | 2.5 Government / academic report(s) | 3.5 Political influence / contributions / contacts | 4.5 Admiration |
| 1.6 Document editing/checking (collaboration) | 2.6 Government action(s) (such as results of a hearing, etc.) | 3.6 California energy crisis / California politics | 4.6 Gratitude |
| 1.7 Empty message (due to missing attachment) | 2.7 Press release(s) | 3.7 Internal company policy | 4.7 Friendship / affection |
| 1.8 Empty message | 2.8 Legal documents (complaints, lawsuits, advice) | 3.8 Internal company operations | 4.8 Sympathy / support |
| | 2.9 Pointers to url(s) | 3.9 Alliances / partnerships | 4.9 Sarcasm |
| | 2.10 Newsletters | 3.10 Legal advice | 4.10 Secrecy / confidentiality |
| | 2.11 Jokes, humor (related to business) | 3.11 Talking points | 4.11 Worry / anxiety |
| | 2.12 Jokes, humor (unrelated to business) | 3.12 Meeting minutes | 4.12 Concern |
| | 2.13 Attachment(s) (assumed missing) | 3.13 Trip reports | 4.13 Competitiveness / aggressiveness |
| | | | 4.14 Triumph / gloating |
| | | | 4.15 Pride |
| | | | 4.16 Anger / agitation |
| | | | 4.17 Sadness / despair |
| | | | 4.18 Shame |
| | | | 4.19 Dislike / scorn |

The Sensitivity-Aware Relevance Assessments dataset is held under an Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence, which allows for it to be adapted, transformed and built upon. Questions and comments are welcomed via email.

References

[1] Marti A. Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proc. of Workshop on Effective Tools and Methodologies for Teaching NLP and CL.
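
As a rough illustration of the relevance assessment procedure and file format described in this record, the sketch below shows one way the three-way majority vote and the 'query iteration document relevancy' line format could be handled in Python. It is not the authors' code. The function and class names are hypothetical, the assumption that fields are whitespace-separated and that relevancy is stored as an integer is ours, and the sample line uses made-up identifiers.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence

# The three labels named in the description.
LABELS = ("Not Relevant", "Partially Relevant", "Highly Relevant")


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label picked by at least two of three assessors.

    Returns None when all three assessors disagree, i.e. the case that
    the description says is resolved by an additional author judgement.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown label: {vote!r}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


class Qrel(NamedTuple):
    query_id: str
    iteration: str  # can be ignored, per the description
    doc_id: str     # filename in the labelled Enron collection
    relevance: int  # assumption: relevancy is stored as an integer


def parse_qrel_line(line: str) -> Qrel:
    """Parse one 'query iteration document relevancy' line.

    Assumes the four fields are separated by whitespace.
    """
    query_id, iteration, doc_id, relevance = line.split()
    return Qrel(query_id, iteration, doc_id, int(relevance))


if __name__ == "__main__":
    # Two of three assessors agree, so a majority label exists.
    print(majority_label(["Highly Relevant", "Highly Relevant", "Not Relevant"]))
    # All three disagree, so the pair would need a tie-break judgement.
    print(majority_label(list(LABELS)))
    # Hypothetical line; the identifiers are invented for illustration.
    print(parse_qrel_line("1 0 12345 2"))
```

Returning `None` for a three-way split keeps the tie-break step explicit, so such pairs can be collected and passed to a human judge rather than silently assigned a label.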
Created:
2023-06-06