Sample data for "Design and Collection Challenges of Building an Academic Email Corpus for Linguistics and Computational Research"

Name: Sample data for "Design and Collection Challenges of Building an Academic Email Corpus for Linguistics and Computational Research"
Creator: arizona.figshare.com
Published: 2023-05-30 00:00:00
License: No description available

arizona.figshare.com | Updated: 2023-05-30 | Indexed: 2025-03-23

Download link:

https://arizona.figshare.com/articles/dataset/Sample_data_for_Design_and_Collection_Challenges_of_Building_an_Academic_Email_Corpus_for_Linguistics_and_Computational_Research_/14259785/1

Description:

This dataset contains anonymized email chains between students and instructors in an active learning-teaching relationship. The data was collected from several departments at the University of Arizona.

The data comprises seven email chains (conversations) containing a total of 27 email texts. All data is presented in JSON text files. Some of the emails contain a few words in languages other than English, which is why we have encoded all files using UTF-8. We have enriched the email chains' metadata with the gender, age range, first language(s), and additional language(s) as reported by participants through a questionnaire. For most languages, we use the ISO 639-1 alpha-2 code. For languages that are not present in ISO 639-1, we report them as written by the participant. ISO 639-1 language codes can be found at the Library of Congress standards page: https://www.loc.gov/standards/iso639-2/php/code_list.php

The data is a sample taken from the Multilingual College Email Corpus (MCEC), an ongoing collection of authentic academic emails that began data collection in Fall 2020. We expect to publish the full corpus within this deposit at a future, undetermined date.

The researchers have received IRB approval and participants' consent for publishing this data after anonymization under the following protocol name and number: "Collection and Analysis of Authentic College Emails", 2004533142.

Our anonymization process can be found at: https://github.com/MCECorpus/MCEC-DeID

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu
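The description says the files are UTF-8 encoded JSON and that language values are mostly ISO 639-1 alpha-2 codes, with free-text entries where no code exists. A minimal Python sketch of reading such files might look like the following. The function names `load_chains` and `looks_like_iso639_1`, the `.json` file extension, and the directory layout are assumptions made for illustration; the actual file names and JSON schema are not given in this listing and should be checked against the README.txt in the deposit.

```python
import json
from pathlib import Path


def load_chains(directory):
    """Parse every *.json file in `directory`, decoding explicitly as UTF-8.

    Assumption: the email chains are stored as files ending in .json.
    Nothing is assumed about the structure inside each file.
    """
    chains = {}
    for path in sorted(Path(directory).glob("*.json")):
        # Pass the encoding explicitly so non-English words decode the same
        # way on every platform, regardless of the system default.
        with path.open(encoding="utf-8") as handle:
            chains[path.name] = json.load(handle)
    return chains


def looks_like_iso639_1(value):
    """Return True if `value` has the shape of an ISO 639-1 alpha-2 code.

    This is a shape check only (two lowercase ASCII letters). It does not
    confirm that the code is actually assigned in ISO 639-1.
    """
    return (
        len(value) == 2
        and value.isascii()
        and value.isalpha()
        and value.islower()
    )
```

A shape check like this could help separate coded language values from the ones participants wrote out in full, though any real validation would need the official code list linked in the description.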

Provider:

arizona.figshare.com
