midas/metooma | Social media analysis dataset | Gender studies dataset

hugging_face | Updated 2024-01-18 | Indexed 2024-06-15
Social media analysis
Gender studies
Download link:
https://hf-mirror.com/datasets/midas/metooma
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- cc0-1.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
- text-retrieval
task_ids:
- multi-class-classification
- multi-label-classification
paperswithcode_id: metooma
pretty_name: '#MeTooMA dataset'
dataset_info:
  features:
  - name: TweetId
    dtype: string
  - name: Text_Only_Informative
    dtype:
      class_label:
        names:
          '0': Text Non Informative
          '1': Text Informative
  - name: Image_Only_Informative
    dtype:
      class_label:
        names:
          '0': Image Non Informative
          '1': Image Informative
  - name: Directed_Hate
    dtype:
      class_label:
        names:
          '0': Directed Hate Absent
          '1': Directed Hate Present
  - name: Generalized_Hate
    dtype:
      class_label:
        names:
          '0': Generalized Hate Absent
          '1': Generalized Hate Present
  - name: Sarcasm
    dtype:
      class_label:
        names:
          '0': Sarcasm Absent
          '1': Sarcasm Present
  - name: Allegation
    dtype:
      class_label:
        names:
          '0': Allegation Absent
          '1': Allegation Present
  - name: Justification
    dtype:
      class_label:
        names:
          '0': Justification Absent
          '1': Justification Present
  - name: Refutation
    dtype:
      class_label:
        names:
          '0': Refutation Absent
          '1': Refutation Present
  - name: Support
    dtype:
      class_label:
        names:
          '0': Support Absent
          '1': Support Present
  - name: Oppose
    dtype:
      class_label:
        names:
          '0': Oppose Absent
          '1': Oppose Present
  splits:
  - name: train
    num_bytes: 821738
    num_examples: 7978
  - name: test
    num_bytes: 205489
    num_examples: 1995
  download_size: 408889
  dataset_size: 1027227
---

# Dataset Card for #MeTooMA dataset

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JN4EYU
- **Repository:** https://github.com/midas-research/MeTooMA
- **Paper:** https://ojs.aaai.org//index.php/ICWSM/article/view/7292
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

- The dataset consists of tweets belonging to the #MeToo movement on Twitter, labelled into different categories.
- This dataset includes more data points and more labels than any previous dataset of social media posts about sexual abuse disclosures. Please refer to the Related Datasets section of the publication for detailed information.
- Due to Twitter's development policies, the authors provide only the tweet IDs and corresponding labels; other data can be fetched via the Twitter API.
- The data has been labelled by experts, with the majority vote deciding the final label.
- The authors provide these labels for each of the tweets:
  - Relevance
  - Directed Hate
  - Generalized Hate
  - Sarcasm
  - Allegation
  - Justification
  - Refutation
  - Support
  - Oppose
- The definitions for each task/label are in the main publication.
- Please refer to the accompanying paper https://aaai.org/ojs/index.php/ICWSM/article/view/7292 for statistical analysis of the textual data extracted from this dataset.
- The language of all the tweets in this dataset is English.
- Time period: October 2018 - December 2018
- Suggested use cases of this dataset:
  - Evaluating the usage of linguistic acts such as hate speech and sarcasm in the context of public sexual abuse disclosures.
  - Extracting actionable insights and virtual dynamics of gender roles in sexual abuse revelations.
  - Identifying how influential people were portrayed on public platforms during mass social movements.
  - Polarization analysis based on graph simulations of social nodes of users involved in the #MeToo movement.

### Supported Tasks and Leaderboards

Multi-Label and Multi-Class Classification

### Languages

English

## Dataset Structure

- The dataset is structured in CSV format with TweetID and accompanying labels.
- Train and Test sets are split into respective files.

### Data Instances

Tweet ID and the appropriate labels

### Data Fields

Tweet ID and the applicable labels. Each label is binary, and a single Tweet ID can carry multiple labels.

### Data Splits

- Train: 7978
- Test: 1995

## Dataset Creation

### Curation Rationale

- Twitter was the major source of public disclosures of sexual abuse incidents during the #MeToo movement.
- People expressed their opinions on issues that were previously missing from the social media space.
- This provides an opportunity to study the linguistic behaviours of social media users in an informal setting, so the authors decided to curate this annotated dataset.
- The authors expect this dataset to be of great interest and use to both computational and socio-linguists.
- For computational linguists, it provides an opportunity to model three new complex dialogue acts (allegation, refutation, and justification) and also to study how these acts interact with other linguistic components such as stance, hate, and sarcasm. For socio-linguists, it provides an opportunity to explore how a movement manifests in social media.

### Source Data

- The source of all the data points in this dataset is the Twitter social media platform.

#### Initial Data Collection and Normalization

- All the tweets were mined from Twitter, with initial search parameters identified using keywords from the #MeToo movement.
- Redundant keywords were removed based on manual inspection.
- Public streaming APIs of Twitter were used for querying with the selected keywords.
- The set of tweets was pruned based on text de-duplication and cosine similarity score.
- Non-English tweets were removed.
- The final set was labelled by experts, with the majority vote deciding the final label.
- Please refer to this paper for detailed information: https://ojs.aaai.org//index.php/ICWSM/article/view/7292

#### Who are the source language producers?

Please refer to this paper for detailed information: https://ojs.aaai.org//index.php/ICWSM/article/view/7292

### Annotations

#### Annotation process

- The authors chose not to crowdsource the labeling of this dataset because of its highly sensitive nature.
- The annotators are domain experts with advanced degrees in clinical psychology and gender studies.
- They were provided with a guidelines document containing instructions for each task along with its definitions, labels and examples.
- They studied the document and worked through a few examples to get used to the annotation task.
- They also provided feedback for improving the class definitions.
- The labels are not mutually exclusive: the presence of one label does not imply the absence of another.

#### Who are the annotators?
- The annotators are domain experts with degrees in clinical psychology and gender studies.
- Please refer to the accompanying paper for a detailed description of the annotation process.

### Personal and Sensitive Information

- In keeping with Twitter's policy on data distribution, only the Tweet ID and applicable labels are shared for public use.
- It is highly encouraged to use this dataset for scientific purposes only.
- This dataset collection fully follows the Twitter-mandated guidelines for distribution and usage.

## Considerations for Using the Data

### Social Impact of Dataset

- The authors of this dataset do not intend to conduct a population-centric analysis of the #MeToo movement on Twitter.
- The authors acknowledge that findings from this dataset cannot be used as-is for any direct social intervention; they should be used to assist already existing human intervention tools and therapies.
- Enough care has been taken to ensure that this work does not come off as trying to target a specific person for their personal stance on issues pertaining to the #MeToo movement.
- The authors of this work do not aim to vilify anyone accused in the #MeToo movement in any manner.
- Please refer to the ethics and discussion section of the mentioned publication for appropriate sharing of this dataset and the social impact of this work.

### Discussion of Biases

- The #MeToo movement acted as a catalyst for implementing social policy changes to benefit the members of the community affected by sexual abuse.
- Any work undertaken on this dataset should aim to minimize the bias against minority groups, which might be amplified during sudden outbursts of public reaction in sensitive social media discussions.

### Other Known Limitations

- Considering privacy concerns, social media practitioners should be wary of making automated interventions to aid the victims of sexual abuse, as some people might prefer not to disclose their views.
- Concerned social media users might also withdraw their social information if they find out that it is being used for computational purposes. To uphold personal privacy, it is therefore important to seek subtle individual consent before trying to profile authors involved in online discussions.

## Additional Information

Please refer to this link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JN4EYU

### Dataset Curators

- If you use the corpus in a product or application, then please credit the authors and [Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi](http://midas.iiitd.edu.in) appropriately. Also, if you send us an email, we will be thrilled to know about how you have used the corpus.
- If interested in commercial use of the corpus, send an email to midas@iiitd.ac.in.
- Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi, India disclaims any responsibility for the use of the corpus and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications.
- Please feel free to send us an email:
  - with feedback regarding the corpus.
  - with information on how you have used the corpus.
  - if interested in having us analyze your social media data.
  - if interested in a collaborative research project.

### Licensing Information

[More Information Needed]

### Citation Information

Please cite the following publication if you make use of the dataset: https://ojs.aaai.org/index.php/ICWSM/article/view/7292

```
@article{Gautam_Mathur_Gosangi_Mahata_Sawhney_Shah_2020,
  title={#MeTooMA: Multi-Aspect Annotations of Tweets Related to the MeToo Movement},
  volume={14},
  url={https://aaai.org/ojs/index.php/ICWSM/article/view/7292},
  abstractNote={In this paper, we present a dataset containing 9,973 tweets related to the MeToo movement that were manually annotated for five different linguistic aspects: relevance, stance, hate speech, sarcasm, and dialogue acts. We present a detailed account of the data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.79 to 0.93 k-alpha) due to the domain expertise of the annotators and clear annotation instructions. We analyze the data in terms of geographical distribution, label correlations, and keywords. Lastly, we present some potential use cases of this dataset. We expect this dataset would be of great interest to psycholinguists, socio-linguists, and computational linguists to study the discursive space of digitally mobilized social movements on sensitive issues like sexual harassment.},
  number={1},
  journal={Proceedings of the International AAAI Conference on Web and Social Media},
  author={Gautam, Akash and Mathur, Puneet and Gosangi, Rakesh and Mahata, Debanjan and Sawhney, Ramit and Shah, Rajiv Ratn},
  year={2020},
  month={May},
  pages={209-216}
}
```

### Contributions

Thanks to [@akash418](https://github.com/akash418) for adding this dataset.
Provider:
midas
Summary of original information

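#MeTooMA dataset overview

Dataset description

Dataset summary

  • The dataset consists of tweets related to the #MeToo movement, labelled into different categories.
  • It contains more data points and more labels than any previous social media dataset about sexual abuse disclosures.
  • Because of Twitter's development policies, the authors provide only tweet IDs and the corresponding labels; other data can be fetched through the Twitter API.
  • The data was labelled by experts, with the majority vote deciding the final label.
  • Each tweet carries the following labels:
    • Relevance
    • Directed Hate
    • Generalized Hate
    • Sarcasm
    • Allegation
    • Justification
    • Refutation
    • Support
    • Oppose
  • All tweets are in English.
  • Time period: October 2018 to December 2018.
  • Suggested use cases:
    • Evaluating the usage of linguistic acts (such as hate speech and sarcasm) in public sexual abuse disclosures.
    • Extracting actionable insights and virtual dynamics of gender roles in sexual abuse disclosures.
    • Identifying how influential people are portrayed on public platforms during social movements.
    • Polarization analysis based on graph simulations of the social nodes of users involved in the #MeToo movement.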
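
The majority-vote rule mentioned in the summary can be sketched as follows. This is a minimal illustration, not the authors' procedure: the number of annotators (three here), the example votes, and the way ties are handled are all assumptions.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common vote, or None when the top two counts tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: left unresolved in this sketch (assumption)
    return counts[0][0]


# Hypothetical votes from three annotators for a single tweet.
votes_per_label = {
    "Sarcasm": [1, 0, 1],
    "Allegation": [0, 0, 0],
    "Support": [1, 1, 0],
}

final = {label: majority_label(v) for label, v in votes_per_label.items()}
print(final)
```

With an odd number of annotators and binary labels a tie cannot occur, which is one reason an odd panel size is a convenient choice.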

Supported tasks and leaderboards

Multi-label and multi-class classification.
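
The listing does not name an evaluation metric. As one possible choice for multi-label classification, the sketch below computes per-label, micro-averaged and macro-averaged F1 over multi-hot vectors. The metric choice and the toy predictions are assumptions for illustration only.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (per-label F1 list, micro F1, macro F1) for multi-hot rows."""
    n_labels = len(y_true[0])
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    micro = f1(tot_tp, tot_fp, tot_fn)
    macro = sum(per_label) / n_labels
    return per_label, micro, macro


# Toy example with three labels and three tweets (hypothetical values).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
per_label, micro, macro = multilabel_f1(y_true, y_pred)
print(per_label, micro, macro)
```

Micro averaging pools counts across labels, so frequent labels dominate; macro averaging weights every label equally, which matters when some labels are rare.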

Languages

English.

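Dataset structure

  • The dataset is structured in CSV format, containing tweet IDs and the corresponding labels.
  • The training and test sets are stored in separate files.

Data instances

Tweet ID and the corresponding labels.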

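Data fields

  • TweetId: string.
  • Text_Only_Informative: class label, either "Text Non Informative" or "Text Informative".
  • Image_Only_Informative: class label, either "Image Non Informative" or "Image Informative".
  • Directed_Hate: class label, either "Directed Hate Absent" or "Directed Hate Present".
  • Generalized_Hate: class label, either "Generalized Hate Absent" or "Generalized Hate Present".
  • Sarcasm: class label, either "Sarcasm Absent" or "Sarcasm Present".
  • Allegation: class label, either "Allegation Absent" or "Allegation Present".
  • Justification: class label, either "Justification Absent" or "Justification Present".
  • Refutation: class label, either "Refutation Absent" or "Refutation Present".
  • Support: class label, either "Support Absent" or "Support Present".
  • Oppose: class label, either "Oppose Absent" or "Oppose Present".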
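
The field list above can be turned into a small reader that decodes the 0/1 values into their class names. This is a sketch under assumptions: the CSV header names are taken to match the field names above, and the sample row with its tweet ID is invented for illustration.

```python
import csv
import io

LABEL_COLUMNS = [
    "Text_Only_Informative", "Image_Only_Informative", "Directed_Hate",
    "Generalized_Hate", "Sarcasm", "Allegation", "Justification",
    "Refutation", "Support", "Oppose",
]

# The two "informative" fields use their own naming pattern.
SPECIAL_NAMES = {
    "Text_Only_Informative": ("Text Non Informative", "Text Informative"),
    "Image_Only_Informative": ("Image Non Informative", "Image Informative"),
}


def class_names(column):
    """Return the (class 0, class 1) names for a label column."""
    if column in SPECIAL_NAMES:
        return SPECIAL_NAMES[column]
    base = column.replace("_", " ")
    return (f"{base} Absent", f"{base} Present")


def parse_rows(csv_text):
    """Parse CSV text and decode each 0/1 label into its class name."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {"TweetId": raw["TweetId"]}  # kept as a string, as in the card
        for col in LABEL_COLUMNS:
            row[col] = class_names(col)[int(raw[col])]
        rows.append(row)
    return rows


# Hypothetical sample; the tweet ID and label values are made up.
sample = "TweetId," + ",".join(LABEL_COLUMNS) + "\n"
sample += "1000000000000000001,1,0,0,0,1,0,0,0,1,0\n"

rows = parse_rows(sample)
print(rows[0]["Sarcasm"], "|", rows[0]["Text_Only_Informative"])
```

Keeping TweetId as a string avoids any loss of precision if the IDs later pass through tools that treat large integers as floats.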

Data splits

  • Train: 7978 examples, 821738 bytes.
  • Test: 1995 examples, 205489 bytes.
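
As a quick consistency check, the split sizes above add up to the 9,973 tweets reported in the publication and to the `dataset_size` of 1027227 bytes in the card metadata. The snippet only restates numbers already given in this page.

```python
# (examples, bytes) per split, as listed above.
splits = {"train": (7978, 821738), "test": (1995, 205489)}

total_examples = sum(n for n, _ in splits.values())
total_bytes = sum(b for _, b in splits.values())

for name, (n, _) in splits.items():
    share = round(100 * n / total_examples, 1)
    print(f"{name}: {n} examples ({share}%)")

print("total examples:", total_examples)
print("total bytes:", total_bytes)
```

The split is roughly 80/20 between training and test data.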

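Dataset creation

Curation rationale

  • Twitter was the major source of public disclosures of sexual abuse incidents during the #MeToo movement.
  • People expressed opinions that had previously been missing from social media.
  • This offers an opportunity to study the linguistic behaviour of social media users in an informal setting.
  • The authors expect the dataset to be of great interest and use to both computational linguists and socio-linguists.

Source data

  • All data points come from the Twitter social media platform.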

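Annotations

Annotation process

  • Because of the highly sensitive nature of the data, the authors chose not to use crowdsourcing for labelling.
  • The annotators are domain experts with advanced degrees in clinical psychology and gender studies.
  • They were given a guidelines document covering each task with its definitions, labels and examples.
  • The labels are not mutually exclusive: the presence of one label does not imply the absence of another.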
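
Because the labels are not mutually exclusive, one tweet can carry several of them at once, and it can be useful to count which labels occur together. The sketch below counts pairwise co-occurrences; the three example rows are hypothetical and only a subset of the label columns is used.

```python
from collections import Counter
from itertools import combinations

LABELS = ["Directed_Hate", "Sarcasm", "Allegation", "Support", "Oppose"]


def cooccurrence(rows, labels):
    """Count how often each pair of labels is present on the same tweet."""
    pairs = Counter()
    for row in rows:
        present = [name for name in labels if row[name] == 1]
        # combinations() keeps the order of `labels`, so keys are canonical.
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical annotated rows (1 = present, 0 = absent).
rows = [
    {"Directed_Hate": 1, "Sarcasm": 1, "Allegation": 0, "Support": 0, "Oppose": 1},
    {"Directed_Hate": 0, "Sarcasm": 1, "Allegation": 0, "Support": 0, "Oppose": 1},
    {"Directed_Hate": 0, "Sarcasm": 0, "Allegation": 1, "Support": 1, "Oppose": 0},
]

counts = cooccurrence(rows, LABELS)
for pair, n in counts.most_common():
    print(pair, n)
```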

Annotators

  • The annotators are domain experts with degrees in clinical psychology and gender studies.

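Personal and sensitive information

  • In keeping with Twitter's data distribution policy, only tweet IDs and the applicable labels are shared for public use.
  • It is strongly recommended that this dataset be used for scientific purposes only.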

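Considerations for using the data

Social impact of the dataset

  • The authors do not intend to conduct a population-centric analysis of the #MeToo movement on Twitter.
  • Findings from this dataset should not be used as-is for direct social intervention; they should instead assist existing intervention tools and therapies.

Discussion of biases

  • The #MeToo movement acted as a catalyst for social policy changes intended to benefit members of the community affected by sexual abuse.
  • Any work based on this dataset should aim to reduce bias against minority groups.

Other known limitations

  • Given privacy concerns, social media practitioners should be wary of making automated interventions aimed at victims of sexual abuse.
  • Concerned social media users may withdraw their social information if they find out it is being used for computational purposes.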

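Additional information

Dataset curators

  • If you use the corpus in a product or application, please credit the authors and the Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi appropriately.
  • If you have feedback on the corpus or are interested in collaborating, feel free to get in touch.

Licensing information

[More information needed]

Citation information

If you use this dataset, please cite the following publication: https://ojs.aaai.org/index.php/ICWSM/article/view/7292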

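AI-collected summary
Dataset introduction
Construction
The dataset was built from #MeToo tweets on Twitter and labelled by experts. Relevant tweets were first collected through keyword search and API calls, then de-duplicated and filtered to ensure data quality and relevance. Working from a detailed annotation guide, the experts assigned multiple labels to each tweet, covering dimensions such as stance, hate speech and sarcasm, in order to capture the complex linguistic behaviour in the tweets.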
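
The de-duplication step can be illustrated with a small pruning routine. The dataset card says tweets were pruned using text de-duplication and a cosine similarity score, but gives no details, so everything here is an assumption: the bag-of-words representation, the tokenizer, the 0.8 threshold, and the example tweets.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; the tokenizer is a simple illustrative choice."""
    return Counter(re.findall(r"[a-z0-9#@']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_near_duplicates(tweets, threshold=0.8):
    """Keep a tweet only if it is not too similar to any tweet kept so far."""
    kept, vectors = [], []
    for text in tweets:
        vec = bag_of_words(text)
        if all(cosine(vec, other) < threshold for other in vectors):
            kept.append(text)
            vectors.append(vec)
    return kept


# Hypothetical tweets; the second differs from the first only in punctuation.
tweets = [
    "I stand with survivors #MeToo",
    "I stand with survivors! #MeToo",
    "Reading about the #MeToo hearings today",
]
kept = prune_near_duplicates(tweets)
print(kept)
```

This greedy pass compares each tweet against everything already kept, which is quadratic in the worst case; it is fine for a few thousand tweets but would need an index or hashing scheme at larger scale.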
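Features
The dataset stands out for its richness and fine-grained annotation. It contains 9,973 tweets related to the #MeToo movement, each annotated along several dimensions: relevance, stance, hate speech, sarcasm and dialogue acts. The annotation was carried out by domain experts, which yields high quality and high agreement (k-alpha between 0.79 and 0.93). The dataset was also built in strict compliance with Twitter's data usage policy, providing only tweet IDs and labels to protect user privacy.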
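
The agreement figures above are Krippendorff's alpha values. For readers unfamiliar with the statistic, here is a minimal sketch of alpha for nominal labels, computed as 1 - D_o / D_e from a coincidence matrix. It is a generic implementation, not the authors' code, and the example ratings are a small textbook-style binary case rather than data from this dataset.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: the values the coders gave to one item,
    with None marking a missing rating.
    """
    coincidences = Counter()
    for values in units:
        vals = [v for v in values if v is not None]
        m = len(vals)
        if m < 2:
            continue  # items rated by fewer than two coders carry no information
        for a, b in permutations(vals, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item rated by two coders")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one category ever used; treated as agreement here
    return 1 - observed / expected


# Two coders, ten items, binary label.
coder_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
coder_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(coder_a, coder_b)])
print(round(alpha, 3))
```

In this example the coders agree on six of ten items, yet alpha is close to zero because most of that agreement is what chance alone would produce given how rarely label 1 is used.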
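Usage
To use the dataset, users need to fetch the tweet text through the Twitter API. The dataset provides training and test sets and is suited to multi-label and multi-class classification tasks. It can be used to evaluate the use of linguistic acts in the context of public sexual abuse disclosures, to analyse the virtual dynamics of gender roles in such disclosures, and to study how public figures are portrayed during social movements. When using the dataset, attention should be paid to its social impact, avoiding bias against or harm to specific individuals or groups.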
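
Since only tweet IDs and labels are distributed, the text has to be fetched separately and then joined back to the labels. The sketch below covers only the join step and makes no API calls. The `hydrated` mapping stands in for whatever a Twitter API client returned; its shape, the tweet IDs and the texts are all hypothetical.

```python
def attach_text(label_rows, hydrated):
    """Join label rows with fetched tweet text by TweetId.

    Returns (matched rows with a "text" key, list of TweetIds with no text).
    """
    matched, missing = [], []
    for row in label_rows:
        text = hydrated.get(row["TweetId"])
        if text is None:
            missing.append(row["TweetId"])  # e.g. deleted or protected tweets
        else:
            matched.append({**row, "text": text})
    return matched, missing


# Hypothetical label rows and hydration result.
label_rows = [
    {"TweetId": "1000000000000000001", "Sarcasm": 1, "Support": 0},
    {"TweetId": "1000000000000000002", "Sarcasm": 0, "Support": 1},
]
hydrated = {"1000000000000000001": "example tweet text"}

matched, missing = attach_text(label_rows, hydrated)
print(len(matched), "matched;", len(missing), "missing")
```

Tracking the missing IDs explicitly makes it possible to report how much of the dataset could still be recovered at the time of use.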
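Background and challenges
Background overview
The "#MeTooMA dataset" is an annotated dataset of tweets about the #MeToo movement, created in 2020 by the Multimodal Digital Media Analysis Lab (MIDAS) at the Indraprastha Institute of Information Technology, New Delhi, India. It was annotated by experts and contains 9,973 tweets related to the MeToo movement, labelled for five linguistic aspects: relevance, stance, hate speech, sarcasm and dialogue acts. The data covers October to December 2018. The central research question is how to analyse the discourse of digital mobilization around sensitive issues such as sexual harassment on social media. The dataset is of high research value to psycholinguists, socio-linguists and computational linguists, and supports a deeper understanding of the discursive space of digitally mobilized social movements.
Current challenges
The main challenges in building the dataset were accurately selecting #MeToo-related content from a large volume of social media data and ensuring high-quality, highly consistent annotation. In addition, because of Twitter's data usage policy, the dataset provides only tweet IDs and the corresponding labels; the original tweet text must be fetched through the Twitter API, which limits how the data can be used. Further challenges include handling personal privacy concerns in the data and avoiding the amplification of bias against minority groups during analysis.
Common scenarios
Typical use case
In social media contexts involving sexual harassment disclosures, the typical application of the #MeTooMA dataset is multi-label classification of tweets to identify and quantify the linguistic acts associated with the #MeToo movement, such as stance, hate speech, sarcasm and dialogue acts. The dataset is particularly suited to evaluating how these linguistic acts interact in the context of public disclosures, and to studying the virtual dynamics of gender roles in such disclosures.
Derived work
The #MeTooMA dataset has given rise to a range of related research, including analyses of gender roles on social media, studies on the detection of hate speech and sarcasm, and psycholinguistic models built on the dataset. This work has extended the dataset's range of applications and explored it in greater depth.
Recent research on the dataset
Latest research directions
Recent research on the #MeTooMA dataset focuses on the complex linguistic acts found in social media disclosures of sexual harassment, such as the dialogue acts of allegation, refutation and justification, and on how they interact with stance, hate speech and sarcasm. With its rich set of annotation categories and high expert annotation agreement (k-alpha between 0.79 and 0.93), the dataset gives researchers in psycholinguistics, socio-linguistics and computational linguistics a distinctive platform for exploring the discursive space of digitally mobilized social movements on sensitive issues. Researchers are using it to analyse the virtual dynamics of gender roles in sexual harassment disclosures and the portrayal of public figures in large social movements, with the aim of assisting existing human intervention tools and therapies and of supporting social policy change.
The content above was collected and summarized by AI.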