coastalcph/medical-bios
Dataset Description
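The dataset consists of English biographies labeled with occupation and binary gender. The task is occupation classification, which makes it possible to study gender bias. The dataset contains 10,000 biographies (8k train / 1k dev / 1k test) covering 5 medical occupations (psychologist, surgeon, nurse, dentist, physician), derived from De-Arteaga et al. (2019). We collect and release human rationale annotations for a subset of 100 biographies in two settings: non-contrastive and contrastive. In the non-contrastive setting, annotators were asked to find the rationale for the question "Why is the person in the following short bio described as a L?", where L is the gold label occupation, e.g., nurse. In the contrastive setting, the question was "Why is the person in the following short bio described as a L rather than a F?", where F (the foil) is another medical occupation, e.g., physician.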
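The two annotation questions differ only in whether a foil is named. The sketch below illustrates this with a hypothetical helper, `build_question`, which is not part of the dataset or its tooling; it simply fills the question templates quoted above with occupation names.

```python
from typing import Optional


def build_question(label: str, foil: Optional[str] = None) -> str:
    """Build the annotation question for a gold label and an optional foil.

    `build_question` is a hypothetical helper for illustration only.
    Without a foil it yields the non-contrastive question; with a foil it
    yields the contrastive one.
    """
    question = f"Why is the person in the following short bio described as a {label}"
    if foil is not None:
        # Contrastive setting: name the alternative occupation explicitly.
        question += f" rather than a {foil}"
    return question + "?"


print(build_question("nurse"))
print(build_question("nurse", foil="physician"))
```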
Dataset Structure
We provide the `standard` version of the dataset, where examples look as follows:
```json
{
  "text": "He has been a practicing Dentist for 20 years. He has done BDS. He is currently associated with Sree Sai Dental Clinic in Sowkhya Ayurveda Speciality Clinic, Chennai. ... ",
  "label": 3
}
```
We also provide the newly curated subset with human rationales, dubbed `rationales`, where examples look as follows:
```json
{
  "text": "She is currently practising at Dr Ravindra Ratolikar Dental Clinic in Narayanguda, Hyderabad.",
  "label": 3,
  "foil": 2,
  "words": ["She", "is", "currently", "practising", "at", "Dr", "Ravindra", "Ratolikar", "Dental", "Clinic", "in", "Narayanguda", ",", "Hyderabad", "."],
  "rationales": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  "contrastive_rationales": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  "annotations": [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...],
  "contrastive_annotations": [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...]
}
```
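The binary masks in the `rationales` example can be mapped back onto `words` to see which tokens were highlighted. The following is a minimal sketch, not an official utility: `rationale_tokens` and `token_agreement` are hypothetical helper names. It assumes mask positions align with `words` from index 0 and that `annotations` holds one mask per annotator. In the example above the masks have 26 entries for 15 words, so the sketch treats the trailing entries as padding and ignores them; that reading is an assumption.

```python
from typing import List, Sequence


def rationale_tokens(words: Sequence[str], mask: Sequence[int]) -> List[str]:
    """Return the words whose mask entry is 1 (hypothetical helper)."""
    # zip stops at the shorter sequence, so mask entries beyond len(words)
    # are ignored (ASSUMPTION: they are padding).
    return [word for word, flag in zip(words, mask) if flag == 1]


def token_agreement(annotations: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of masks that mark each position (hypothetical helper).

    ASSUMPTION: each inner list is the mask of one annotator.
    """
    if not annotations:
        return []
    n = len(annotations)
    return [sum(column) / n for column in zip(*annotations)]


# Values copied from the example record above.
words = ["She", "is", "currently", "practising", "at", "Dr", "Ravindra",
         "Ratolikar", "Dental", "Clinic", "in", "Narayanguda", ",",
         "Hyderabad", "."]
mask = [0] * 8 + [1, 1] + [0] * 16

print(rationale_tokens(words, mask))  # ['Dental', 'Clinic']
```

The same function applies to `contrastive_rationales`, which makes it easy to compare the tokens highlighted in the two settings for a given biography.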
Usage
Load the `standard` version of the dataset:
```python
from datasets import load_dataset

dataset = load_dataset("coastalcph/medical-bios", "standard")
```
Load the newly curated subset with human rationales:
```python
from datasets import load_dataset

dataset = load_dataset("coastalcph/medical-bios", "rationales")
```
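Once loaded, a quick sanity check is to count examples per occupation. The sketch below works on any iterable of records with a `label` field, so it runs here on toy rows without downloading anything. The `ID_TO_OCCUPATION` mapping is an assumption: it is inferred from the order in which the occupations are listed in the description and from the examples above (label 3 on dentist biographies), and should be checked against the dataset's own label metadata. `label_distribution` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# ASSUMPTION: ids follow the order in which the occupations are listed in the
# dataset description; verify against the dataset's label metadata.
ID_TO_OCCUPATION = {
    0: "psychologist",
    1: "surgeon",
    2: "nurse",
    3: "dentist",
    4: "physician",
}


def label_distribution(rows: Iterable[Mapping]) -> Dict[str, int]:
    """Count records per occupation name (hypothetical helper)."""
    counts = Counter(
        ID_TO_OCCUPATION.get(row["label"], "unknown") for row in rows
    )
    return dict(counts)


# Toy rows standing in for dataset records.
toy_rows = [
    {"text": "...", "label": 3},
    {"text": "...", "label": 2},
    {"text": "...", "label": 3},
]
print(label_distribution(toy_rows))
```

With the real data, the same call could be applied to one split of the loaded `dataset` object, for example `label_distribution(dataset["train"])`; the split name `train` is assumed here rather than taken from the card.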
Citation
@inproceedings{eberle-etal-2023-rather,
    title = "Rather a Nurse than a Physician - Contrastive Explanations under Investigation",
    author = "Eberle, Oliver and Chalkidis, Ilias and Cabello, Laura and Brandl, Stephanie",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.427",
}



