DOSA

GitHub | Updated 2024-05-10 | Indexed 2024-05-31

Download link:

https://github.com/microsoft/DOSA


Resource summary:

DOSA: A Dataset of Social Artifacts from Diverse Indian Subcultures

Created:

2024-02-23

Summary of original information

Dataset overview

Name: DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures

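Description: DOSA is a dataset of 615 social artifacts from 19 geographic subcultures in India. The artifact names and descriptions were collected through participatory research with 260 participants, using a gamified framework based on collective sensemaking.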

Dataset purpose

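The dataset is used to evaluate how large language models (LLMs) perform across regional subcultures, in particular how well they understand social artifacts.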

Citation

If you use the dataset or the accompanying code, please cite it as follows:

@inproceedings{seth-etal-2024-dosa-dataset,
    title = "{DOSA}: A Dataset of Social Artifacts from Different {I}ndian Geographical Subcultures",
    author = "Seth, Agrima and Ahuja, Sanchit and Bali, Kalika and Sitaram, Sunayana",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.474",
    pages = "5323--5337",
    abstract = "Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering. To be effective globally, these models must be aware of and account for local socio-cultural contexts, making it necessary to have benchmarks to evaluate the models for their cultural familiarity. Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web. Thus, these models exacerbate the inequities, semantic misalignment, and stereotypes from the Web. There has been a growing call for community-centered participatory research methods in NLP. In this work, we respond to this call by using participatory research methods to introduce DOSA, the first community-generated Dataset of 615 Social Artifacts, by engaging with 260 participants from 19 different Indian geographic subcultures. We use a gamified framework that relies on collective sensemaking to collect the names and descriptions of these artifacts such that the descriptions semantically align with the shared sensibilities of the individuals from those cultures. Next, we benchmark four popular LLMs and find that they show significant variation across regional sub-cultures in their ability to infer the artifacts.",
}

Collected summary

Dataset introduction

Construction method

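DOSA was built with a participatory research method designed to capture social artifacts from India's geographic subcultures. Working with 260 participants from 19 Indian geographic subcultures, the authors collected the names and descriptions of 615 social artifacts. The descriptions were produced through a collective sensemaking framework so that their semantics align with the shared sensibilities of people from those cultures. This approach strengthens the cultural representativeness of the dataset and provides a benchmark for evaluating how well language models understand regional culture.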

Usage

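To use DOSA, first create the conda environment by running the `create_env.py` script, then activate the `dosa` environment. Next, set the environment variables `OPENAI_API_KEY` and `HF_TOKEN`, and make sure `PYTHONPATH` is configured correctly. The dataset is mainly used to evaluate and train language models, especially models that need to understand regional cultural context. By running the benchmark, users can measure how a model performs across cultural backgrounds and use the results to improve its cross-cultural adaptability.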
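Before running the benchmark, it can help to confirm that the required variables are actually set. The following is a minimal sketch, not part of the DOSA repository: the variable names come from the usage notes above, while the helper `missing_env_vars` and the script itself are illustrative assumptions.

```python
import os

# Variable names taken from the usage notes; the check itself is illustrative
# and is not a script shipped with the DOSA repository.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN", "PYTHONPATH")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to test without touching the real process environment.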

Background and challenges

Background overview

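DOSA, short for "Dataset of Social Artifacts from Different Indian Geographical Subcultures", was created in 2024 by a research team at Microsoft. Its core research question is how to evaluate and improve the understanding that large language models (LLMs) have of regional cultural context, by collecting and describing social artifacts from India's geographic subcultures. Using participatory research methods, the team worked with 260 participants from 19 Indian geographic subcultures and collected the names and descriptions of 615 social artifacts. The dataset addresses the lack of cultural diversity in existing datasets and offers a new benchmark for evaluating the cultural sensitivity of models worldwide.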

Current challenges

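Building DOSA involved several challenges. First, the collection process had to remain culturally sensitive and accurate, because social artifacts can differ substantially from one subculture to another. Second, diversity and representativeness are difficult to achieve: the dataset needs to cover as many Indian geographic subcultures as possible to avoid cultural bias and omissions. Third, the validity and usefulness of the dataset must be established through rigorous verification and testing, so that it provides meaningful training and evaluation data for large language models. Finally, balancing the complexity of different cultural backgrounds against model interpretability remains an open problem.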

Common scenarios

Typical use cases

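DOSA is typically used for natural language processing tasks set in a socio-cultural context. With 615 social artifacts collected from India's geographic subcultures, it supplies rich cultural background information. Researchers can use it for cross-cultural semantic alignment, cultural sensitivity evaluation, and analysis of how generative models perform across cultural backgrounds.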
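A per-subculture analysis of this kind can be sketched as a simple aggregation. The example below is a hypothetical illustration: the record fields (`subculture`, `artifact`, `guess`), the sample rows, and the case-insensitive exact-match scoring are all assumptions made for the sketch, not the DOSA file format or the metric used by the authors.

```python
from collections import defaultdict


def accuracy_by_subculture(records):
    """Return {subculture: fraction of records whose guess matches the artifact name}.

    Matching is case-insensitive exact match, which is an assumed scoring rule.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        region = rec["subculture"]
        totals[region] += 1
        if rec["guess"].strip().lower() == rec["artifact"].strip().lower():
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}


# Made-up records for illustration only; they are not taken from DOSA.
sample = [
    {"subculture": "region_a", "artifact": "item one", "guess": "Item One"},
    {"subculture": "region_a", "artifact": "item two", "guess": "something else"},
    {"subculture": "region_b", "artifact": "item three", "guess": "item three"},
]

print(accuracy_by_subculture(sample))  # {'region_a': 0.5, 'region_b': 1.0}
```

Grouping scores by subculture rather than reporting a single aggregate is what makes regional variation visible, which is the kind of difference the benchmark is meant to surface.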

Academic problems addressed

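DOSA targets a known weakness in current natural language processing: large language models handle information from non-mainstream cultures poorly. By providing social artifact data from India's geographic subcultures, DOSA helps researchers evaluate and improve model performance in multicultural settings, which reduces cultural bias and semantic misalignment and supports research on cultural inclusiveness worldwide.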

Practical applications

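In practice, DOSA can be used to develop and tune intelligent systems for multicultural users, such as culturally sensitive dialogue systems, cross-cultural educational tools, and localized content generators. Such applications can better understand and adapt to the needs of users from different cultural backgrounds, improving the user experience and promoting cultural exchange and understanding.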

Recent research on the dataset