midas/hindi_discourse
Dataset Overview
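Dataset Description
- Dataset name: Discourse Analysis dataset
- Language: Hindi
- License: other
- Multilinguality: monolingual
- Size category: 1K<n<10K
- Source datasets: original
- Task category: text classification
- Task ID: multi-label classification
- Tags: discourse analysis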
Dataset Structure
Features
- Story_no: the story number, of type int32
- Sentence: the sentence text, of type string
- Discourse Mode: the discourse mode, of type class_label, with the following classes:
  - Argumentative
  - Descriptive
  - Dialogue
  - Informative
  - Narrative
  - Other
Data Splits
- Training set (train): 9,968 examples, 1,998,930 bytes in total
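Only a `train` split is listed, so any evaluation set has to be carved out locally. One possible approach, sketched here, holds out whole stories by `Story_no` so that sentences from the same story never land on both sides of the split. The function name `split_by_story`, the 20% fraction, the seed, and the toy records are illustrative assumptions and not part of the dataset.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_story(
    records: List[Record],
    held_out_fraction: float = 0.2,  # assumed fraction, not prescribed by the dataset
    seed: int = 0,
) -> Tuple[List[Record], List[Record]]:
    """Split records so that each story falls entirely into one partition."""
    # Collect the distinct story numbers and shuffle them reproducibly.
    story_ids = sorted({r["Story_no"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(story_ids)

    # Hold out at least one story.
    n_held_out = max(1, round(len(story_ids) * held_out_fraction))
    held_out_ids = set(story_ids[:n_held_out])

    train = [r for r in records if r["Story_no"] not in held_out_ids]
    held_out = [r for r in records if r["Story_no"] in held_out_ids]
    return train, held_out


# Toy records (placeholders, not real dataset rows): 10 stories, 3 sentences each.
toy_records = [
    {"Story_no": story, "Sentence": f"sentence {i}", "Discourse Mode": "Narrative"}
    for story in range(10)
    for i in range(3)
]
train_part, held_out_part = split_by_story(toy_records)
```

Splitting at the story level rather than the sentence level is a design choice: neighbouring sentences from one story tend to share vocabulary and style, so a sentence-level split could overstate held-out accuracy.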
Data Instances
```json
{
  "Story_no": 15,
  "Sentence": " गाँठ से साढ़े तीन रुपये लग गये, जो अब पेट में जाकर खनकते भी नहीं! जो तेरी करनी मालिक! ” “इसमें मालिक की क्या करनी है? ”",
  "Discourse Mode": "Dialogue"
}
```
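A record shaped like the instance above can be checked against the listed features before use. The following is a minimal sketch using only the standard library. The helper name `parse_record` is hypothetical, the sample sentence is a placeholder, and the integer ids assume that classes are numbered in the order the card lists them, which the card itself does not confirm.

```python
import json

# Class names as listed under "Discourse Mode".
# Assumption: ids follow this listing order (0 = Argumentative, ..., 5 = Other).
DISCOURSE_MODES = [
    "Argumentative",
    "Descriptive",
    "Dialogue",
    "Informative",
    "Narrative",
    "Other",
]


def parse_record(raw: str) -> dict:
    """Parse one JSON record and verify it matches the documented fields."""
    record = json.loads(raw)

    story_no = record.get("Story_no")
    sentence = record.get("Sentence")
    mode = record.get("Discourse Mode")

    if not isinstance(story_no, int):
        raise ValueError("Story_no must be an integer")
    if not isinstance(sentence, str):
        raise ValueError("Sentence must be a string")
    if mode not in DISCOURSE_MODES:
        raise ValueError(f"Unknown discourse mode: {mode!r}")

    return {
        "Story_no": story_no,
        "Sentence": sentence,
        "Discourse Mode": mode,
        "label_id": DISCOURSE_MODES.index(mode),  # assumed id, see note above
    }


# Placeholder record with the same keys as the instance shown in the card.
raw = '{"Story_no": 15, "Sentence": "placeholder sentence", "Discourse Mode": "Dialogue"}'
parsed = parse_record(raw)
```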
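Dataset Creation
Data Collection and Normalization
- Data source: Hindi stories written by well-known authors
- Data collection: gathered from multiple Hindi websites
- Annotators: three native Hindi speakers with college-level education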
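Annotation Process
- Annotators: three native Hindi speakers holding advanced degrees in clinical psychology and gender studies
- Annotation guidelines: annotators received detailed task instructions, definitions, labels, and examples
- Labeling: labels are not mutually exclusive; the presence of one label does not rule out the presence of others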
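Because the labels are not mutually exclusive, a sentence carrying several discourse modes could be represented as a multi-hot vector. The sketch below shows one way to do that. The published `Discourse Mode` feature holds a single class per sentence, so the multi-label input here is a hypothetical illustration, and `to_multi_hot` and the vector order (the class listing order) are assumptions.

```python
from typing import Iterable, List

# Class names as listed in the card; the vector positions follow this order (an assumption).
DISCOURSE_MODES = (
    "Argumentative",
    "Descriptive",
    "Dialogue",
    "Informative",
    "Narrative",
    "Other",
)


def to_multi_hot(modes: Iterable[str]) -> List[int]:
    """Encode a set of discourse modes as a 0/1 vector, one slot per class."""
    present = set(modes)
    unknown = present - set(DISCOURSE_MODES)
    if unknown:
        raise ValueError(f"Unknown discourse modes: {sorted(unknown)}")
    return [1 if mode in present else 0 for mode in DISCOURSE_MODES]


# Hypothetical sentence annotated with two modes at once.
vector = to_multi_hot(["Dialogue", "Narrative"])
# vector == [0, 0, 1, 0, 1, 0]
```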
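Considerations for Using the Data
Social Impact
- Future work: the corpus can be used for downstream tasks such as sentiment analysis, machine translation, textual entailment, and speech synthesis
Known Limitations
- The performance of deep learning models is limited by the small amount of data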
Additional Information
Citation
```bibtex
@inproceedings{dhanwal-etal-2020-annotated,
    title = "An Annotated Dataset of Discourse Modes in {H}indi Stories",
    author = "Dhanwal, Swapnil and Dutta, Hritwik and Nankani, Hitesh and Shrivastava, Nilay and Kumar, Yaman and Li, Junyi Jessy and Mahata, Debanjan and Gosangi, Rakesh and Zhang, Haimin and Shah, Rajiv Ratn and Stent, Amanda",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.149",
    pages = "1191--1196",
    abstract = "In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes argumentative, narrative, descriptive, dialogic and informative. We present a detailed account of the entire data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.87 k-alpha). We analyze the data in terms of label distributions, part of speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand the nature of the linguistic models suitable for capturing the nuances of the embedded discourse structures in the presented corpus.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```



