Aisha/BAAD6
Source: Hugging Face. Updated 2022-10-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Aisha/BAAD6
Resource overview:
BAAD6 is a dataset for authorship attribution of Bengali literature, collected and analyzed by Hemayet et al. The texts come from various online posts and blogs and cover six authors, with 350 sample texts per author. The dataset is balanced across authors, but it is relatively small and noisy because of its data sources and cleaning process. It is nevertheless useful for evaluating authorship attribution systems, since it resembles the kind of text commonly found on the internet.
Provider:
Aisha
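Original information summary
Dataset overview
Name: BAAD6: Bangla Authorship Attribution Dataset (6 Authors)
Language: Bengali (bn)
License: CC-BY-4.0
Multilinguality: monolingual
Task category: text classification
Task ID: multi-class classification
Source: original data
Description: BAAD6 is an authorship attribution dataset for Bengali literature. It contains texts by 6 authors, with 350 samples per author, for a total of 2,100 samples. The data comes from various online posts and blogs and is somewhat noisy, but it can be used to evaluate authorship attribution systems.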
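
Because the classes are balanced, the chance-level accuracy that any attribution system has to beat follows directly from the figures in the description. The snippet below is only an illustrative sketch of that arithmetic; the variable names are hypothetical and nothing here is taken from the dataset authors' code.

```python
# Figures taken from the dataset description.
NUM_AUTHORS = 6
SAMPLES_PER_AUTHOR = 350

# Total number of sample texts in the dataset.
total_samples = NUM_AUTHORS * SAMPLES_PER_AUTHOR

# With perfectly balanced classes, always predicting one author
# (or guessing uniformly at random) is right 1 time in NUM_AUTHORS.
chance_accuracy = 1 / NUM_AUTHORS

print(f"total samples: {total_samples}")
print(f"chance-level accuracy: {chance_accuracy:.1%}")
```

A classifier evaluated on BAAD6 should therefore be compared against a baseline of roughly 16.7% accuracy, not 50%.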
Dataset details
| Author | Samples | Words | Unique words |
|---|---|---|---|
| fe | 350 | 357k | 53k |
| ij | 350 | 391k | 72k |
| mk | 350 | 377k | 47k |
| rn | 350 | 231k | 50k |
| hm | 350 | 555k | 72k |
| rg | 350 | 391k | 58k |
| Total | 2,100 | 2,304,338 | 230,075 |
| Average | 350 | 384,056.33 | 59,006.67 |
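
The summary rows can be cross-checked against the per-author rows with a few lines of Python. This is a minimal sketch that uses only the numbers in the table; the per-author word counts are the rounded "k" values, so sums derived from them are approximate, and the per-sample lengths are estimates that are not reported in the source. Note that the total of unique words (230,075) is smaller than the sum of the per-author figures, presumably because authors share vocabulary.

```python
# Rounded per-author word counts in thousands, copied from the table.
words_k = {"fe": 357, "ij": 391, "mk": 377, "rn": 231, "hm": 555, "rg": 391}
SAMPLES_PER_AUTHOR = 350

# Exact total word count reported in the "Total" row.
TOTAL_WORDS = 2_304_338
num_authors = len(words_k)

# Mean words per author; should match the "Average" row.
mean_words_per_author = TOTAL_WORDS / num_authors
print(f"mean words per author: {mean_words_per_author:,.2f}")

# Sum of the rounded per-author figures, in thousands (approximate).
approx_total_k = sum(words_k.values())
print(f"sum of rounded per-author counts: about {approx_total_k}k words")

# Estimated average sample length per author, from the rounded counts.
words_per_sample = {
    author: round(k * 1000 / SAMPLES_PER_AUTHOR) for author, k in words_k.items()
}
for author, length in words_per_sample.items():
    print(f"{author}: about {length} words per sample")
```

The estimated sample lengths differ considerably between authors (for example, `rn` versus `hm`), which is worth keeping in mind when choosing features that are sensitive to text length.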
Citation information
If you use this dataset, please cite the following paper:
@INPROCEEDINGS{BAAD6Dataset,
  author={Ahmed Chowdhury, Hemayet and Haque Imon, Md. Azizul and Islam, Md. Saiful},
  booktitle={2018 21st International Conference of Computer and Information Technology (ICCIT)},
  title={A Comparative Analysis of Word Embedding Representations in Authorship Attribution of Bengali Literature},
  year={2018},
  pages={1-6},
  doi={10.1109/ICCITECHN.2018.8631977}
}



