Aisha/BAAD16
Source: Hugging Face. Updated 2022-10-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Aisha/BAAD16
---
annotations_creators:
- found
- crowdsourced
- expert-generated
language_creators:
- found
- crowdsourced
language:
- bn
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: 'BAAD16: Bangla Authorship Attribution Dataset (16 Authors)'
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
---
## Description
**BAAD16** is an **Authorship Attribution dataset for Bengali Literature**. It was collected and analyzed by the authors of [this paper](https://arxiv.org/abs/2001.05316). The text was scraped from an online Bangla e-library with a custom web crawler and covers novels, stories, series, and other works by 16 well-known Bangla writers. Each sample document contains 750 words. The dataset is imbalanced, which more closely resembles real-world scenarios, where not every author has a large number of sample texts. The following table gives more details about the dataset.
| Author Name | Number of Samples | Word Count | Unique Words |
| --- | --- | --- | --- |
| zahir rayhan | 185 | 138k | 20k |
| nazrul | 223 | 167k | 33k |
| manik bandhopaddhay | 469 | 351k | 44k |
| nihar ronjon gupta | 476 | 357k | 43k |
| bongkim | 562 | 421k | 62k |
| tarashonkor | 775 | 581k | 84k |
| shottojit roy | 849 | 636k | 67k |
| shordindu | 888 | 666k | 84k |
| toslima nasrin | 931 | 698k | 76k |
| shirshendu | 1048 | 786k | 69k |
| zafar iqbal | 1100 | 825k | 53k |
| robindronath | 1259 | 944k | 89k |
| shorotchandra | 1312 | 984k | 78k |
| shomresh | 1408 | 1056k | 69k |
| shunil gongopaddhay | 1963 | 1472k | 109k |
| humayun ahmed | 4518 | 3388k | 161k |
| **Total** | 17,966 | 13,474,500 | 590,660 |
| **Average** | 1,122.875 | 842,156.25 | 71,822.25 |
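
The two properties described above, fixed 750-word samples and class imbalance, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The helper names `chunk_words` and `inverse_frequency_weights` are hypothetical, whitespace tokenization and dropping the trailing remainder are assumptions, and the weighting formula is the common "balanced" heuristic (total / (classes × count)) rather than anything the dataset card prescribes.

```python
def chunk_words(text: str, size: int = 750) -> list[str]:
    """Split text into consecutive samples of exactly `size` words.

    Assumptions (not stated in the dataset card): words are separated by
    whitespace, and a trailing remainder shorter than `size` is dropped.
    """
    words = text.split()
    full_chunks = len(words) // size
    return [
        " ".join(words[i * size:(i + 1) * size])
        for i in range(full_chunks)
    ]


# Sample counts per author, copied from the table above.
SAMPLES_PER_AUTHOR = {
    "zahir rayhan": 185,
    "nazrul": 223,
    "manik bandhopaddhay": 469,
    "nihar ronjon gupta": 476,
    "bongkim": 562,
    "tarashonkor": 775,
    "shottojit roy": 849,
    "shordindu": 888,
    "toslima nasrin": 931,
    "shirshendu": 1048,
    "zafar iqbal": 1100,
    "robindronath": 1259,
    "shorotchandra": 1312,
    "shomresh": 1408,
    "shunil gongopaddhay": 1963,
    "humayun ahmed": 4518,
}


def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """Return total / (n_classes * count) per class, so rare authors weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {author: total / (n_classes * c) for author, c in counts.items()}


if __name__ == "__main__":
    # A synthetic 1,600-word text yields two full 750-word samples.
    demo_text = " ".join(f"w{i}" for i in range(1600))
    print(len(chunk_words(demo_text)))

    weights = inverse_frequency_weights(SAMPLES_PER_AUTHOR)
    for author in ("zahir rayhan", "humayun ahmed"):
        print(author, round(weights[author], 3))
```

Weights of this kind could be passed to a weighted loss so that authors with few samples are not drowned out by the largest class. Whether that helps on BAAD16 would need to be checked empirically.
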
## Citation
If you use this dataset, please cite the paper [Authorship Attribution in Bangla literature using Character-level CNN](https://ieeexplore.ieee.org/abstract/document/9038560/). An [arXiv version](https://arxiv.org/abs/2001.05316) is also available.
```
@inproceedings{BAAD16Dataset,
title={Authorship Attribution in Bangla literature using Character-level CNN},
author={Khatun, Aisha and Rahman, Anisur and Islam, Md Saiful and others},
booktitle={2019 22nd International Conference on Computer and Information Technology (ICCIT)},
pages={1--5},
year={2019},
organization={IEEE},
doi={10.1109/ICCIT48885.2019.9038560}
}
```
This dataset is also available on Mendeley Data: [BAAD16 dataset](https://data.mendeley.com/datasets/6d9jrkgtvv/4). Always make sure to use the latest version of the dataset. To cite the dataset directly, use:
```
@misc{BAAD6Dataset,
author = {Khatun, Aisha and Rahman, Anisur and Islam, Md. Saiful},
title = {BAAD16: Bangla Authorship Attribution Dataset},
year={2019},
doi = {10.17632/6d9jrkgtvv.4},
howpublished= {\url{https://data.mendeley.com/datasets/6d9jrkgtvv/4}}
}
```



