
Aisha/BAAD16

Source: Hugging Face | Updated: 2022-10-22 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/Aisha/BAAD16
Resource description:
---
annotations_creators:
- found
- crowdsourced
- expert-generated
language_creators:
- found
- crowdsourced
language:
- bn
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: 'BAAD16: Bangla Authorship Attribution Dataset (16 Authors)'
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
---

## Description

**BAAD16** is an **Authorship Attribution dataset for Bengali Literature**. It was collected and analyzed by the authors of [this paper](https://arxiv.org/abs/2001.05316). It was created by scraping text from an online Bangla e-library with a custom web crawler and contains literary works of various famous Bangla writers, including novels, stories, series, and other works of 16 authors. Each sample document contains 750 words. The dataset is imbalanced and therefore resembles real-world scenarios more closely, where not every author has a large number of sample texts. The following table gives more details about the dataset.

| Author Name | Number of Samples | Word Count | Unique Words |
| --- | --- | --- | --- |
| zahir rayhan | 185 | 138k | 20k |
| nazrul | 223 | 167k | 33k |
| manik bandhopaddhay | 469 | 351k | 44k |
| nihar ronjon gupta | 476 | 357k | 43k |
| bongkim | 562 | 421k | 62k |
| tarashonkor | 775 | 581k | 84k |
| shottojit roy | 849 | 636k | 67k |
| shordindu | 888 | 666k | 84k |
| toslima nasrin | 931 | 698k | 76k |
| shirshendu | 1048 | 786k | 69k |
| zafar iqbal | 1100 | 825k | 53k |
| robindronath | 1259 | 944k | 89k |
| shorotchandra | 1312 | 984k | 78k |
| shomresh | 1408 | 1056k | 69k |
| shunil gongopaddhay | 1963 | 1472k | 109k |
| humayun ahmed | 4518 | 3388k | 161k |
| **Total** | 17,966 | 13,474,500 | 590,660 |
| **Average** | 1,122.875 | 842,156.25 | 71,822.25 |

## Citation

If you use this dataset, please cite the paper [Authorship Attribution in Bangla literature using Character-level CNN](https://ieeexplore.ieee.org/abstract/document/9038560/). [arXiv link](https://arxiv.org/abs/2001.05316).

```
@inproceedings{BAAD16Dataset,
  title={Authorship Attribution in Bangla literature using Character-level CNN},
  author={Khatun, Aisha and Rahman, Anisur and Islam, Md Saiful and others},
  booktitle={2019 22nd International Conference on Computer and Information Technology (ICCIT)},
  pages={1--5},
  year={2019},
  organization={IEEE},
  doi={10.1109/ICCIT48885.2019.9038560}
}
```

This dataset is also available on Mendeley: [BAAD16 dataset](https://data.mendeley.com/datasets/6d9jrkgtvv/4). Always make sure to use the latest version of the dataset. To cite the dataset directly, use:

```
@misc{BAAD6Dataset,
  author = {Khatun, Aisha and Rahman, Anisur and Islam, Md. Saiful},
  title = {BAAD16: Bangla Authorship Attribution Dataset},
  year = {2019},
  doi = {10.17632/6d9jrkgtvv.4},
  howpublished = {\url{https://data.mendeley.com/datasets/6d9jrkgtvv/4}}
}
```
Provider:
Aisha
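Summary of original information

Dataset overview

Name: BAAD16: Bangla Authorship Attribution Dataset (16 Authors)

Language: Bengali (bn)

License: CC-BY-4.0

Multilinguality: monolingual

Source data: original

Task category: text classification

Task ID: multi-class classification

Dataset description

BAAD16 is an authorship attribution dataset for Bengali literature. It was created by scraping text from an online Bangla e-library with a custom web crawler and contains literary works by 16 famous Bangla writers, including novels, stories, and series. Each sample document contains 750 words. The dataset is imbalanced, which is closer to real-world conditions, where not every author has a large number of sample texts.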
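The description states only that each sample document contains 750 words; it does not say how words were tokenized or what happened to leftover text at the end of a work. As a minimal sketch under those gaps, the following assumes whitespace tokenization and discards any trailing fragment shorter than 750 words. The function name `split_into_samples` is hypothetical and is not taken from the dataset authors' code.

```python
from typing import List

WORDS_PER_SAMPLE = 750  # sample length stated in the dataset description


def split_into_samples(text: str, size: int = WORDS_PER_SAMPLE) -> List[str]:
    """Split text into consecutive chunks of exactly `size` words.

    Assumptions (not confirmed by the dataset card):
    - words are separated by whitespace;
    - a trailing fragment shorter than `size` is dropped.
    """
    words = text.split()
    full_chunks = len(words) // size
    return [
        " ".join(words[i * size:(i + 1) * size])
        for i in range(full_chunks)
    ]


if __name__ == "__main__":
    # Synthetic 1,600-word text standing in for a scraped literary work.
    demo_text = " ".join(f"w{i}" for i in range(1600))
    samples = split_into_samples(demo_text)
    print(len(samples), [len(s.split()) for s in samples])
```

Dropping the remainder keeps every sample at the same length, which matches the table's word counts being exact multiples of 750; whether the original pipeline did the same is an assumption.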

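Author details

| Author name | Number of samples | Word count | Unique words |
| --- | --- | --- | --- |
| zahir rayhan | 185 | 138k | 20k |
| nazrul | 223 | 167k | 33k |
| manik bandhopaddhay | 469 | 351k | 44k |
| nihar ronjon gupta | 476 | 357k | 43k |
| bongkim | 562 | 421k | 62k |
| tarashonkor | 775 | 581k | 84k |
| shottojit roy | 849 | 636k | 67k |
| shordindu | 888 | 666k | 84k |
| toslima nasrin | 931 | 698k | 76k |
| shirshendu | 1048 | 786k | 69k |
| zafar iqbal | 1100 | 825k | 53k |
| robindronath | 1259 | 944k | 89k |
| shorotchandra | 1312 | 984k | 78k |
| shomresh | 1408 | 1056k | 69k |
| shunil gongopaddhay | 1963 | 1472k | 109k |
| humayun ahmed | 4518 | 3388k | 161k |
| Total | 17,966 | 13,474,500 | 590,660 |
| Average | 1,122.875 | 842,156.25 | 71,822.25 |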
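The totals in the table can be recomputed from the per-author sample counts, and the same counts show how uneven the classes are. The sketch below recomputes the total, the total word count (samples times 750), and the average, then derives inverse-frequency class weights using the common "balanced" heuristic `n_samples / (n_classes * count)`. Class weighting is only one possible way to handle the imbalance and is not described in the dataset card as the authors' method; the helper name `balanced_class_weights` is hypothetical.

```python
from typing import Dict

WORDS_PER_SAMPLE = 750  # stated in the dataset description

# Per-author sample counts copied from the table.
SAMPLES_PER_AUTHOR: Dict[str, int] = {
    "zahir rayhan": 185,
    "nazrul": 223,
    "manik bandhopaddhay": 469,
    "nihar ronjon gupta": 476,
    "bongkim": 562,
    "tarashonkor": 775,
    "shottojit roy": 849,
    "shordindu": 888,
    "toslima nasrin": 931,
    "shirshendu": 1048,
    "zafar iqbal": 1100,
    "robindronath": 1259,
    "shorotchandra": 1312,
    "shomresh": 1408,
    "shunil gongopaddhay": 1963,
    "humayun ahmed": 4518,
}

total_samples = sum(SAMPLES_PER_AUTHOR.values())
total_words = total_samples * WORDS_PER_SAMPLE
mean_samples = total_samples / len(SAMPLES_PER_AUTHOR)


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_classes = len(counts)
    n_samples = sum(counts.values())
    return {name: n_samples / (n_classes * c) for name, c in counts.items()}


weights = balanced_class_weights(SAMPLES_PER_AUTHOR)

if __name__ == "__main__":
    print(total_samples, total_words, mean_samples)
    # Print authors from rarest to most frequent with their weights.
    for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(f"{name:22s} {w:.3f}")
```

The recomputed figures agree with the Total and Average rows for samples and word count. The largest class (humayun ahmed) has roughly 24 times as many samples as the smallest (zahir rayhan), so a weighted loss or stratified splitting is worth considering when training a classifier on this data.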

Citation

If you use this dataset, please cite the following paper:

@inproceedings{BAAD16Dataset,
  title={Authorship Attribution in Bangla literature using Character-level CNN},
  author={Khatun, Aisha and Rahman, Anisur and Islam, Md Saiful and others},
  booktitle={2019 22nd International Conference on Computer and Information Technology (ICCIT)},
  pages={1--5},
  year={2019},
  organization={IEEE},
  doi={10.1109/ICCIT48885.2019.9038560}
}
