Bengali Idiom Detection: A BIO-Annotated Dataset for Figurative Language Processing
Source: DataCite Commons. Updated: 2026-04-27. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/dw7gpmz39m/1
Description:
This dataset is a Bengali corpus for idiom identification and sequence tagging, in which idioms are marked with token-level BIO tags. It contains 30,100 sentences, selected to cover both idiomatic and non-idiomatic uses. Each idiom appears in 20-25 sentences that represent its various usage contexts. For each idiom:
60% of the sentences contain the idiomatic usage of the expression
40% represent literal or non-idiomatic usage, designed as hard negative samples
This design helps machine learning models learn the subtle differences between literal and figurative uses of the same expression.
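The stated split can be checked on any subset of the data. The following is a minimal sketch, assuming each record carries the binary `is_idiom` field described under Data Structure; the `idiomatic_share` helper and the toy records are hypothetical and not part of the dataset.

```python
from collections import Counter


def idiomatic_share(records):
    """Return the fraction of records whose is_idiom flag is 1."""
    counts = Counter(int(record["is_idiom"]) for record in records)
    total = sum(counts.values())
    return counts[1] / total if total else 0.0


# Toy records for illustration only (not taken from the dataset).
toy_records = [
    {"is_idiom": 1},
    {"is_idiom": 1},
    {"is_idiom": 1},
    {"is_idiom": 0},
    {"is_idiom": 0},
]
share = idiomatic_share(toy_records)
print(f"idiomatic share: {share:.2f}")
```

Because the sentence rows shown in this description carry no idiom identifier, this sketch measures the overall share rather than the per-idiom share.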
Data Structure:
Every row in the dataset is a JSON object with the following fields:
id: a unique ID for each instance
sentence: a Bengali sentence containing an idiomatic or literal expression
tokens: the tokenized representation of the sentence
labels: BIO-formatted labels marking idiomatic spans (B-IDIOM, I-IDIOM, O)
is_idiom: a binary label indicating the presence of an idiom in the sentence
Example format:
{"id":3310,"sentence":"মহান লেখকের মৃত্যু আমাদের জন্য এক ইন্দ্রপতন","tokens":["মহান","লেখকের","মৃত্যু","আমাদের","জন্য","এক","ইন্দ্রপতন"],"labels":["O","O","O","O","O","O","B-IDIOM"],"is_idiom":1}
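As an illustration of how the BIO labels map back to idiom spans, here is a minimal sketch that decodes the example row above. The `idiom_spans` helper is hypothetical, and the final consistency check assumes that `is_idiom` is 1 exactly when at least one B-IDIOM label is present, which the description implies but does not state outright.

```python
import json


def idiom_spans(tokens, labels):
    """Decode BIO labels into a list of idiom spans (each a list of tokens)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels differ in length")
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label == "B-IDIOM":
            current = [token]          # start a new span
            spans.append(current)
        elif label == "I-IDIOM":
            if current is None:
                raise ValueError("I-IDIOM without a preceding B-IDIOM")
            current.append(token)      # extend the open span
        elif label == "O":
            current = None             # close any open span
        else:
            raise ValueError(f"unexpected label: {label}")
    return spans


row = json.loads(
    '{"id":3310,"sentence":"মহান লেখকের মৃত্যু আমাদের জন্য এক ইন্দ্রপতন",'
    '"tokens":["মহান","লেখকের","মৃত্যু","আমাদের","জন্য","এক","ইন্দ্রপতন"],'
    '"labels":["O","O","O","O","O","O","B-IDIOM"],"is_idiom":1}'
)
spans = idiom_spans(row["tokens"], row["labels"])
# Assumption: is_idiom mirrors whether any idiom span is tagged.
consistent = bool(spans) == bool(row["is_idiom"])
print(spans, consistent)
```

Raising on an orphaned I-IDIOM label is a design choice for catching annotation errors early; a more lenient reader could instead treat it as the start of a new span.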
To promote linguistic diversity and robustness, each idiom appears in several sentence variants with differing syntax and contexts.
This corpus is suitable for sequence labeling of tokens, sentence classification, Bengali NLP with transformers, and figurative language studies in low-resource environments.
Bengali Idiom Lexicon:
This lexicon is a carefully chosen collection of nearly 1,300 Bengali idioms along with their explanations. Each record consists of a unique ID, the idiom, and its explanation in Bengali.
Example:
{"id":1290,"idiom":"হাতে আকাশ পাওয়া","meaning":"অভাবিতভাবে কিছু পাওয়া"}
All data is provided in UTF-8 JSON Lines (.jsonl) format for easy NLP integration.
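A JSON Lines file holds one JSON object per line, so it can be read with the standard library alone. The sketch below parses the lexicon example above from an in-memory stream and builds an idiom-to-meaning lookup; the `read_jsonl` helper is hypothetical, and the actual file names in the download are not given here, so none are assumed.

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines text stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a lexicon file, using the example record above.
sample = io.StringIO(
    '{"id":1290,"idiom":"হাতে আকাশ পাওয়া","meaning":"অভাবিতভাবে কিছু পাওয়া"}\n'
)
entries = list(read_jsonl(sample))
lexicon = {entry["idiom"]: entry["meaning"] for entry in entries}
print(len(lexicon), entries[0]["id"])
```

For a file on disk, the same function accepts a handle opened with `open(path, encoding="utf-8")`, where `path` points to whichever .jsonl file the download provides.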
Provider:
Mendeley Data
Created:
2026-04-27



