
Bengali Idiom Detection: A BIO-Annotated Dataset for Figurative Language Processing

Source: DataCite Commons. Updated 2026-04-27; indexed 2026-05-04.
Download link:
https://data.mendeley.com/datasets/dw7gpmz39m/1
Description:
This dataset provides a Bengali corpus for idiom identification and sequence tagging, in which idioms are marked with token-level BIO tags. It contains 30,100 sentences covering both idiomatic and non-idiomatic uses. Each idiom appears in 20-25 sentences that represent a range of usage contexts. For each idiom:

- 60% of the sentences use the expression idiomatically
- 40% use it literally or non-idiomatically, and are designed as hard negative samples

This design helps machine learning models learn the subtle differences between literal and figurative uses of the same expression.

Data Structure: Every row in the dataset is a JSON object with the following fields:

- id: a unique ID for the instance
- sentence: a Bengali sentence containing an idiomatic or literal expression
- tokens: the tokenized representation of the sentence
- labels: BIO-formatted labels marking idiomatic spans (B-IDIOM, I-IDIOM, O)
- is_idiom: a binary label indicating whether the sentence contains an idiom

Example format:

{"id":3310,"sentence":"মহান লেখকের মৃত্যু আমাদের জন্য এক ইন্দ্রপতন","tokens":["মহান","লেখকের","মৃত্যু","আমাদের","জন্য","এক","ইন্দ্রপতন"],"labels":["O","O","O","O","O","O","B-IDIOM"],"is_idiom":1}

To promote linguistic diversity and robustness, each idiom appears in several sentence variants with differing syntax and contexts. The corpus is suitable for token-level sequence labeling, sentence classification, transformer-based Bengali NLP, and figurative language research in low-resource settings.

Bengali Idiom Lexicon: This file is a curated collection of nearly 1,300 Bengali idioms with their meanings. Each record contains a unique ID, the idiom, and its meaning in Bengali.

Example:

{"id":1290,"idiom":"হাতে আকাশ পাওয়া","meaning":"অভাবিতভাবে কিছু পাওয়া"}

All data is provided in UTF-8 JSON Lines (.jsonl) format for easy integration into NLP pipelines.
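
The record layout described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not an official loader for this dataset: the function names (read_jsonl, extract_idiom_spans, parse_record) are hypothetical, the file path passed to read_jsonl is a placeholder because the listing does not name the files, and joining tokens with a space is an assumption about how spans should be rendered. It parses one record, using the example row from the description, and recovers the idiom span from the BIO labels.

```python
import json


def extract_idiom_spans(tokens, labels):
    """Return (start, end, text) for each idiom span; end is exclusive.

    Assumes the label set B-IDIOM / I-IDIOM / O described in the listing.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B-IDIOM":
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I-IDIOM":
            # Tolerate an I- tag without a preceding B- tag (assumption).
            if start is None:
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    # Joining with a space is an illustrative choice, not part of the dataset.
    return [(s, e, " ".join(tokens[s:e])) for s, e in spans]


def parse_record(line):
    """Parse one JSON Lines row and attach the decoded idiom spans."""
    record = json.loads(line)
    record["spans"] = extract_idiom_spans(record["tokens"], record["labels"])
    return record


def read_jsonl(path):
    """Yield parsed records from a UTF-8 .jsonl file (path is a placeholder)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)


# The example row quoted in the dataset description.
example_line = (
    '{"id":3310,"sentence":"মহান লেখকের মৃত্যু আমাদের জন্য এক ইন্দ্রপতন",'
    '"tokens":["মহান","লেখকের","মৃত্যু","আমাদের","জন্য","এক","ইন্দ্রপতন"],'
    '"labels":["O","O","O","O","O","O","B-IDIOM"],"is_idiom":1}'
)
record = parse_record(example_line)
print(record["spans"])  # [(6, 7, 'ইন্দ্রপতন')]
```

A simple sanity check when loading the full corpus is that a row has at least one span exactly when is_idiom is 1. Whether that holds for every row is not stated in the description.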
Provider:
Mendeley Data
Created:
2026-04-27