
Bangladesh in Transition: A Large-Scale Dataset of 55,000+ English News Articles from Five Major National Newspapers (2024–2025)

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/rrkrtvxmvx
Description:
This dataset comprises 55,368 English-language news articles from five major Bangladeshi newspapers (The Daily Star, Bangladesh Pratidin, Prothom Alo, The Bangladesh Today, Daily Sun), covering the political transition period of June 2024 to June 2025. Harvested with a custom Scrapy spider using a hybrid extraction method (JSON-LD/HTML), the corpus addresses the shortage of South Asian English data for NLP and LLM training.

Research Motivation & Findings
The collection supports empirical investigation of media narratives and linguistic shifts amid political turbulence. Analysis finds marked temporal variation, with publishing peaks in November 2024 (2,044 articles) and May 2025 (2,113 articles) that correspond with major political events. Dominant themes include "Government," "Adviser," "Yunus," and "Reform."

Data Characteristics
Volume: 55,368 articles (average length: 337 words).
Top Sources: The Daily Star (31,292) and Bangladesh Pratidin (18,375).
Methodology: High-fidelity metadata extraction (authors, timestamps) via API/JSON-LD parsing.

Files & Usage
The data is provided in two distinct states to support diverse research needs:

Raw Data Component (Archive State)
Files: news_articles.db (SQLite), news_articles_raw.csv, news_articles_raw.xlsx
Content: Original 55,368 records, including web artifacts and potential duplicates.
Usage: Source verification, reproducibility studies, and duplicate detection research.

Processed Data Component (NLP-Ready State)
Files: news_articles_clean.csv, news_articles_clean.xlsx
Content: 54,347 unique records (853 duplicates removed, nulls filtered).
Usage: Optimized for Named Entity Recognition (NER), topic modeling, summarization, and sentiment analysis.
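The description does not say how duplicates were identified or which fields counted as null, so the following is only a minimal sketch of what a raw-to-clean pass of this kind might look like. The column names `url`, `title`, and `content` are assumptions made for illustration, not taken from the dataset; check the header of news_articles_raw.csv before relying on them. Duplicate detection by exact URL match is likewise an assumption.

```python
import csv
from typing import Iterable, Iterator

# Hypothetical column names; the real CSV header may differ.
KEY_FIELD = "url"
REQUIRED_FIELDS = ("url", "title", "content")


def clean_records(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows whose required fields are non-blank, skipping repeated keys."""
    seen = set()
    for row in rows:
        # Null filtering: drop rows where a required field is missing or blank.
        if any(not (row.get(field) or "").strip() for field in REQUIRED_FIELDS):
            continue
        # Duplicate removal: keep only the first row seen for each key.
        key = row[KEY_FIELD].strip()
        if key in seen:
            continue
        seen.add(key)
        yield row


def clean_csv(src_path: str, dst_path: str) -> int:
    """Stream src_path through clean_records into dst_path; return rows kept."""
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, extrasaction="ignore"
        )
        writer.writeheader()
        for row in clean_records(reader):
            writer.writerow(row)
            kept += 1
    return kept
```

A call such as `clean_csv("news_articles_raw.csv", "my_clean.csv")` would then write the filtered rows and return how many were kept. The result need not match the published clean file, since the authors' actual criteria are not documented here.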
Created:
2026-01-14