bruhwalkk/economic-telegram-news-corpus-2025
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/bruhwalkk/economic-telegram-news-corpus-2025
Resource overview:
---
license: cc-by-nc-4.0
language:
- ru
size_categories:
- 10K<n<100K
tags:
- economics
- telegram
- narratives
- russian
- news
- social-media
- virality
- nlp
- text-classification
task_categories:
- text-classification
pretty_name: Economic Telegram News Corpus 2025
---
# Economic Telegram News Corpus 2025
A corpus of **31,292 Russian-language economic news posts** collected from 7 major Telegram channels, spanning January 2024 to September 2025. The dataset supports research on economic narrative detection, topic classification, and information diffusion in social media.
## Associated Paper
> **Going Viral: LLM-Based Modeling of Economic Narratives**
## Dataset Description
The raw collection contains 123,273 posts. The economic corpus was constructed by:
1. Removing duplicates and near-duplicates
2. Excluding non-news content
3. Selecting posts assigned to economic topics via an LLM-based classifier (~90% accuracy on the Golden Set)
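The three steps above can be sketched as a small filtering pipeline. This is an illustration only, not the authors' implementation: the card does not say how near-duplicates were detected, so the sketch substitutes a character-level similarity ratio from Python's `difflib`, and the `threshold` value, the function names, and the `is_news` / `is_economic` callables (stand-ins for the content filter and the LLM-based classifier) are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_duplicates(posts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each post; drop exact and near-duplicates.

    `threshold` is a hypothetical similarity cutoff, not a value from the card.
    """
    kept: list[str] = []
    kept_norm: list[str] = []
    for post in posts:
        norm = normalize(post)
        is_dup = any(
            norm == seen or SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norm
        )
        if not is_dup:
            kept.append(post)
            kept_norm.append(norm)
    return kept


def build_corpus(posts, is_news, is_economic, threshold=0.9):
    """Apply the three steps in order: dedupe, keep news, keep economic posts."""
    unique = drop_duplicates(posts, threshold)
    return [p for p in unique if is_news(p) and is_economic(p)]
```

The pairwise comparison is quadratic in the number of posts, which is fine for a demonstration but would be slow on the 123,273 raw posts; a hashing-based method such as MinHash would likely be a better fit at that scale.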
### Virality Score
Each post includes a composite virality score (`viral_final`) computed over a 3-day window after publication:
```
viral_final = 0.45 * viral_static + 0.20 * viral_dynamic + 0.35 * viral_ml
```
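As a worked example of the formula, the sketch below computes the weighted sum in Python. The three weights add up to 1.0, so if each component lies in [0, 1] the composite score also stays in [0, 1]. The component ranges are an assumption here (the card only states the range of `viral_final`), and the function name and input check are illustrative.

```python
import math

# Weights taken from the formula in the dataset card.
WEIGHTS = {"viral_static": 0.45, "viral_dynamic": 0.20, "viral_ml": 0.35}


def viral_final(viral_static: float, viral_dynamic: float, viral_ml: float) -> float:
    """Weighted sum of the three virality components.

    Assumes each component is already scaled to [0, 1] (not stated in the card).
    """
    components = {
        "viral_static": viral_static,
        "viral_dynamic": viral_dynamic,
        "viral_ml": viral_ml,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in components.items())


# The weights form a convex combination.
assert math.isclose(sum(WEIGHTS.values()), 1.0)

# 0.45 * 0.8 + 0.20 * 0.5 + 0.35 * 0.6 = 0.36 + 0.10 + 0.21 = 0.67
score = viral_final(0.8, 0.5, 0.6)
```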
## Columns
| Column | Type | Description |
|--------|------|-------------|
| `message_id` | string (UUID) | Unique message identifier |
| `id_channel` | int | Channel ID (1–7) |
| `message` | string | Full post text (Russian) |
| `viral_final` | float | Composite virality score [0, 1] |
| `is_economic` | bool | Economic post flag |
| `economic_topic` | string | LLM-assigned economic topic (9 categories) |
| `topic_confidence` | float | Topic classification confidence |
| `channel_name` | string | Telegram channel name |
| `channel_w` | float | Channel weight |
| `message_vector` | string | PostgreSQL tsvector (full-text search index) |
| `subscribers` | int | Channel subscriber count |
| `date` | datetime | Exact publication timestamp (UTC) |
| `date_day` | datetime | Publication date (day-level, UTC) |
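A minimal sketch of working with these columns, assuming each row is available as a plain dictionary keyed by column name. The `select_posts` and `to_day` helpers and the `min_confidence` default of 0.8 are hypothetical and not part of the dataset. `to_day` shows one way a day-level value like `date_day` could be derived from `date`; whether the dataset computes it exactly this way is not stated.

```python
from datetime import datetime, timezone


def to_day(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to midnight UTC of the same day."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )


def select_posts(rows, topic: str, min_confidence: float = 0.8):
    """Return economic posts on `topic` whose classifier confidence is high enough.

    `min_confidence` is an illustrative cutoff, not a recommended value.
    """
    return [
        row
        for row in rows
        if row["is_economic"]
        and row["economic_topic"] == topic
        and row["topic_confidence"] >= min_confidence
    ]
```

Filtering on `topic_confidence` trades corpus size for label quality, which may matter given that the topic labels come from a classifier rather than manual annotation.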
## Channels (7)
| Channel | Posts |
|---------|-------|
| Forbes Russia | — |
| Блумберг | — |
| РИА Новости | — |
| Экономика | — |
| Раньше всех. Ну почти | — |
| Банки, деньги, два офшора | — |
| Сигналы РЦБ | — |
## Topics (9)
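| Topic | English gloss | Count |
|-------|---------------|-------|
| Государственная экономическая политика | State economic policy | 9,040 |
| Корпоративные финансы | Corporate finance | 4,572 |
| Макроэкономика | Macroeconomics | 3,978 |
| Санкции и геополитика | Sanctions and geopolitics | 3,582 |
| Рынки капитала | Capital markets | 3,169 |
| Сырьевые рынки | Commodity markets | 2,118 |
| Международная торговля | International trade | 1,825 |
| Другое | Other | 1,608 |
| Валютный рынок | Foreign exchange market | 1,400 |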
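The topic counts in the table sum to the 31,292 posts in the corpus, and the classes are noticeably imbalanced. The sketch below recomputes the total and the per-topic shares from the table; the variable names are illustrative. The largest share can serve as a majority-class baseline when evaluating a topic classifier.

```python
# Counts copied from the topic table in this card.
TOPIC_COUNTS = {
    "Государственная экономическая политика": 9040,
    "Корпоративные финансы": 4572,
    "Макроэкономика": 3978,
    "Санкции и геополитика": 3582,
    "Рынки капитала": 3169,
    "Сырьевые рынки": 2118,
    "Международная торговля": 1825,
    "Другое": 1608,
    "Валютный рынок": 1400,
}

total = sum(TOPIC_COUNTS.values())

# Fraction of the corpus assigned to each topic.
shares = {topic: count / total for topic, count in TOPIC_COUNTS.items()}

# Accuracy of always predicting the most frequent topic.
majority_topic = max(TOPIC_COUNTS, key=TOPIC_COUNTS.get)
majority_baseline = shares[majority_topic]

for topic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{topic}: {share:.1%}")
```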
## Usage
```python
from datasets import load_dataset
ds = load_dataset("bruhwalkk/economic-telegram-news-corpus-2025")
```
## Citation
If you use this dataset, please cite the associated paper:
```bibtex
@article{economic_narratives_2025,
title={Going Viral: LLM-Based Modeling of Economic Narratives},
journal={Записки научных семинаров ПОМИ},
year={2025}
}
```
## License
CC-BY-NC-4.0
Provider:
bruhwalkk



