KyivNotKiev/corpus
Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/KyivNotKiev/corpus
---
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- topic-classification
- sentiment-classification
tags:
- linguistics
- ukraine
- toponyms
- language-policy
- kyivnotkiev
size_categories:
- 10K<n<100K
---
# KyivNotKiev Computational Linguistics Corpus
A balanced, labeled corpus of primarily English texts that mention Ukrainian place names in either
their Ukrainian-derived or Russian-derived spelling (e.g., "Kyiv" vs. "Kiev"), annotated with
context category and sentiment.
## Dataset Description
- **Curated by:** Ivan Dobrovolskyi
- **Language:** Primarily English
- **License:** CC-BY 4.0
- **Paper:** #KyivNotKiev: A Large-Scale Computational Study of Ukrainian Toponym Adoption (forthcoming)
- **Website:** https://kyivnotkiev.org
## Dataset Summary
29,938 texts covering 55 Ukrainian-Russian toponym pairs, drawn from three sources
(Reddit, YouTube, and GDELT news articles). Each text is labeled with:
- **Context category**: politics, war_conflict, sports, culture_arts, food_cuisine, travel_tourism, academic_science, history, business_economy, general_news
- **Sentiment**: positive, neutral, negative
- **Variant**: which toponym form (russian/ukrainian) appears in the text
## Intended Uses
- Studying language policy adoption in media and social platforms
- Training toponym context classifiers
- Analyzing sentiment differences between spelling variants
- Cross-source and temporal analysis of naming conventions
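As an illustration of the sentiment-comparison use case, the sketch below averages `sentiment_score` per group of records. It assumes records are plain dictionaries keyed by the field names documented in this card. The `mean_sentiment` helper and the sample rows are hypothetical and are not part of the dataset or its tooling.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment(records, keys=("variant",)):
    """Average `sentiment_score` per group; a group is the tuple of values for `keys`."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["sentiment_score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Made-up rows for illustration only; they are not taken from the corpus.
sample = [
    {"variant": "russian", "source": "reddit", "sentiment_score": -0.2},
    {"variant": "russian", "source": "gdelt", "sentiment_score": 0.0},
    {"variant": "russian", "source": "reddit", "sentiment_score": -0.4},
    {"variant": "ukrainian", "source": "reddit", "sentiment_score": 0.4},
    {"variant": "ukrainian", "source": "youtube", "sentiment_score": 0.1},
    {"variant": "ukrainian", "source": "gdelt", "sentiment_score": -0.5},
]

by_variant = mean_sentiment(sample)
by_variant_and_source = mean_sentiment(sample, keys=("variant", "source"))
print(by_variant)
```

Passing `keys=("variant", "year")` gives a temporal breakdown in the same way. Because the corpus is balanced by variant within each stratum, counts of each variant in this corpus likely reflect the sampling design rather than real-world adoption rates, so comparisons of label distributions within a variant are probably more meaningful than raw variant frequencies.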
## Dataset Structure
### Data Fields
- `pair_id`: Integer ID of the toponym pair
- `text`: The full text content
- `variant`: Spelling form that appears in the text, either "russian" or "ukrainian"
- `source`: Data source (reddit, youtube, gdelt)
- `year`: Publication year
- `context_label`: Annotated context category
- `context_confidence`: Annotation confidence (0-1)
- `sentiment_label`: Sentiment annotation
- `sentiment_score`: Sentiment score (-1 to 1)
- `word_count`: Number of words in text
- `matched_term`: The specific toponym form found in text
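The field list above can be turned into a lightweight sanity check. The following is a minimal sketch, assuming each record is a dictionary with exactly these field names and that the label strings match the spellings given in this card. The `validate_record` function and the example row are hypothetical.

```python
REQUIRED_FIELDS = (
    "pair_id", "text", "variant", "source", "year", "context_label",
    "context_confidence", "sentiment_label", "sentiment_score",
    "word_count", "matched_term",
)
VARIANTS = {"russian", "ukrainian"}
SOURCES = {"reddit", "youtube", "gdelt"}
SENTIMENT_LABELS = {"positive", "neutral", "negative"}
CONTEXT_LABELS = {
    "politics", "war_conflict", "sports", "culture_arts", "food_cuisine",
    "travel_tourism", "academic_science", "history", "business_economy",
    "general_news",
}


def validate_record(rec):
    """Return a list of problems found in one record (empty if none were found)."""
    missing = [f for f in REQUIRED_FIELDS if f not in rec]
    if missing:
        return [f"missing field: {f}" for f in missing]
    problems = []
    if rec["variant"] not in VARIANTS:
        problems.append(f"unknown variant: {rec['variant']!r}")
    if rec["source"] not in SOURCES:
        problems.append(f"unknown source: {rec['source']!r}")
    if rec["context_label"] not in CONTEXT_LABELS:
        problems.append(f"unknown context_label: {rec['context_label']!r}")
    if rec["sentiment_label"] not in SENTIMENT_LABELS:
        problems.append(f"unknown sentiment_label: {rec['sentiment_label']!r}")
    if not 0.0 <= rec["context_confidence"] <= 1.0:
        problems.append("context_confidence outside [0, 1]")
    if not -1.0 <= rec["sentiment_score"] <= 1.0:
        problems.append("sentiment_score outside [-1, 1]")
    return problems


# Made-up row for illustration only; it is not taken from the corpus.
example = {
    "pair_id": 1, "text": "Flights to Kyiv resume next month.",
    "variant": "ukrainian", "source": "gdelt", "year": 2021,
    "context_label": "travel_tourism", "context_confidence": 0.9,
    "sentiment_label": "neutral", "sentiment_score": 0.1,
    "word_count": 6, "matched_term": "Kyiv",
}
print(validate_record(example))
```

The check deliberately skips `word_count` and `matched_term`, since the card does not say how words were tokenized or how terms were matched.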
### Splits
| Split | Count |
|-------|-------|
| train | 23,950 |
| validation | 2,993 |
| test | 2,993 |
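The split sizes correspond to roughly an 80/10/10 partition, which the short calculation below makes explicit using only the counts from the table. The three counts sum to 29,936, two fewer than the 29,938 texts quoted in the summary, so the totals may be worth re-checking against the released files.

```python
# Counts copied from the Splits table in this card.
SPLIT_COUNTS = {"train": 23_950, "validation": 2_993, "test": 2_993}

total = sum(SPLIT_COUNTS.values())
fractions = {name: count / total for name, count in SPLIT_COUNTS.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {SPLIT_COUNTS[name]:>6}  {frac:.1%}")
print(f"{'total':<10} {total:>6}")
```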
## Balance Report
See `balance_report.json` for detailed per-pair, per-source, per-variant distributions
and documented shortfalls.
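The schema of `balance_report.json` is not described in this card, so the sketch below does not read that file. It only shows how comparable per-pair, per-source, per-variant counts and shortfalls could be derived from the records themselves. The helper names and the `target` parameter are hypothetical.

```python
from collections import Counter


def stratum_counts(records, keys=("pair_id", "source", "variant")):
    """Count records per stratum, where a stratum is the tuple of values for `keys`."""
    return Counter(tuple(rec[k] for k in keys) for rec in records)


def shortfalls(counts, target):
    """Return {stratum: missing count} for observed strata below `target`.

    Strata with no records at all never appear in `counts`, so they are
    not reported here.
    """
    return {stratum: target - n for stratum, n in counts.items() if n < target}


# Made-up rows for illustration only; they are not taken from the corpus.
rows = [
    {"pair_id": 1, "source": "reddit", "variant": "russian"},
    {"pair_id": 1, "source": "reddit", "variant": "russian"},
    {"pair_id": 1, "source": "reddit", "variant": "ukrainian"},
]
counts = stratum_counts(rows)
gaps = shortfalls(counts, target=2)
print(gaps)
```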
## Collection Methodology
1. **Reddit**: Titles and bodies from Arctic Shift API + Reddit search (2010-2026)
2. **YouTube**: Video titles and descriptions via yt-dlp (2010-2026)
3. **GDELT**: News article bodies fetched from URLs using trafilatura (2010-2026)
4. **Balancing**: Stratified sampling by pair × source × variant × year stratum
5. **Annotation**: Llama 3.1 70B-Instruct with human validation on 200 random samples
6. **Fetch transparency**: All GDELT URL fetch attempts logged in `fetch_log.parquet`
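The balancing step (step 4) can be sketched as capped sampling within each stratum. This is an illustration under stated assumptions, not the authors' pipeline: the card does not give the per-stratum quota, the random seed, or how years are binned, so `per_stratum`, `seed`, and the use of the raw `year` value as the stratum key are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each pair x source x variant x year stratum."""
    rng = random.Random(seed)  # local generator so results are reproducible
    strata = defaultdict(list)
    for rec in records:
        key = (rec["pair_id"], rec["source"], rec["variant"], rec["year"])
        strata[key].append(rec)

    sample = []
    for key in sorted(strata):  # fixed iteration order keeps the draw deterministic
        pool = strata[key]
        # A stratum smaller than the quota contributes everything it has.
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


# Made-up rows for illustration only: five Russian-variant posts and one
# Ukrainian-variant post in the same pair, source, and year.
records = [
    {"pair_id": 1, "source": "reddit", "variant": "russian", "year": 2014, "text": f"post {i}"}
    for i in range(5)
] + [{"pair_id": 1, "source": "reddit", "variant": "ukrainian", "year": 2014, "text": "post 5"}]

balanced = stratified_sample(records, per_stratum=2)
print(len(balanced))  # 3: two from the russian stratum, one from the ukrainian stratum
```

Strata that cannot fill the quota remain under-represented after sampling, which is presumably what the documented shortfalls in `balance_report.json` record.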
## Citation
```bibtex
@article{dobrovolskyi2026kyivnotkiev,
title={#KyivNotKiev: A Large-Scale Computational Study of Ukrainian Toponym Adoption},
author={Dobrovolskyi, Ivan},
year={2026}
}
```
Provided by: KyivNotKiev



