KyivNotKiev/corpus

Name: KyivNotKiev/corpus
Creator: KyivNotKiev
Published: 2026-04-05 05:52:05
License: CC-BY 4.0

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/KyivNotKiev/corpus


Resource description:

---
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- topic-classification
- sentiment-classification
tags:
- linguistics
- ukraine
- toponyms
- language-policy
- kyivnotkiev
size_categories:
- 10K<n<100K
---

# KyivNotKiev Computational Linguistics Corpus

A balanced, labeled corpus of texts containing Ukrainian and Russian toponym variants (e.g., "Kyiv" vs "Kiev"), annotated with context categories and sentiment.

## Dataset Description

- **Curated by:** Ivan Dobrovolskyi
- **Language:** Primarily English
- **License:** CC-BY 4.0
- **Paper:** #KyivNotKiev: A Large-Scale Computational Study of Ukrainian Toponym Adoption (forthcoming)
- **Website:** https://kyivnotkiev.org

## Dataset Summary

29,938 texts across 55 Ukrainian-Russian toponym pairs from 4 sources (Reddit, YouTube, GDELT news articles). Each text is labeled with:

- **Context category**: politics, war_conflict, sports, culture_arts, food_cuisine, travel_tourism, academic_science, history, business_economy, general_news
- **Sentiment**: positive, neutral, negative
- **Variant**: which toponym form (russian/ukrainian) appears in the text

## Intended Uses

- Studying language policy adoption in media and social platforms
- Training toponym context classifiers
- Analyzing sentiment differences between spelling variants
- Cross-source and temporal analysis of naming conventions

## Dataset Structure

### Data Fields

- `pair_id`: Integer ID of the toponym pair
- `text`: The full text content
- `variant`: Which spelling form appears, either "russian" or "ukrainian"
- `source`: Data source (reddit, youtube, gdelt)
- `year`: Publication year
- `context_label`: Annotated context category
- `context_confidence`: Annotation confidence (0 to 1)
- `sentiment_label`: Sentiment annotation
- `sentiment_score`: Sentiment score (-1 to 1)
- `word_count`: Number of words in text
- `matched_term`: The specific toponym form found in text

### Splits

| Split | Count |
|-------|-------|
| train | 23,950 |
| validation | 2,993 |
| test | 2,993 |

## Balance Report

See `balance_report.json` for detailed per-pair, per-source, per-variant distributions and documented shortfalls.

## Collection Methodology

1. **Reddit**: Titles and bodies from Arctic Shift API + Reddit search (2010-2026)
2. **YouTube**: Video titles and descriptions via yt-dlp (2010-2026)
3. **GDELT**: News article bodies fetched from URLs using trafilatura (2010-2026)
4. **Balancing**: Stratified sampling by pair × source × variant × year stratum
5. **Annotation**: Llama 3.1 70B-Instruct with human validation on 200 random samples
6. **Fetch transparency**: All GDELT URL fetch attempts logged in `fetch_log.parquet`

## Citation

```bibtex
@article{dobrovolskyi2026kyivnotkiev,
  title={#KyivNotKiev: A Large-Scale Computational Study of Ukrainian Toponym Adoption},
  author={Dobrovolskyi, Ivan},
  year={2026}
}
```
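The balancing step in the collection methodology (stratified sampling by pair × source × variant × year) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the curator's actual pipeline: the function name `stratified_sample`, the `per_stratum` cap, the fixed seed, and the toy records are all hypothetical, and only the field names come from the dataset card.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum, seed=0):
    """Keep at most `per_stratum` records from each
    (pair_id, source, variant, year) stratum.

    Hypothetical helper: the cap and the seed are assumptions,
    not values documented by the dataset card.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        key = (rec["pair_id"], rec["source"], rec["variant"], rec["year"])
        strata[key].append(rec)

    sample = []
    for key in sorted(strata):
        group = strata[key]
        if len(group) <= per_stratum:
            # Shortfall: the stratum has too few texts, so keep them all.
            sample.extend(group)
        else:
            sample.extend(rng.sample(group, per_stratum))
    return sample


# Toy records using the field names from the card (contents are made up).
records = [
    {"pair_id": 1, "source": "reddit", "variant": "ukrainian",
     "year": 2022, "text": f"post {i}"}
    for i in range(5)
] + [
    {"pair_id": 1, "source": "reddit", "variant": "russian",
     "year": 2022, "text": "only one"},
]

balanced = stratified_sample(records, per_stratum=2)
print(len(balanced))
```

In this sketch, strata with fewer texts than the cap are kept whole. This is one plausible way the "documented shortfalls" mentioned for `balance_report.json` could arise, though the card does not say how shortfalls were actually handled.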
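One of the intended uses listed in the card is analyzing sentiment differences between spelling variants. A minimal sketch of that comparison, using only the documented `variant`, `sentiment_score`, and `context_confidence` fields, might look like this. The helper name `mean_sentiment_by_variant`, the 0.8 confidence threshold, and the toy rows are assumptions made for illustration and do not come from the dataset.

```python
from collections import defaultdict


def mean_sentiment_by_variant(rows, min_confidence=0.8):
    """Average `sentiment_score` per `variant`, skipping rows whose
    `context_confidence` is below `min_confidence`.

    The 0.8 default is an arbitrary illustrative threshold.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row["context_confidence"] < min_confidence:
            continue  # drop low-confidence annotations
        totals[row["variant"]] += row["sentiment_score"]
        counts[row["variant"]] += 1
    return {variant: totals[variant] / counts[variant] for variant in counts}


# Toy rows with made-up values; real rows would come from the corpus.
rows = [
    {"variant": "ukrainian", "sentiment_score": 0.5, "context_confidence": 0.9},
    {"variant": "ukrainian", "sentiment_score": -0.5, "context_confidence": 0.95},
    {"variant": "russian", "sentiment_score": -0.25, "context_confidence": 0.85},
    {"variant": "russian", "sentiment_score": 1.0, "context_confidence": 0.3},
]

means = mean_sentiment_by_variant(rows)
print(means)
```

Because the labels were produced by a language model with human validation on a sample, filtering on annotation confidence before aggregating is a cautious default; the right threshold would depend on the analysis.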

