KvaytG/russian-telegram-chat-logs
Source: Hugging Face, updated 2026-04-21, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/KvaytG/russian-telegram-chat-logs
Description:
---
license: apache-2.0
language:
- ru
tags:
- telegram
- nlp
- text-cleaning
- semantic-analysis
- russian
- conversation
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: message
    dtype: string
  - name: semantic_score
    dtype: float32
  splits:
  - name: train
    num_examples: 6621
---
# russian-telegram-chat-logs
This dataset contains messages extracted from Telegram chat history, processed and ranked by their "semantic load" (information density).
## Dataset Overview
The data is stored in **Parquet** format, which is compact and preserves column types. Each row is a single message that has passed through the cleaning and scoring stages described below.
| Column | Type | Description |
|:-------------------|:--------|:---------------------------------------------------------------------------------|
| **message** | string | The cleaned text of the Telegram message (Cyrillic-only). |
| **semantic_score** | float32 | A normalized value (0.0 - 100.0) representing the message's information density. |
## Processing Pipeline
To ensure high data quality, the following steps were performed:
1. **Extraction**: Raw message logs were parsed from Telegram export files.
2. **Filtering**:
   * Removed all messages that did not contain Russian (Cyrillic) characters.
   * Removed messages consisting solely of emojis, special characters, or links.
3. **Cleaning**:
   * Stripped emojis and redundant symbols.
   * Normalized whitespace and removed system metadata (timestamps, sender names).
4. **Deduplication**: Identical messages were removed to ensure every entry in the dataset is unique.
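The filtering, cleaning, and deduplication steps above can be sketched roughly as follows. This is a minimal illustration and not the author's actual pipeline: the regular expressions, the set of punctuation that is kept, and the names `clean_message` and `build_corpus` are all assumptions made for the example.

```python
import re

CYRILLIC = re.compile(r"[А-Яа-яЁё]")
URL = re.compile(r"https?://\S+|www\.\S+")
# Assumption: "emojis and redundant symbols" is approximated by dropping
# everything that is not a word character, whitespace, or basic punctuation.
NOISE = re.compile(r"[^\w\s.,!?;:()-]")
WHITESPACE = re.compile(r"\s+")


def clean_message(text: str) -> str:
    """Strip links, emoji-like symbols, and extra whitespace from one message."""
    text = URL.sub(" ", text)
    text = NOISE.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def build_corpus(raw_messages):
    """Filter, clean, and deduplicate messages, keeping first-seen order."""
    seen = set()
    result = []
    for raw in raw_messages:
        cleaned = clean_message(raw)
        # Drop messages with no Cyrillic letters left (emoji-only, link-only, ...).
        if not CYRILLIC.search(cleaned):
            continue
        # Drop exact duplicates of a message that was already kept.
        if cleaned in seen:
            continue
        seen.add(cleaned)
        result.append(cleaned)
    return result
```

Cleaning runs before the Cyrillic check here, so a message that consists only of a link or emojis becomes empty and is dropped by the same test.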
## How Semantic Score is Calculated
The `semantic_score` is not just a measure of length, but a representation of **Semantic Density**. The calculation involves:
1. **Vectorization**: Each message is converted into a high-dimensional vector (embedding) using the `paraphrase-multilingual-MiniLM-L12-v2` Sentence-Transformer model, which is trained to capture context and semantic relationships between words.
2. **L2-Norm Calculation**: We calculate the magnitude (norm) of the embedding vector. Complex and unique sentences typically result in higher vector norms.
3. **Length Weighting**: To balance the score, we apply a logarithmic weight based on the character length of the message. This prevents long, repetitive sentences from dominating while ensuring that very short phrases (like "Ok") receive lower scores.
4. **Min-Max Scaling**: The final raw values are normalized to a **0.0 to 100.0** scale:
   * **100.0**: The most semantically dense message in the dataset.
   * **0.0**: The message with the least information density (e.g., simple interjections).
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("KvaytG/russian-telegram-chat-logs", split="train")
```
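Once loaded, the `semantic_score` column can be used to keep only the denser messages. The helper below is a small sketch that works on any iterable of row dictionaries with the two columns described above. The name `select_dense` and the default threshold of 50.0 are illustrative choices, not part of the dataset.

```python
def select_dense(rows, min_score=50.0):
    """Return message texts scoring at least `min_score`, densest first."""
    kept = [row for row in rows if row["semantic_score"] >= min_score]
    kept.sort(key=lambda row: row["semantic_score"], reverse=True)
    return [row["message"] for row in kept]
```

Iterating a `datasets.Dataset` yields one dictionary per row, so `select_dense(dataset, min_score=75.0)` should work directly on the object loaded above.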
## License
This dataset is released under the **Apache License 2.0**.
## Citation
```bibtex
@misc{kvaytg_russian_telegram_chat_logs,
author = {KvaytG},
title = {Russian Telegram Chat Logs: A Semantically Ranked Dataset},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Datasets},
url = {https://huggingface.co/datasets/KvaytG/russian-telegram-chat-logs},
note = {Processed Telegram messages with semantic density scoring using paraphrase-multilingual-MiniLM-L12-v2}
}
```
Provider:
KvaytG



