KvaytG/russian-telegram-chat-logs

Name: KvaytG/russian-telegram-chat-logs
Creator: KvaytG
Published: 2026-04-21 15:01:40
License: No description provided

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/KvaytG/russian-telegram-chat-logs

Resource description:

---
license: apache-2.0
language:
- ru
tags:
- telegram
- nlp
- text-cleaning
- semantic-analysis
- russian
- conversation
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: message
    dtype: string
  - name: semantic_score
    dtype: float32
  splits:
  - name: train
    num_examples: 6621
---

# russian-telegram-chat-logs

This dataset contains messages extracted from Telegram chat history, processed and ranked by their "semantic load" (information density).

## Dataset Overview

The data is stored in **Parquet** format, which provides efficient storage and maintains data types. Each row represents a single message that has passed through several stages of cleaning and analysis.

| Column             | Type    | Description                                                                      |
|:-------------------|:--------|:---------------------------------------------------------------------------------|
| **message**        | string  | The cleaned text of the Telegram message (Cyrillic-only).                        |
| **semantic_score** | float32 | A normalized value (0.0 - 100.0) representing the message's information density. |

## Processing Pipeline

To ensure high data quality, the following steps were performed:

1. **Extraction**: Raw message logs were parsed from Telegram export files.
2. **Filtering**:
   * Removed all messages that did not contain Russian (Cyrillic) characters.
   * Removed messages consisting solely of emojis, special characters, or links.
3. **Cleaning**:
   * Stripped emojis and redundant symbols.
   * Normalized whitespace and removed system metadata (timestamps, sender names).
4. **Deduplication**: Identical messages were removed to ensure every entry in the dataset is unique.

## How Semantic Score is Calculated

The `semantic_score` is not just a measure of length, but a representation of **Semantic Density**. The calculation involves:

1. **Vectorization**: Each message is converted into a high-dimensional vector (embedding) using the `paraphrase-multilingual-MiniLM-L12-v2` Sentence-Transformer model. This model understands context and semantic relationships between words.
2. **L2-Norm Calculation**: We calculate the magnitude (norm) of the embedding vector. Complex and unique sentences typically result in higher vector norms.
3. **Length Weighting**: To balance the score, we apply a logarithmic weight based on the character length of the message. This prevents long, repetitive sentences from dominating while ensuring that very short phrases (like "Ok") receive lower scores.
4. **Min-Max Scaling**: The final raw values are normalized to a **0 to 100%** scale:
   * **100.0**: The most semantically dense message in the dataset.
   * **0.0**: The message with the least information density (e.g., simple interjections).

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("KvaytG/russian-telegram-chat-logs", split="train")
```

## License

This dataset is released under the **Apache License 2.0**.

## Citation

```bibtex
@misc{kvaytg_russian_telegram_chat_logs,
  author    = {KvaytG},
  title     = {Russian Telegram Chat Logs: A Semantically Ranked Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/KvaytG/russian-telegram-chat-logs},
  note      = {Processed Telegram messages with semantic density scoring using paraphrase-multilingual-MiniLM-L12-v2}
}
```
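
The filtering, cleaning, and deduplication steps of the processing pipeline described above could look roughly like the following. This is a minimal sketch, not the author's actual code: the regular expressions, the set of punctuation that is kept, and the helper names `clean_message` and `process` are all assumptions made for illustration. The sketch cleans each message before checking for Cyrillic text, which gives the same result as filtering first, because a message made up only of emojis, symbols, or links has no Cyrillic characters left after cleaning.

```python
import re

# Assumption: "Russian (Cyrillic) characters" is approximated by the basic
# Russian alphabet, including the letters Ё/ё.
CYRILLIC = re.compile(r"[А-Яа-яЁё]")
# Assumption: links are anything starting with http(s):// or www.
URL = re.compile(r"https?://\S+|www\.\S+")
# Assumption: keep word characters, whitespace, and common punctuation;
# everything else (emojis, decorative symbols) is dropped.
DISALLOWED = re.compile(r"[^\w\s.,!?;:()«»-]")
WHITESPACE = re.compile(r"\s+")


def clean_message(text: str) -> str:
    """Strip links and stray symbols, then collapse runs of whitespace."""
    text = URL.sub(" ", text)
    text = DISALLOWED.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def process(messages):
    """Clean messages, drop those without Cyrillic text, remove duplicates."""
    seen = set()
    result = []
    for raw in messages:
        cleaned = clean_message(raw)
        if not CYRILLIC.search(cleaned):
            continue  # no Russian text left after cleaning
        if cleaned in seen:
            continue  # identical message already kept
        seen.add(cleaned)
        result.append(cleaned)
    return result


if __name__ == "__main__":
    sample = ["Привет, как дела? 😀", "hello world", "Привет,   как дела?"]
    print(process(sample))
```

Deduplication here compares cleaned text, so two raw messages that differ only in emojis or spacing collapse into one entry. Whether the original pipeline deduplicated before or after cleaning is not stated in the dataset card.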
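
The semantic score calculation described above (embedding norm, logarithmic length weight, min-max scaling) can be sketched as follows. The dataset card does not give the exact formula, so the way the norm and the length weight are combined here (`norm * log(1 + length)`) is an assumption, as are the function names `embedding_norms`, `raw_score`, `min_max_scale`, and `semantic_scores`. The embedding step needs the third-party `sentence-transformers` and `numpy` packages, so they are imported only inside `embedding_norms`.

```python
import math


def embedding_norms(messages):
    """L2 norm of each message embedding (downloads the model on first use)."""
    # Imported lazily so the scoring helpers below work without these packages.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = model.encode(messages)  # array of shape (n_messages, dim)
    return np.linalg.norm(embeddings, axis=1).tolist()


def raw_score(norm: float, length: int) -> float:
    """Assumed combination: embedding norm weighted by log(1 + character length)."""
    return norm * math.log1p(length)


def min_max_scale(values, low=0.0, high=100.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to low.
        return [low for _ in values]
    return [low + (v - lo) / (hi - lo) * (high - low) for v in values]


def semantic_scores(norms, lengths):
    """Combine per-message norms and character lengths into 0-100 scores."""
    raws = [raw_score(n, length) for n, length in zip(norms, lengths)]
    return min_max_scale(raws)
```

With real data, `semantic_scores(embedding_norms(messages), [len(m) for m in messages])` would produce one score per message. Because of the min-max step, the scores are relative to the set being scored and are not comparable across datasets.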

Provided by:

KvaytG
