KvaytG/russian-telegram-chat-logs

Source: Hugging Face. Updated: 2026-04-21. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KvaytG/russian-telegram-chat-logs
Resource description:
---
license: apache-2.0
language:
- ru
tags:
- telegram
- nlp
- text-cleaning
- semantic-analysis
- russian
- conversation
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: message
    dtype: string
  - name: semantic_score
    dtype: float32
  splits:
  - name: train
    num_examples: 6621
---

# russian-telegram-chat-logs

This dataset contains messages extracted from Telegram chat history, processed and ranked by their "semantic load" (information density).

## Dataset Overview

The data is stored in **Parquet** format, which provides efficient storage and maintains data types. Each row represents a single message that has passed through several stages of cleaning and analysis.

| Column             | Type    | Description                                                                      |
|:-------------------|:--------|:---------------------------------------------------------------------------------|
| **message**        | string  | The cleaned text of the Telegram message (Cyrillic-only).                        |
| **semantic_score** | float32 | A normalized value (0.0 - 100.0) representing the message's information density. |

## Processing Pipeline

To ensure high data quality, the following steps were performed:

1. **Extraction**: Raw message logs were parsed from Telegram export files.
2. **Filtering**:
   * Removed all messages that did not contain Russian (Cyrillic) characters.
   * Removed messages consisting solely of emojis, special characters, or links.
3. **Cleaning**:
   * Stripped emojis and redundant symbols.
   * Normalized whitespace and removed system metadata (timestamps, sender names).
4. **Deduplication**: Identical messages were removed to ensure every entry in the dataset is unique.

## How Semantic Score is Calculated

The `semantic_score` is not just a measure of length, but a representation of **Semantic Density**. The calculation involves:

1. **Vectorization**: Each message is converted into a high-dimensional vector (embedding) using the `paraphrase-multilingual-MiniLM-L12-v2` Sentence-Transformer model. This model understands context and semantic relationships between words.
2. **L2-Norm Calculation**: We calculate the magnitude (norm) of the embedding vector. Complex and unique sentences typically result in higher vector norms.
3. **Length Weighting**: To balance the score, we apply a logarithmic weight based on the character length of the message. This prevents long, repetitive sentences from dominating while ensuring that very short phrases (like "Ok") receive lower scores.
4. **Min-Max Scaling**: The final raw values are normalized to a **0 to 100%** scale:
   * **100.0**: The most semantically dense message in the dataset.
   * **0.0**: The message with the least information density (e.g., simple interjections).

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("KvaytG/russian-telegram-chat-logs", split="train")
```

## License

This dataset is released under the **Apache License 2.0**.

## Citation

```bibtex
@misc{kvaytg_russian_telegram_chat_logs,
  author    = {KvaytG},
  title     = {Russian Telegram Chat Logs: A Semantically Ranked Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/KvaytG/russian-telegram-chat-logs},
  note      = {Processed Telegram messages with semantic density scoring using paraphrase-multilingual-MiniLM-L12-v2}
}
```
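
The filtering, cleaning, and deduplication steps listed under the processing pipeline could be approximated as in the sketch below. The dataset card does not publish the actual cleaning code, so the regular expressions, the set of punctuation that is kept, and the function name `clean_messages` are all assumptions made for illustration.

```python
import re

# Assumption: "Russian (Cyrillic) characters" means the basic Russian alphabet.
CYRILLIC = re.compile(r"[А-Яа-яЁё]")
# Assumption: links are anything starting with http(s):// or www.
URL = re.compile(r"https?://\S+|www\.\S+")
# Assumption: keep letters, digits, whitespace and basic punctuation;
# everything else (emojis, decorative symbols) is replaced by a space.
DISALLOWED = re.compile(r"[^\w\s.,!?;:()-]")


def clean_messages(raw_messages):
    """Clean, filter, and deduplicate messages, preserving first-seen order."""
    seen = set()
    cleaned = []
    for text in raw_messages:
        text = URL.sub(" ", text)          # drop links
        text = DISALLOWED.sub(" ", text)   # drop emojis and stray symbols
        text = " ".join(text.split())      # normalize whitespace
        if not CYRILLIC.search(text):      # skip messages with no Cyrillic text
            continue
        if text in seen:                   # skip exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = ["Привет,   мир! 😀", "Привет, мир!", "hello world", "🔥🔥🔥"]
    print(clean_messages(sample))
```

Deduplication runs after cleaning in this sketch, so two messages that differ only in emojis or spacing collapse into one entry. Whether the original pipeline behaves the same way is not stated in the card.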
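
The semantic score steps (L2 norm, logarithmic length weighting, min-max scaling) might be combined roughly as follows. The card does not give the exact formula, so multiplying the norm by `log(1 + length)` is an assumption, as is the helper name `semantic_scores`. The embeddings are passed in as plain lists of floats so the sketch runs without downloading the model; in practice they could come from `SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode(messages)`.

```python
import math


def semantic_scores(messages, embeddings):
    """Return scores on a 0-100 scale from messages and their embedding vectors.

    Assumed raw score: L2 norm of the embedding times log(1 + character length).
    The dataset card names these ingredients but not how they are combined.
    """
    raw = []
    for text, vec in zip(messages, embeddings):
        norm = math.sqrt(sum(x * x for x in vec))  # L2 norm of the embedding
        weight = math.log1p(len(text))             # assumed logarithmic length weight
        raw.append(norm * weight)

    if not raw:
        return []

    lo, hi = min(raw), max(raw)
    if hi == lo:
        # All messages have the same raw value; min-max scaling is undefined,
        # so this sketch returns 0.0 for every message.
        return [0.0 for _ in raw]

    # Min-max scaling to the 0-100 range.
    return [100.0 * (r - lo) / (hi - lo) for r in raw]


if __name__ == "__main__":
    msgs = ["Ок", "Привет, как дела?"]
    toy_embeddings = [[3.0, 4.0], [6.0, 8.0]]  # toy vectors, not real model output
    print(semantic_scores(msgs, toy_embeddings))
```

Because the scaling uses the minimum and maximum over the whole collection, a score is only meaningful relative to the other messages scored in the same run.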
Provider:
KvaytG