ScoutieAutoML/russian-news-telegram-dataset

Name: ScoutieAutoML/russian-news-telegram-dataset
Creator: ScoutieAutoML
Published: 2024-11-19 12:40:56
License: no description available

Source: Hugging Face. Updated: 2024-11-19. Indexed: 2025-04-26.

Download link:

https://hf-mirror.com/datasets/ScoutieAutoML/russian-news-telegram-dataset


Resource overview:

---
task_categories:
- text-classification
- feature-extraction
language:
- ru
tags:
- russia
- media
- news
pretty_name: Russian News and Media Telegram Dataset
size_categories:
- 10K<n<100K
---

## Description in English:

This dataset was gathered from 30 Russian-language Telegram news channels on the topic of News and Media. It was collected and labeled automatically using the **Scoutie** data collection and labeling [service](https://scoutie.ru/).\
Try Scoutie and collect the same or another dataset using the [link](https://scoutie.ru/).

## Dataset fields:

**taskId** - task identifier in the Scoutie service.\
**text** - main text.\
**url** - link to the publication.\
**sourceLink** - link to Telegram.\
**subSourceLink** - link to the channel.\
**views** - number of views of the text.\
**likes** - number of reactions; empty for this dataset.\
**createTime** - publication date in unix time format.\
**createTime** - date the publication was collected, in unix time format.\
**clusterId** - cluster id.\
**vector** - text embedding (its vector representation).\
**ners** - array of identified named entities, where lemma is the lemmatized form of the word, label is the name of the tag, start_pos is the starting position of the entity in the text, and end_pos is the ending position of the entity in the text.\
**sentiment** - sentiment of the text: POSITIVE, NEGATIVE, NEUTRAL.\
**language** - language of the text: RUS, ENG.\
**spam** - whether the text is classified as advertising: NOT_SPAM means no advertising, SPAM means the text is marked as advertising.\
**length** - number of tokens (words) in the text.\
**markedUp** - whether the text has been labeled within the Scoutie service; takes the value true or false.
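As an illustration of how the label and timestamp fields described above might be used, here is a minimal sketch that filters out advertising and converts `createTime` to a readable date. The sample records are invented, the helper names `to_datetime` and `keep_clean_russian` are hypothetical, and the code assumes that `createTime` is expressed in seconds (the card only says "unix time format").

```python
from datetime import datetime, timezone

# Hypothetical records shaped like the fields described above; all values are invented.
records = [
    {"text": "Пример новости", "views": 1200, "createTime": 1700000000,
     "sentiment": "NEUTRAL", "language": "RUS", "spam": "NOT_SPAM", "length": 2},
    {"text": "Купите сейчас!", "views": 50, "createTime": 1700003600,
     "sentiment": "POSITIVE", "language": "RUS", "spam": "SPAM", "length": 2},
]


def to_datetime(unix_seconds):
    """Convert a unix timestamp (assumed to be in seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def keep_clean_russian(rows):
    """Keep rows labeled as Russian-language and not marked as advertising."""
    return [r for r in rows if r["language"] == "RUS" and r["spam"] == "NOT_SPAM"]


clean = keep_clean_russian(records)
published = [to_datetime(r["createTime"]).isoformat() for r in clean]
print(published)
```

With the Hugging Face `datasets` library, real rows could presumably be obtained through `load_dataset("ScoutieAutoML/russian-news-telegram-dataset")`, assuming the repository loads with default settings. If the timestamps turn out to be in milliseconds, divide by 1000 before converting.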

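The `ners` and `vector` fields lend themselves to a short worked example. The sketch below recovers the surface form of each entity from its offsets and compares two embeddings by cosine similarity. The record is invented, the `LOC` label is a hypothetical tag name, and the code assumes that `start_pos` and `end_pos` are character offsets with an exclusive end, which the card does not specify.

```python
import math

# Hypothetical record; the layout follows the ners and vector descriptions, values are invented.
record = {
    "text": "Москва приняла саммит",
    "ners": [{"lemma": "москва", "label": "LOC", "start_pos": 0, "end_pos": 6}],
    "vector": [1.0, 0.0, 1.0],
}


def entity_surface_forms(row):
    """Return (surface text, label) pairs, assuming character offsets with an exclusive end."""
    return [(row["text"][e["start_pos"]:e["end_pos"]], e["label"]) for e in row["ners"]]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


print(entity_surface_forms(record))
print(cosine_similarity(record["vector"], [1.0, 0.0, 1.0]))
```

If the extracted surface forms look shifted by one character on real data, the offsets are probably inclusive or token-based, and the slicing should be adjusted accordingly.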

Provider:

ScoutieAutoML
