TopicNet/Lenta
Hugging Face | Updated: 2024-03-18 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/TopicNet/Lenta
Resource description:
---
language:
- ru
multilinguality:
- monolingual
license: other
license_name: topicnet
license_link: >-
https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/LICENSE.txt
task_categories:
- text-classification
task_ids:
- topic-classification
- multi-class-classification
- multi-label-classification
tags:
- topic-modeling
- topic-modelling
- text-clustering
- multimodal-data
- multimodal-learning
- modalities
- document-representation
---
# Lenta
Some measurable characteristics of the dataset:
* D — number of documents
* <modality name> W — modality dictionary size (number of unique tokens)
* <modality name> len D — average document length in modality tokens (number of tokens)
* <modality name> len D uniq — average document length in unique modality tokens (number of unique tokens)
| | D | @topmine W | @topmine len D | @topmine len D uniq | @time_n W | @time_n len D | @time_n len D uniq | @lemmatized_title W | @lemmatized_title len D | @lemmatized_title len D uniq | @lemmatized W | @lemmatized len D | @lemmatized len D uniq | @theme W | @theme len D | @theme len D uniq |
|:------|------------:|--------------------:|------------------------:|-----------------------------:|-------------------:|-----------------------:|----------------------------:|-----------------------------:|---------------------------------:|--------------------------------------:|-----------------------:|---------------------------:|--------------------------------:|------------------:|----------------------:|---------------------------:|
| value | 263557 | 2.32892e+07 | 88.365 | 83.8258 | 263557 | 1 | 1 | 2.05546e+06 | 7.79894 | 7.72848 | 2.90254e+07 | 110.13 | 84.5878 | 383816 | 1.45629 | 1.45629 |
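The W column can be cross-checked against the other columns with simple arithmetic. In the sketch below, the numbers are copied from the table above (rounded to six significant figures there), and the helper name `relative_gap` is introduced only for illustration. For every modality, the product D × len D reproduces the reported W to within rounding. This is an observation from the numbers, not a statement from the dataset authors, but it suggests that W as reported here may count all token occurrences rather than distinct tokens.

```python
# Values copied from the characteristics table; W and len D are rounded there.
D = 263557  # number of documents

modalities = {
    "@topmine": {"W": 2.32892e07, "len_D": 88.365},
    "@time_n": {"W": 263557, "len_D": 1},
    "@lemmatized_title": {"W": 2.05546e06, "len_D": 7.79894},
    "@lemmatized": {"W": 2.90254e07, "len_D": 110.13},
    "@theme": {"W": 383816, "len_D": 1.45629},
}


def relative_gap(reported_w, n_docs, mean_len):
    """Relative difference between the reported W and n_docs * mean_len."""
    implied_total = n_docs * mean_len
    return abs(reported_w - implied_total) / reported_w


for name, stats in modalities.items():
    implied = D * stats["len_D"]
    gap = relative_gap(stats["W"], D, stats["len_D"])
    print(f"{name}: implied total tokens = {implied:.0f}, relative gap = {gap:.1e}")
```

A relative gap far below one percent for every modality is what one would expect if W were a total token count; a true dictionary size would normally be much smaller than D × len D.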
Information about document lengths in modality tokens:
| | len_total@topmine | len_total@time_n | len_total@lemmatized_title | len_total@lemmatized | len_total@theme | len_uniq@topmine | len_uniq@time_n | len_uniq@lemmatized_title | len_uniq@lemmatized | len_uniq@theme |
|:-----|--------------------:|-------------------:|-----------------------------:|-----------------------:|------------------:|-------------------:|------------------:|----------------------------:|----------------------:|-----------------:|
| mean | 88.365 | 1 | 7.79894 | 110.13 | 1.45629 | 83.8258 | 1 | 7.72848 | 84.5878 | 1.45629 |
| std | 50.2072 | 0 | 1.86916 | 39.7804 | 0.722741 | 47.5763 | 0 | 1.81461 | 26.7959 | 0.722741 |
| min | 1 | 1 | 1 | 7 | 1 | 1 | 1 | 1 | 7 | 1 |
| 25% | 54 | 1 | 6 | 83 | 1 | 51 | 1 | 6 | 66 | 1 |
| 50% | 77 | 1 | 8 | 104 | 1 | 73 | 1 | 8 | 81 | 1 |
| 75% | 110 | 1 | 9 | 131 | 2 | 104 | 1 | 9 | 99 | 2 |
| max | 791 | 1 | 17 | 1000 | 3 | 647 | 1 | 16 | 542 | 3 |
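The rows of the table above (mean, std, min, quartiles, max) are the usual per-column summary statistics. As a minimal sketch of how such figures could be derived for one modality, the code below computes `len_total` (number of tokens) and `len_uniq` (number of distinct tokens) per document and summarizes them using only the Python standard library. The toy documents and the function name `length_stats` are hypothetical, the dataset authors' actual tooling is not described here, and quartile interpolation conventions may differ between tools.

```python
from statistics import mean, quantiles, stdev


def summarize(values):
    """Summary statistics in the same row order as the table."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(values),
    }


def length_stats(docs):
    """docs: list of token lists for a single modality."""
    len_total = [len(tokens) for tokens in docs]
    len_uniq = [len(set(tokens)) for tokens in docs]
    return {"len_total": summarize(len_total), "len_uniq": summarize(len_uniq)}


# Hypothetical toy documents, not taken from the dataset.
toy_docs = [["a", "b", "a"], ["c"], ["d", "e", "f", "d", "d"]]
stats = length_stats(toy_docs)
for column, summary in stats.items():
    print(column, summary)
```

When `len_total` and `len_uniq` agree for a modality, as they do for `@time_n` and `@theme` in the table, no token is repeated within a document in that modality.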
Provider:
TopicNet
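Summary of original information
Dataset overview
Basic information
- Language: Russian
- Multilinguality: monolingual
- License: other (topicnet)
- Task categories: text classification
- Task IDs: topic classification, multi-class classification, multi-label classification
- Tags: topic modeling, text clustering, multimodal data, multimodal learning, modalities, document representation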
Dataset characteristics
- Number of documents (D): 263557
- Dictionary size per modality (W):
  - @topmine: 2.32892e+07
  - @time_n: 263557
  - @lemmatized_title: 2.05546e+06
  - @lemmatized: 2.90254e+07
  - @theme: 383816
- Average document length per modality, in tokens (len D):
  - @topmine: 88.365
  - @time_n: 1
  - @lemmatized_title: 7.79894
  - @lemmatized: 110.13
  - @theme: 1.45629
- Average document length per modality, in unique tokens (len D uniq):
  - @topmine: 83.8258
  - @time_n: 1
  - @lemmatized_title: 7.72848
  - @lemmatized: 84.5878
  - @theme: 1.45629
Document length statistics
- Mean total length per modality (len_total):
  - @topmine: 88.365
  - @time_n: 1
  - @lemmatized_title: 7.79894
  - @lemmatized: 110.13
  - @theme: 1.45629
- Mean length in unique tokens per modality (len_uniq):
  - @topmine: 83.8258
  - @time_n: 1
  - @lemmatized_title: 7.72848
  - @lemmatized: 84.5878
  - @theme: 1.45629
Statistical details (len_total)
- Mean:
  - @topmine: 88.365
  - @time_n: 1
  - @lemmatized_title: 7.79894
  - @lemmatized: 110.13
  - @theme: 1.45629
- Standard deviation (std):
  - @topmine: 50.2072
  - @time_n: 0
  - @lemmatized_title: 1.86916
  - @lemmatized: 39.7804
  - @theme: 0.722741
- Minimum (min):
  - @topmine: 1
  - @time_n: 1
  - @lemmatized_title: 1
  - @lemmatized: 7
  - @theme: 1
- 25th percentile (25%):
  - @topmine: 54
  - @time_n: 1
  - @lemmatized_title: 6
  - @lemmatized: 83
  - @theme: 1
- 50th percentile (50%):
  - @topmine: 77
  - @time_n: 1
  - @lemmatized_title: 8
  - @lemmatized: 104
  - @theme: 1
- 75th percentile (75%):
  - @topmine: 110
  - @time_n: 1
  - @lemmatized_title: 9
  - @lemmatized: 131
  - @theme: 2
- Maximum (max):
  - @topmine: 791
  - @time_n: 1
  - @lemmatized_title: 17
  - @lemmatized: 1000
  - @theme: 3



