
TopicNet/Lenta

Source: Hugging Face. Updated 2024-03-18. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/TopicNet/Lenta
Resource description:
---
language:
- ru
multilinguality:
- monolingual
license: other
license_name: topicnet
license_link: >-
  https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/LICENSE.txt
task_categories:
- text-classification
task_ids:
- topic-classification
- multi-class-classification
- multi-label-classification
tags:
- topic-modeling
- topic-modelling
- text-clustering
- multimodal-data
- multimodal-learning
- modalities
- document-representation
---

# Lenta

Some measurable characteristics of the dataset:

* D: number of documents
* <modality name> W: modality dictionary size (number of unique tokens)
* <modality name> len D: average document length in modality tokens (number of tokens)
* <modality name> len D uniq: average document length in unique modality tokens (number of unique tokens)

| | D | @topmine W | @topmine len D | @topmine len D uniq | @time_n W | @time_n len D | @time_n len D uniq | @lemmatized_title W | @lemmatized_title len D | @lemmatized_title len D uniq | @lemmatized W | @lemmatized len D | @lemmatized len D uniq | @theme W | @theme len D | @theme len D uniq |
|:------|------------:|--------------------:|------------------------:|-----------------------------:|-------------------:|-----------------------:|----------------------------:|-----------------------------:|---------------------------------:|--------------------------------------:|-----------------------:|---------------------------:|--------------------------------:|------------------:|----------------------:|---------------------------:|
| value | 263557 | 2.32892e+07 | 88.365 | 83.8258 | 263557 | 1 | 1 | 2.05546e+06 | 7.79894 | 7.72848 | 2.90254e+07 | 110.13 | 84.5878 | 383816 | 1.45629 | 1.45629 |

Information about document lengths in modality tokens:

| | len_total@topmine | len_total@time_n | len_total@lemmatized_title | len_total@lemmatized | len_total@theme | len_uniq@topmine | len_uniq@time_n | len_uniq@lemmatized_title | len_uniq@lemmatized | len_uniq@theme |
|:-----|--------------------:|-------------------:|-----------------------------:|-----------------------:|------------------:|-------------------:|------------------:|----------------------------:|----------------------:|-----------------:|
| mean | 88.365 | 1 | 7.79894 | 110.13 | 1.45629 | 83.8258 | 1 | 7.72848 | 84.5878 | 1.45629 |
| std | 50.2072 | 0 | 1.86916 | 39.7804 | 0.722741 | 47.5763 | 0 | 1.81461 | 26.7959 | 0.722741 |
| min | 1 | 1 | 1 | 7 | 1 | 1 | 1 | 1 | 7 | 1 |
| 25% | 54 | 1 | 6 | 83 | 1 | 51 | 1 | 6 | 66 | 1 |
| 50% | 77 | 1 | 8 | 104 | 1 | 73 | 1 | 8 | 81 | 1 |
| 75% | 110 | 1 | 9 | 131 | 2 | 104 | 1 | 9 | 99 | 2 |
| max | 791 | 1 | 17 | 1000 | 3 | 647 | 1 | 16 | 542 | 3 |
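The four characteristics defined in the card (D, W, len D, len D uniq) can be illustrated on a toy corpus. The following is a minimal sketch, assuming each document in a modality is already available as a list of tokens; the function name `modality_characteristics` and the toy data are hypothetical and are not taken from the TopicNet code base.

```python
def modality_characteristics(docs):
    """Compute D, W, len D and len D uniq for one modality.

    docs: list of documents, each a list of modality tokens (assumed input format).
    """
    if not docs:
        raise ValueError("at least one document is required")

    vocabulary = set()   # unique tokens across the whole collection
    total_tokens = 0     # sum of document lengths in tokens
    total_unique = 0     # sum of per-document unique token counts

    for tokens in docs:
        vocabulary.update(tokens)
        total_tokens += len(tokens)
        total_unique += len(set(tokens))

    d = len(docs)
    return {
        "D": d,                           # number of documents
        "W": len(vocabulary),             # dictionary size
        "len D": total_tokens / d,        # average length in tokens
        "len D uniq": total_unique / d,   # average length in unique tokens
    }


# Toy corpus with three documents (illustrative only).
toy_docs = [
    ["a", "b", "a"],
    ["b", "c"],
    ["c", "c", "c", "d"],
]
stats = modality_characteristics(toy_docs)
print(stats)
```

For a modality such as `@time_n`, where every document carries exactly one token, both averages come out as 1, which matches the values reported in the tables.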
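Provider:
TopicNet
Summary of original information

Dataset overview

Basic information

  • Language: Russian
  • Multilinguality: monolingual
  • License: other (topicnet)
  • Task category: text classification
  • Task IDs: topic classification, multi-class classification, multi-label classification
  • Tags: topic modeling, text clustering, multimodal data, multimodal learning, modalities, document representation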

Dataset characteristics

  • Number of documents (D): 263557
  • Dictionary size per modality (W):
    • @topmine: 2.32892e+07
    • @time_n: 263557
    • @lemmatized_title: 2.05546e+06
    • @lemmatized: 2.90254e+07
    • @theme: 383816
  • Average document length per modality, in tokens (len D):
    • @topmine: 88.365
    • @time_n: 1
    • @lemmatized_title: 7.79894
    • @lemmatized: 110.13
    • @theme: 1.45629
  • Average document length per modality, in unique tokens (len D uniq):
    • @topmine: 83.8258
    • @time_n: 1
    • @lemmatized_title: 7.72848
    • @lemmatized: 84.5878
    • @theme: 1.45629
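A quick arithmetic check on the figures above: for every modality, the reported W is very close to D multiplied by len D. That product is the total number of tokens in the collection, so the W column may be reporting total token counts rather than unique-token counts. This is only an observation from the published numbers, not something the dataset authors state. The sketch below uses the values listed above; the variable names are illustrative.

```python
# Values copied from the dataset characteristics listed above.
D = 263557
reported = {
    # modality: (reported W, reported len D)
    "@topmine": (2.32892e+07, 88.365),
    "@time_n": (263557, 1),
    "@lemmatized_title": (2.05546e+06, 7.79894),
    "@lemmatized": (2.90254e+07, 110.13),
    "@theme": (383816, 1.45629),
}

# Relative gap between the reported W and the product D * len D.
relative_gaps = {}
for modality, (w, len_d) in reported.items():
    total_tokens = D * len_d
    relative_gaps[modality] = abs(total_tokens - w) / w

for modality, gap in relative_gaps.items():
    print(f"{modality}: relative gap between W and D * len D = {gap:.2e}")
```

All gaps are within the rounding of the six significant digits shown in the card, so anyone who needs the true vocabulary size should recompute it from the data rather than rely on W.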

Document length statistics

  • Mean total length per modality, in tokens (len_total):
    • @topmine: 88.365
    • @time_n: 1
    • @lemmatized_title: 7.79894
    • @lemmatized: 110.13
    • @theme: 1.45629
  • Mean length per modality, in unique tokens (len_uniq):
    • @topmine: 83.8258
    • @time_n: 1
    • @lemmatized_title: 7.72848
    • @lemmatized: 84.5878
    • @theme: 1.45629

Statistical details (len_total, in modality tokens)

  • Mean (mean):
    • @topmine: 88.365
    • @time_n: 1
    • @lemmatized_title: 7.79894
    • @lemmatized: 110.13
    • @theme: 1.45629
  • Standard deviation (std):
    • @topmine: 50.2072
    • @time_n: 0
    • @lemmatized_title: 1.86916
    • @lemmatized: 39.7804
    • @theme: 0.722741
  • Minimum (min):
    • @topmine: 1
    • @time_n: 1
    • @lemmatized_title: 1
    • @lemmatized: 7
    • @theme: 1
  • 25th percentile (25%):
    • @topmine: 54
    • @time_n: 1
    • @lemmatized_title: 6
    • @lemmatized: 83
    • @theme: 1
  • 50th percentile (50%):
    • @topmine: 77
    • @time_n: 1
    • @lemmatized_title: 8
    • @lemmatized: 104
    • @theme: 1
  • 75th percentile (75%):
    • @topmine: 110
    • @time_n: 1
    • @lemmatized_title: 9
    • @lemmatized: 131
    • @theme: 2
  • Maximum (max):
    • @topmine: 791
    • @time_n: 1
    • @lemmatized_title: 17
    • @lemmatized: 1000
    • @theme: 3
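The summary rows above (mean, std, min, quartiles, max) can be reproduced for any list of per-document lengths. The source does not say how the figures were computed; the sketch below assumes a sample standard deviation and linearly interpolated quartiles, which is the convention of common dataframe summaries. It uses only the Python standard library, and the helper name `describe_lengths` and the toy input are hypothetical.

```python
import statistics


def describe_lengths(lengths):
    """Summarize per-document lengths with the same rows as the table above."""
    # method="inclusive" interpolates linearly between the observed values,
    # treating the data as the full population range (assumed convention).
    q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "mean": statistics.mean(lengths),
        "std": statistics.stdev(lengths),   # sample standard deviation
        "min": min(lengths),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(lengths),
    }


# Toy document lengths (illustrative only, not taken from the dataset).
toy_lengths = [1, 2, 3, 4, 5]
summary = describe_lengths(toy_lengths)
print(summary)
```

Applied to a modality where every document has exactly one token, as with `@time_n`, this yields a standard deviation of 0 and identical values for every other row.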