BroDeadlines/TEST.TDT.edu_tdt_proposition_data
Hugging Face | Updated 2024-07-02 | Indexed 2024-07-06
Download link:
https://hf-mirror.com/datasets/BroDeadlines/TEST.TDT.edu_tdt_proposition_data
Summary:
The dataset contains several features: content, URL, document ID, shards, splits, propositions, and a proposition list. It is divided into two main splits, INDEX.medium_index_TDT and INDEX.medium_index_TDT_clean, each with its own byte size and example count. The download size is 10047733 bytes and the total dataset size is 58368578 bytes. The configuration section gives the data file path for each split. The page also records the processing methods and error IDs for each index.
Provider:
BroDeadlines
Summary of original information
Dataset overview
Dataset information
- Features:
  - content: string
  - url: string
  - doc_id: string
  - shards: int64
  - splits: sequence of string
  - split: sequence of string
  - propositions: sequence of string
  - proposition_list: sequence of string
- Splits:
  - INDEX.medium_index_TDT:
    - Bytes: 29192895
    - Examples: 344
  - INDEX.medium_index_TDT_clean:
    - Bytes: 29175683
    - Examples: 344
- Download size: 10047733 bytes
- Dataset size: 58368578 bytes
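The size figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed on this page: it confirms that the two split sizes add up to the reported dataset size and computes the ratio between dataset size and download size. The variable names are illustrative and are not part of the dataset.

```python
# Byte counts copied from the dataset information above.
split_bytes = {
    "INDEX.medium_index_TDT": 29192895,
    "INDEX.medium_index_TDT_clean": 29175683,
}
dataset_size = 58368578
download_size = 10047733

# The dataset size should equal the sum of the per-split sizes.
total = sum(split_bytes.values())
matches = total == dataset_size

# Ratio of the dataset size to the download size.
ratio = dataset_size / download_size

print(f"sum of splits: {total} bytes (matches dataset size: {matches})")
print(f"dataset size / download size: {ratio:.2f}")
```

The per-split sizes are consistent with the reported total, and the dataset size is several times larger than the download size.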
Configuration
- Config name: default
- Data files:
  - split: INDEX.medium_index_TDT
    path: data/INDEX.medium_index_TDT-*
  - split: INDEX.medium_index_TDT_clean
    path: data/INDEX.medium_index_TDT_clean-*
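Each split in the configuration is tied to a glob pattern under data/. The sketch below approximates that mapping with Python's standard fnmatch module; it is not the resolver Hugging Face uses, and the shard file names are hypothetical because this page does not list the actual files. It shows that the two patterns do not overlap: the pattern for INDEX.medium_index_TDT requires a hyphen directly after "TDT", so it does not capture the "_clean" files.

```python
from fnmatch import fnmatchcase

# Split-to-pattern mapping copied from the configuration above.
DATA_FILES = {
    "INDEX.medium_index_TDT": "data/INDEX.medium_index_TDT-*",
    "INDEX.medium_index_TDT_clean": "data/INDEX.medium_index_TDT_clean-*",
}


def split_for(path):
    """Return the split whose pattern matches path, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical file names, for illustration only.
examples = [
    "data/INDEX.medium_index_TDT-00000-of-00001.parquet",
    "data/INDEX.medium_index_TDT_clean-00000-of-00001.parquet",
    "data/README.md",
]
for path in examples:
    print(path, "->", split_for(path))
```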
Other information
- propositon_medium_edu_tdt:
  - vector_index: vec-propositon_medium_edu_tdt
  - text_index: text-propositon_medium_edu_tdt
  - method: ["split", "proposition"]
  - step: 50
  - chunk_size: 400
  - time(min): 4.36
  - errors: ["3369b8d5-1b47-11ef-a755-d38426455a06", "ebe87ce2-13cc-11ef-b548-0242ac1c000c"]
- INDEX.medium_index_TDT_clean:
  - vector_index: vec-sentence-index.medium_index_tdt_clean
  - text_index: text-sentence-index.medium_index_tdt_clean
  - method: ["fulltext", "clean", "proposition"]
  - errors: ["ebe87ce2-13cc-11ef-b548-0242ac1c000c"]
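The errors lists hold identifiers in UUID format, but the page does not say what they refer to. The sketch below assumes, purely for illustration, that they are doc_id values of documents that failed during processing, and shows how such records could be excluded. The sample records and the helper names is_uuid and drop_failed are invented for this example.

```python
import uuid

# Error IDs copied from the entries above.
ERRORS = {
    "propositon_medium_edu_tdt": [
        "3369b8d5-1b47-11ef-a755-d38426455a06",
        "ebe87ce2-13cc-11ef-b548-0242ac1c000c",
    ],
    "INDEX.medium_index_TDT_clean": [
        "ebe87ce2-13cc-11ef-b548-0242ac1c000c",
    ],
}


def is_uuid(value):
    """Return True if value parses as a UUID string."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def drop_failed(records, index_name):
    """Keep records whose doc_id is not in the error list (assumed meaning)."""
    failed = set(ERRORS[index_name])
    return [r for r in records if r["doc_id"] not in failed]


# Invented sample records; only the first doc_id appears in an error list.
records = [
    {"doc_id": "ebe87ce2-13cc-11ef-b548-0242ac1c000c", "content": "placeholder A"},
    {"doc_id": "00000000-0000-0000-0000-000000000000", "content": "placeholder B"},
]
kept = drop_failed(records, "INDEX.medium_index_TDT_clean")
print([r["content"] for r in kept])
```

If the identifiers turn out to refer to something other than doc_id, only the key used in drop_failed would need to change.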



