yapeichang/WebOrganizer-format-topic-merged-Llama-3.1-8B
Source: Hugging Face. Updated 2025-07-14; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/yapeichang/WebOrganizer-format-topic-merged-Llama-3.1-8B
Description:
The dataset contains the following fields: text content, URL, format selection probabilities, format selection index, format selection label, metadata, topic selection probabilities, topic selection index, and topic selection label. The metadata holds details of each web record, such as content length, content type, and WARC record identifier. The dataset provides a training split of one million examples, with a total size of 4.44 GB.
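As a rough illustration of how the fields in the description relate to one another, the sketch below summarizes the format and topic annotations of a single record. The field names (`format_choice_probs`, `format_choice_index`, `format_choice_label`, and their `topic_` counterparts), the sample values, and the idea that the index points at the highest probability are all assumptions made for this example; the listing does not state them, so check the dataset card for the actual schema.

```python
from typing import Any, Dict, List


def argmax(values: List[float]) -> int:
    """Return the position of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def summarize(record: Dict[str, Any], kind: str) -> Dict[str, Any]:
    """Summarize the 'format' or 'topic' annotation of one record.

    Field names are hypothetical and follow the pattern
    '<kind>_choice_probs', '<kind>_choice_index', '<kind>_choice_label'.
    """
    probs = record[f"{kind}_choice_probs"]
    index = record[f"{kind}_choice_index"]
    return {
        "label": record[f"{kind}_choice_label"],
        "index": index,
        "confidence": probs[index],
        # Assumption: the stored index is the most probable choice.
        "index_is_argmax": argmax(probs) == index,
    }


# Illustrative record with made-up values, not taken from the dataset.
sample_record = {
    "text": "Example page text.",
    "url": "https://example.com/page",
    "format_choice_probs": [0.1, 0.7, 0.2],
    "format_choice_index": 1,
    "format_choice_label": "example-format",
    "topic_choice_probs": [0.6, 0.3, 0.1],
    "topic_choice_index": 0,
    "topic_choice_label": "example-topic",
    "metadata": {"Content-Length": "1234", "Content-Type": "text/plain"},
}

format_summary = summarize(sample_record, "format")
topic_summary = summarize(sample_record, "topic")
print(format_summary)
print(topic_summary)
```

With the Hugging Face `datasets` library installed, real records could be streamed with `load_dataset("yapeichang/WebOrganizer-format-topic-merged-Llama-3.1-8B", split="train", streaming=True)` and passed to `summarize`, once the field names have been adjusted to match the published schema.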
Provider:
yapeichang



