Heterogeneous Information Network Short-Text Modeling Dataset

Name: Heterogeneous Information Network Short-Text Modeling Dataset
Creator: Anhui University
License: Not specified

Indexed by the National Basic Science Data Center on 2024-03-05

Download link:

https://www.nbsdc.cn/general/dataDetail?id=64edc893bb16e07753c3545b&type=1

Resource description:

This dataset is built from three source datasets: Stackoverflow, Tweet, and Biomedicine. It contains short texts and their corresponding labels, and is intended for modeling multi-source heterogeneous short-text data, constructing user profiles, and predicting user needs. It is divided by application scenario into three subsets. The Stackoverflow subset contains the short title texts of questions collected from that platform, with their labels. The Tweet subset contains user tweets collected from that platform, with their labels. The Biomedicine subset contains medical short texts collected from the corresponding platform, with their labels. After processing, the complete dataset contains 23,920 entries in total, stored in .txt format.
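Since the description only says that the short texts and labels are stored as .txt files, the following is a minimal loading sketch rather than the dataset's documented layout. It assumes, hypothetically, that each subset ships one text file and one label file aligned line by line; the function names `load_subset` and `label_counts` and any file names passed to them are illustrative and do not come from the dataset itself.

```python
from collections import Counter
from pathlib import Path


def load_subset(text_path, label_path, encoding="utf-8"):
    """Pair each short text with its label.

    Assumption (not confirmed by the dataset page): the two files are
    line-aligned, i.e. line i of the label file labels line i of the text file.
    """
    texts = Path(text_path).read_text(encoding=encoding).splitlines()
    labels = Path(label_path).read_text(encoding=encoding).splitlines()
    if len(texts) != len(labels):
        raise ValueError(
            f"{text_path} has {len(texts)} lines but "
            f"{label_path} has {len(labels)}"
        )
    # Strip surrounding whitespace and drop pairs whose text is empty.
    pairs = [(t.strip(), l.strip()) for t, l in zip(texts, labels)]
    return [(t, l) for t, l in pairs if t]


def label_counts(pairs):
    """Count how many short texts fall under each label."""
    return Counter(label for _, label in pairs)
```

If the downloaded files follow a different layout (for example, text and label on the same line separated by a tab), only `load_subset` would need to change; `label_counts` works on any list of `(text, label)` pairs.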

Provider:

Anhui University

Background overview

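The dataset combines short texts and labels from three sources (Stackoverflow, Tweet, and Biomedicine) and is intended for building user profiles and predicting user needs through heterogeneous information network modeling. It contains 23,920 processed text entries, divided into three subsets that correspond to different application scenarios.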

The summary above was collected and generated by 遇见数据集.