Deep Learning Text Dataset

Name: Deep Learning Text Dataset
Creator: Beijing University of Posts and Telecommunications
License: No description available

Indexed by the National Basic Science Data Center (国家基础学科公共科学数据中心) on 2024-03-05

Download link:

https://www.nbsdc.cn/general/dataDetail?id=64ef851abb16e0591d025661&type=1


Resource description:

To address the pressing need for intelligent recognition and automatic extraction of technical terms in frontier science and technology, the research team proposed a Chinese technical-term extraction model that combines BiLSTM and CRF. The deep learning text dataset was produced during model development and serves as supporting data for the results of Project 2 (课题二). The dataset has two parts: (1) the raw text corpus drawn from Chapters 1 to 3 of the e-book *Deep Learning 500 Questions* (《深度学习500问》), 48,374 characters in total, with data sizes of 36 KB, 60 KB, and 63 KB respectively; (2) 1,278 manually annotated sentences, with data sizes of 110 KB and 116 KB respectively. Each record has the fields id (corpus item number), text (corpus text content), and annotations (annotation details).
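The description names the record fields id, text, and annotations but does not state the file format or the internal layout of annotations. As a rough sketch only, the example below assumes the annotated data is stored as JSON Lines and that each annotation is a character span with hypothetical keys start, end (exclusive), and label; the label name TERM and the sample sentence are also invented for illustration. It checks that a record carries the three documented fields and turns the spans into per-character BIO tags, the kind of input a BiLSTM-CRF sequence labeler typically consumes.

```python
import json
from typing import Dict, List

# Hypothetical sample in JSON Lines form. Only the field names id, text and
# annotations come from the dataset description; the span keys
# (start, end, label), the label "TERM" and the sentence are assumptions.
SAMPLE_JSONL = (
    '{"id": 1, "text": "卷积神经网络用于图像分类", '
    '"annotations": [{"start": 0, "end": 6, "label": "TERM"}]}\n'
)

REQUIRED_FIELDS = {"id", "text", "annotations"}


def parse_records(jsonl_text: str) -> List[Dict]:
    """Parse JSON Lines text and check each record for the documented fields."""
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        records.append(record)
    return records


def to_bio_tags(record: Dict) -> List[str]:
    """Convert the assumed (start, end, label) spans into one BIO tag per character."""
    tags = ["O"] * len(record["text"])
    for span in record["annotations"]:
        start, end, label = span["start"], span["end"], span["label"]
        tags[start] = f"B-{label}"          # first character of the term
        for i in range(start + 1, end):     # remaining characters of the term
            tags[i] = f"I-{label}"
    return tags


records = parse_records(SAMPLE_JSONL)
tags = to_bio_tags(records[0])
for char, tag in zip(records[0]["text"], tags):
    print(char, tag)
```

If the real files use a different container (for example a single JSON array) or a different annotation layout, only parse_records and the span unpacking in to_bio_tags would need to change.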

Provider:

Beijing University of Posts and Telecommunications

Collected summary

Dataset introduction

Background and challenges

Background overview

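This dataset was developed to address the intelligent recognition and automatic extraction of technical terms in frontier science and technology. It contains the raw text corpus based on the e-book *Deep Learning 500 Questions* (《深度学习500问》) and the manually annotated data, with a total size of 408.09 KB across 6 files. The dataset was published by Wu Jun (吴俊) and co-authors at Beijing University of Posts and Telecommunications to support development of a Chinese technical-term extraction model that combines BiLSTM and CRF.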

The content above was collected and summarized by 遇见数据集.