GLINER-multi-task-synthetic-data

Name: GLINER-multi-task-synthetic-data
Creator: maas
Published: 2025-12-10 16:19:25
License: not specified

ModelScope community. Updated 2025-12-10; indexed 2024-12-28.

Download link:

https://modelscope.cn/datasets/knowledgator/GLINER-multi-task-synthetic-data

Description:

This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each holding a tokenized text together with its named entity recognition (NER) annotations. Each item consists of two main components:

1. 'tokenized_text': A list of the individual words and punctuation marks of the original text, split into tokens.
2. 'ner': A list of lists containing the NER information. Each inner list has three elements:
   - Start index of the named entity in the tokenized text
   - End index of the named entity in the tokenized text
   - Label 'match' for the identified entity

The dataset was pre-annotated with Llama3-8B, which was fed Wikipedia articles.

**Supported tasks:**
- Named Entity Recognition (NER): Identifies and categorizes entities such as names, organizations, dates, and other specific items in the text.
- Relation Extraction: Detects and classifies relationships between entities within the text.
- Summarization: Extracts the most important sentences that summarize the input text, capturing the essential information.
- Sentiment Extraction: Identifies parts of the text that signal a positive, negative, or neutral sentiment.
- Key-Phrase Extraction: Identifies and extracts important phrases and keywords from the text.
- Question-answering: Finds an answer in the text given a question.
- Open Information Extraction: Extracts pieces of text given an open prompt from a user, for example, product description extraction.
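To make the record layout described above concrete, here is a minimal sketch of reading one item and recovering the text of each annotated span. The sample record and its labels are invented for illustration and are not taken from the dataset, the helper name `extract_entities` is hypothetical, and the code assumes the end index is inclusive, which the description does not state; pass `end_inclusive=False` if the actual data turns out to use exclusive ends.

```python
from typing import Dict, List, Tuple

# Hypothetical record in the described shape; invented for illustration only.
record: Dict[str, list] = {
    "tokenized_text": ["Marie", "Curie", "was", "born", "in", "Warsaw", "."],
    "ner": [[0, 1, "person"], [5, 5, "city"]],
}


def extract_entities(item: Dict[str, list], end_inclusive: bool = True) -> List[Tuple[str, str]]:
    """Return (span text, label) pairs for every annotation in one item.

    Assumption: the end index is inclusive unless end_inclusive is False.
    """
    tokens = item["tokenized_text"]
    entities = []
    for start, end, label in item["ner"]:
        stop = end + 1 if end_inclusive else end
        # Join the tokens of the span with spaces to get readable text.
        entities.append((" ".join(tokens[start:stop]), label))
    return entities


entities = extract_entities(record)
for text, label in entities:
    print(f"{label}: {text}")
```

Keeping the inclusive/exclusive choice as a parameter makes it easy to check the convention against a few real records before relying on the spans.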

Provider:

maas

Created:

2024-12-26