louisguitton/dev-ner-ontonotes

Name: louisguitton/dev-ner-ontonotes
Creator: louisguitton
Published: 2024-05-15 09:14:06
License: No description available

Hugging Face, updated 2024-05-15, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/louisguitton/dev-ner-ontonotes

Resource description:

---
size_categories: 1K<n<10K
tags:
- argilla
task_categories:
- token-classification
language:
- en
---

# dev-ner-ontonotes

> Validation set of NER dataset OntoNotes5 created with [Argilla](https://docs.argilla.io) for an Argilla Meetup talk.

## Usage

### Load with Argilla

To load with Argilla, install Argilla with `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")
```

### Load with `datasets`

To load this dataset with `datasets`, install `datasets` with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("louisguitton/dev-ner-ontonotes")
```

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| text | Text | text | True | False |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| entities | Highlight the entities in the text: | span | True | N/A | N/A |

The **suggestions** are human- or machine-generated recommendations for each question that assist the annotator during the annotation process. They are always linked to the existing questions and are named by appending "-suggestion" and "-suggestion-metadata" to the question name, containing the value(s) of the suggestion and its metadata, respectively. The possible values are therefore the same as in the table above, but the column names carry the "-suggestion" and "-suggestion-metadata" suffixes.

The **metadata** is a dictionary that can be used to provide additional information about the dataset record, either as extra context for the annotators or as information about the record itself, such as a link to the original source, the author, or the date. The metadata is always optional, and can potentially be linked to the `metadata_properties` defined in the dataset configuration file `argilla.yaml`.

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |

The **guidelines** are optional as well, and are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "text": "A Russian diver has found the bodies of three of the 118 sailors who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
    },
    "metadata": {},
    "responses": [],
    "suggestions": [
        {
            "agent": "gold_labels",
            "question_name": "entities",
            "score": null,
            "type": null,
            "value": [
                { "end": 9, "label": "NORP", "score": 1.0, "start": 2 },
                { "end": 45, "label": "CARDINAL", "score": 1.0, "start": 40 },
                { "end": 56, "label": "CARDINAL", "score": 1.0, "start": 53 },
                { "end": 113, "label": "PRODUCT", "score": 1.0, "start": 108 },
                { "end": 137, "label": "LOC", "score": 1.0, "start": 122 }
            ]
        }
    ],
    "vectors": {}
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "entities": [],
    "entities-suggestion": {
        "end": [9, 45, 56, 113, 137],
        "label": ["NORP", "CARDINAL", "CARDINAL", "PRODUCT", "LOC"],
        "score": [1.0, 1.0, 1.0, 1.0, 1.0],
        "start": [2, 40, 53, 108, 122],
        "text": ["Russian", "three", "118", "Kursk", "the Barents Sea"]
    },
    "entities-suggestion-metadata": {
        "agent": "gold_labels",
        "score": null,
        "type": null
    },
    "external_id": null,
    "metadata": "{}",
    "text": "A Russian diver has found the bodies of three of the 118 sailors who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.
    * **text** is of type `text`.
* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.
    * **entities** is of type `span`.
* **Suggestions:** As of Argilla 1.13.0, suggestions are included to assist the annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself, but also the metadata linked to it, if applicable.
    * (optional) **entities-suggestion** is of type `span`.

Additionally, there are two more optional fields:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record, either as extra context for the annotators or as information about the record itself, such as a link to the original source, the author, or the date. The metadata can potentially be linked to the `metadata_properties` defined in the dataset configuration file `argilla.yaml`.
* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `validation`.
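The columnar `entities-suggestion` layout shown under Data Instances stores one list per attribute rather than one object per span. As a minimal sketch (the helper name `suggestion_spans` is hypothetical, and the record is the example from this card trimmed to the keys used here), the columns can be regrouped into one dictionary per span, checking along the way that each `start`/`end` pair is a character offset with an exclusive end:

```python
from typing import Any


def suggestion_spans(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Regroup the columnar `entities-suggestion` field into one dict per span."""
    columns = record["entities-suggestion"]
    text = record["text"]
    spans = []
    for start, end, label, surface in zip(
        columns["start"], columns["end"], columns["label"], columns["text"]
    ):
        # Offsets are character positions with an exclusive end, so slicing
        # the record text should reproduce the stored surface string.
        if text[start:end] != surface:
            raise ValueError(f"offsets {start}:{end} do not match {surface!r}")
        spans.append({"start": start, "end": end, "label": label, "text": surface})
    return spans


# Example record from the dataset card, trimmed to the keys used here.
record = {
    "text": (
        "A Russian diver has found the bodies of three of the 118 sailors "
        "who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
    ),
    "entities-suggestion": {
        "start": [2, 40, 53, 108, 122],
        "end": [9, 45, 56, 113, 137],
        "label": ["NORP", "CARDINAL", "CARDINAL", "PRODUCT", "LOC"],
        "text": ["Russian", "three", "118", "Kursk", "the Barents Sea"],
    },
}

for span in suggestion_spans(record):
    print(span["label"], repr(span["text"]))
```

The same function could be applied to rows obtained from `load_dataset`, assuming they follow the layout of the example record.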

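Provider:

louisguitton

Summary of original information

Dataset overview

Basic information

Name: dev-ner-ontonotes
Size: 1K<n<10K
Tags: argilla
Task category: token-classification
Language: en

Usage

Load with Argilla: after installing Argilla with `pip install argilla --upgrade`, load the dataset with `import argilla as rg; ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")`.
Load with datasets: after installing datasets with `pip install datasets --upgrade`, load the dataset with `from datasets import load_dataset; ds = load_dataset("louisguitton/dev-ner-ontonotes")`.

Dataset structure

Fields: currently only text fields are supported, such as text.
Questions: asked to the annotators, such as entities, of type span.
Suggestions: linked to questions to assist the annotators, such as entities-suggestion, of type span.
Metadata: optional additional information, such as a link or an author.
Guidelines: optional annotation guidelines.

Data instances

Argilla format: contains the text field and the suggestions for the entities question.
HuggingFace datasets format: contains the text field plus the suggestions for the entities question and their metadata.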
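The relationship between the two formats can be sketched in a few lines. This is a hand-written illustration based only on the two example records in the dataset card, not Argilla's own export code; the function name `to_columnar` is hypothetical, and the input record is shortened to two spans.

```python
from typing import Any


def to_columnar(argilla_record: dict[str, Any], question: str = "entities") -> dict[str, Any]:
    """Flatten one Argilla-style record into the columnar `datasets`-style layout."""
    text = argilla_record["fields"]["text"]
    # Pick the suggestion attached to the requested question.
    suggestion = next(
        item for item in argilla_record["suggestions"] if item["question_name"] == question
    )
    spans = suggestion["value"]
    return {
        "text": text,
        f"{question}-suggestion": {
            "start": [span["start"] for span in spans],
            "end": [span["end"] for span in spans],
            "label": [span["label"] for span in spans],
            "score": [span["score"] for span in spans],
            # The surface strings are recovered by slicing the text.
            "text": [text[span["start"]:span["end"]] for span in spans],
        },
        f"{question}-suggestion-metadata": {
            "agent": suggestion["agent"],
            "score": suggestion["score"],
            "type": suggestion["type"],
        },
    }


# Shortened version of the Argilla-format example (first two spans only).
argilla_record = {
    "fields": {
        "text": (
            "A Russian diver has found the bodies of three of the 118 sailors "
            "who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
        )
    },
    "suggestions": [
        {
            "agent": "gold_labels",
            "question_name": "entities",
            "score": None,
            "type": None,
            "value": [
                {"start": 2, "end": 9, "label": "NORP", "score": 1.0},
                {"start": 40, "end": 45, "label": "CARDINAL", "score": 1.0},
            ],
        }
    ],
}

flat = to_columnar(argilla_record)
print(flat["entities-suggestion"]["text"])
```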

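Data fields

Fields: text (text type)
Questions: entities (span type)
Suggestions: entities-suggestion (span type, optional)
Metadata: optional, provides additional information
external_id: optional, provides an external ID

Data splits

Splits: contains only the validation split.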
