louisguitton/dev-ner-ontonotes
Source: Hugging Face. Updated 2024-05-15; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/louisguitton/dev-ner-ontonotes
Resource description:
---
size_categories: 1K<n<10K
tags:
- argilla
task_categories:
- token-classification
language:
- en
---
# dev-ner-ontonotes
> Validation set of the OntoNotes 5 NER dataset, created with [Argilla](https://docs.argilla.io) for an Argilla Meetup talk.
## Usage
### Load with Argilla
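To load the dataset with Argilla, install Argilla with `pip install argilla --upgrade`, then run the following code. `FeedbackDataset` is part of the Argilla 1.x API, so this snippet may require a 1.x release rather than the latest version: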
```python
import argilla as rg
ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")
```
### Load with `datasets`
To load this dataset with `datasets`, install it with `pip install datasets --upgrade`, then run the following code:
```python
from datasets import load_dataset
ds = load_dataset("louisguitton/dev-ner-ontonotes")
```
## Dataset Structure
### Data in Argilla
The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.
The **fields** are the dataset records themselves; for the moment, only text fields are supported. Annotators read these fields in order to respond to the questions.
| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| text | Text | text | True | False |
The **questions** are what the annotators are asked about each record. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking. This dataset defines a single question of type span.
| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| entities | Highlight the entities in the text: | span | True | N/A | N/A |
The **suggestions** are human- or machine-generated recommendations for each question, meant to assist the annotator during annotation. They are always linked to an existing question, and their columns are named by appending "-suggestion" and "-suggestion-metadata" to the question name; these hold the suggested value(s) and the suggestion's metadata, respectively. The possible values are therefore the same as in the table above.
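The naming rule for suggestion columns can be sketched as a small helper. The function name `suggestion_columns` is hypothetical and is not part of Argilla or `datasets`; it only restates the convention described above.

```python
from typing import Tuple


def suggestion_columns(question_name: str) -> Tuple[str, str]:
    """Return the (value, metadata) column names derived from a question name.

    Hypothetical helper: it only applies the "-suggestion" and
    "-suggestion-metadata" suffixes described in this card.
    """
    return f"{question_name}-suggestion", f"{question_name}-suggestion-metadata"


value_column, metadata_column = suggestion_columns("entities")
print(value_column)     # entities-suggestion
print(metadata_column)  # entities-suggestion-metadata
```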
The **metadata** is a dictionary that holds additional information about a dataset record, for example a link to the original source, the author, or the date. It can give annotators extra context. The metadata is always optional and can be linked to the `metadata_properties` defined in the dataset configuration file, `argilla.yaml`.
| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |
The **guidelines** are also optional: a plain string used to give instructions to the annotators. Find them in the [annotation guidelines](#annotation-guidelines) section.
### Data Instances
An example of a dataset instance in Argilla looks as follows:
```json
{
  "external_id": null,
  "fields": {
    "text": "A Russian diver has found the bodies of three of the 118 sailors who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
  },
  "metadata": {},
  "responses": [],
  "suggestions": [
    {
      "agent": "gold_labels",
      "question_name": "entities",
      "score": null,
      "type": null,
      "value": [
        {
          "end": 9,
          "label": "NORP",
          "score": 1.0,
          "start": 2
        },
        {
          "end": 45,
          "label": "CARDINAL",
          "score": 1.0,
          "start": 40
        },
        {
          "end": 56,
          "label": "CARDINAL",
          "score": 1.0,
          "start": 53
        },
        {
          "end": 113,
          "label": "PRODUCT",
          "score": 1.0,
          "start": 108
        },
        {
          "end": 137,
          "label": "LOC",
          "score": 1.0,
          "start": 122
        }
      ]
    }
  ],
  "vectors": {}
}
```
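In the instance above, `start` and `end` behave as character offsets into `text`, with `end` exclusive. This can be checked by slicing. The sketch below hard-codes the record shown above instead of loading it, and `span_texts` is a hypothetical helper name, not an Argilla function.

```python
# The text and spans are copied from the example record above.
text = (
    "A Russian diver has found the bodies of three of the 118 sailors "
    "who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
)
spans = [
    {"start": 2, "end": 9, "label": "NORP"},
    {"start": 40, "end": 45, "label": "CARDINAL"},
    {"start": 53, "end": 56, "label": "CARDINAL"},
    {"start": 108, "end": 113, "label": "PRODUCT"},
    {"start": 122, "end": 137, "label": "LOC"},
]


def span_texts(text, spans):
    """Return (surface form, label) pairs by slicing text[start:end]."""
    return [(text[s["start"]:s["end"]], s["label"]) for s in spans]


for surface, label in span_texts(text, spans):
    print(f"{label:10s} {surface!r}")
```

The recovered surface forms match the `text` list of `entities-suggestion` in the `datasets` view of the same record.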
The same record in Hugging Face `datasets` looks as follows:
```json
{
  "entities": [],
  "entities-suggestion": {
    "end": [9, 45, 56, 113, 137],
    "label": ["NORP", "CARDINAL", "CARDINAL", "PRODUCT", "LOC"],
    "score": [1.0, 1.0, 1.0, 1.0, 1.0],
    "start": [2, 40, 53, 108, 122],
    "text": ["Russian", "three", "118", "Kursk", "the Barents Sea"]
  },
  "entities-suggestion-metadata": {
    "agent": "gold_labels",
    "score": null,
    "type": null
  },
  "external_id": null,
  "metadata": "{}",
  "text": "A Russian diver has found the bodies of three of the 118 sailors who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
}
```
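Because the `datasets` view stores the suggestion as parallel lists, a token-classification pipeline usually needs to zip them back into spans and project those onto tokens. The following is a minimal sketch under two assumptions that the card does not state: tokens are separated by single spaces (as the example text suggests), and BIO is the desired tagging scheme. The helpers `to_spans` and `to_bio` are hypothetical names.

```python
# Row copied from the example above, reduced to the keys used here.
row = {
    "text": (
        "A Russian diver has found the bodies of three of the 118 sailors "
        "who were killed when the nuclear submarine Kursk sank in the Barents Sea ."
    ),
    "entities-suggestion": {
        "start": [2, 40, 53, 108, 122],
        "end": [9, 45, 56, 113, 137],
        "label": ["NORP", "CARDINAL", "CARDINAL", "PRODUCT", "LOC"],
    },
}


def to_spans(suggestion):
    """Zip the parallel start/end/label lists into one dict per span."""
    return [
        {"start": s, "end": e, "label": l}
        for s, e, l in zip(suggestion["start"], suggestion["end"], suggestion["label"])
    ]


def to_bio(text, spans):
    """Split on single spaces and tag each token with a BIO label."""
    tokens, tags = [], []
    pos = 0
    for token in text.split(" "):
        start, end = pos, pos + len(token)
        pos = end + 1  # skip the separating space
        if not token:  # ignore empty tokens caused by repeated spaces
            continue
        tag = "O"
        for span in spans:
            # A token belongs to a span when it lies fully inside it.
            if start >= span["start"] and end <= span["end"]:
                prefix = "B-" if start == span["start"] else "I-"
                tag = prefix + span["label"]
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


tokens, tags = to_bio(row["text"], to_spans(row["entities-suggestion"]))
for token, tag in zip(tokens, tags):
    if tag != "O":
        print(token, tag)
```

Tokens that only partly overlap a span are left as `O` in this sketch; a real pipeline would need to decide how to handle such cases.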
### Data Fields
The dataset fields fall into the following groups:
* **Fields:** These are the dataset records themselves; for the moment, only text fields are supported. Annotators read these fields in order to respond to the questions.
    * **text** is of type `text`.
* **Questions:** These are what the annotators are asked about each record. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.
    * **entities** is of type `span`.
* **Suggestions:** As of Argilla 1.13.0, suggestions can be attached to records to assist annotators during the annotation process. Suggestions are linked to existing questions, are always optional, and contain both the suggestion itself and, if applicable, the metadata linked to it.
    * (optional) **entities-suggestion** is of type `span`.

Two more optional fields are available:
* **metadata:** Additional information about the dataset record, for example a link to the original source, the author, or the date. It can give annotators extra context. The metadata is always optional and can be linked to the `metadata_properties` defined in the dataset configuration file, `argilla.yaml`.
* **external_id:** An external ID for the dataset record. This is useful for linking the record to an external resource, such as a database or a file.
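In the `datasets` view of the example record, `metadata` appears as the JSON string `"{}"` rather than as a dictionary, so it may need decoding before use. The sketch below assumes that every row serializes metadata the same way, which the card shows for only one record; `parse_metadata` is a hypothetical helper name.

```python
import json

# Values copied from the example record in its `datasets` form.
row = {"metadata": "{}", "external_id": None}


def parse_metadata(raw):
    """Decode a JSON-encoded metadata string; treat None or "" as no metadata."""
    if not raw:
        return {}
    return json.loads(raw)


metadata = parse_metadata(row["metadata"])
print(metadata)            # {}
print(row["external_id"])  # None
```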
### Data Splits
The dataset contains a single split, which is `validation`.
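Provider:
louisguitton

Summary of the original information

Dataset overview

Basic information
- Name: dev-ner-ontonotes
- Size: 1K<n<10K
- Tags: argilla
- Task category: token-classification
- Language: en

Usage
- Load with Argilla: after installing Argilla with `pip install argilla --upgrade`, run `import argilla as rg` and load the dataset with `ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")`.
- Load with `datasets`: after installing `datasets` with `pip install datasets --upgrade`, run `from datasets import load_dataset` and load the dataset with `ds = load_dataset("louisguitton/dev-ner-ontonotes")`.

Dataset structure
- Fields: only text fields are currently supported, such as `text`.
- Questions: what the annotators are asked, such as `entities`, of type span.
- Suggestions: linked to questions to assist annotators, such as `entities-suggestion`, of type span.
- Metadata: optional additional information, such as links or the author.
- Guidelines: optional annotation guidelines.

Data instances
- Argilla format: contains the `text` field and the suggestions for the `entities` question.
- Hugging Face `datasets` format: contains the `text` field, the suggestions for the `entities` question, and their metadata.

Data fields
- Fields: `text` (text type)
- Questions: `entities` (span type)
- Suggestions: `entities-suggestion` (span type, optional)
- Metadata: optional, provides additional information
- external_id: optional, provides an external ID

Data splits
- Splits: contains only the `validation` split.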



