Filippo/osdg_cd
Source: Hugging Face. Updated: 2023-10-08. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Filippo/osdg_cd
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- natural-language-inference
pretty_name: OSDG Community Dataset (OSDG-CD)
dataset_info:
  config_name: main_config
  features:
  - name: doi
    dtype: string
  - name: text_id
    dtype: string
  - name: text
    dtype: string
  - name: sdg
    dtype: uint16
  - name: label
    dtype:
      class_label:
        names:
          '0': SDG 1
          '1': SDG 2
          '2': SDG 3
          '3': SDG 4
          '4': SDG 5
          '5': SDG 6
          '6': SDG 7
          '7': SDG 8
          '8': SDG 9
          '9': SDG 10
          '10': SDG 11
          '11': SDG 12
          '12': SDG 13
          '13': SDG 14
          '14': SDG 15
          '15': SDG 16
  - name: labels_negative
    dtype: uint16
  - name: labels_positive
    dtype: uint16
  - name: agreement
    dtype: float32
  splits:
  - name: train
    num_bytes: 30151244
    num_examples: 42355
  download_size: 29770590
  dataset_size: 30151244
---
# Dataset Card for OSDG-CD
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [OSDG-CD homepage](https://zenodo.org/record/8397907)
### Dataset Summary
The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, which were validated by approximately 1,000 OSDG Community Platform (OSDG-CP) citizen scientists from over 110 countries, with respect to the Sustainable Development Goals (SDGs).
> NOTES
>
> * Earlier releases had no examples for SDGs 16 and 17. See [this GitHub issue](https://github.com/osdg-ai/osdg-data/issues/3).
> * As of July 2023, there are also examples for SDG 16.
### Supported Tasks and Leaderboards
TBD
### Languages
The language of the dataset is English.
## Dataset Structure
### Data Instances
Each instance includes a string for the text and an integer for the label. The remaining fields, including the integer `sdg`, are described under Data Fields.
```
{'text': 'Each section states the economic principle, reviews international good practice and discusses the situation in Brazil.',
'label': 5}
```
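The `label` value in the instance above is a class index, not an SDG number. The following is a minimal sketch of that mapping, assuming the `class_label` names declared in the card metadata (`'0': SDG 1` through `'15': SDG 16`). The helper name `label_to_sdg_name` is hypothetical and is not part of the dataset or of any library.

```python
# Class names as declared in the card metadata: index 0 is "SDG 1", index 15 is "SDG 16".
SDG_CLASS_NAMES = [f"SDG {n}" for n in range(1, 17)]


def label_to_sdg_name(label: int) -> str:
    """Translate a class index from the `label` field into its SDG name."""
    if not 0 <= label < len(SDG_CLASS_NAMES):
        raise ValueError(
            f"label must be in 0..{len(SDG_CLASS_NAMES) - 1}, got {label}"
        )
    return SDG_CLASS_NAMES[label]


# The instance shown in the card.
example = {
    "text": "Each section states the economic principle, reviews international "
            "good practice and discusses the situation in Brazil.",
    "label": 5,
}
print(label_to_sdg_name(example["label"]))  # SDG 6
```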
### Data Fields
- `doi`: Digital Object Identifier of the original document
- `text_id`: unique text identifier
- `text`: text excerpt from the document
- `sdg`: the SDG the text is validated against
- `label`: an integer class index from `0` to `15`, where index `k` corresponds to SDG `k + 1` in the `sdg` field
- `labels_negative`: the number of volunteers who rejected the suggested SDG label
- `labels_positive`: the number of volunteers who accepted the suggested SDG label
- `agreement`: agreement score based on the formula
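The card does not reproduce the agreement formula. The sketch below assumes the definition used in the upstream OSDG-CD release description, namely the absolute difference between accepting and rejecting votes divided by the total number of votes. Treat that formula, the helper names `agreement_score` and `keep_confident`, and the `0.6` threshold as assumptions to verify against the Zenodo record.

```python
from typing import Iterable, List, Mapping


def agreement_score(labels_positive: int, labels_negative: int) -> float:
    """Assumed formula: |positive - negative| / (positive + negative)."""
    total = labels_positive + labels_negative
    if total == 0:
        raise ValueError("at least one vote is required")
    return abs(labels_positive - labels_negative) / total


def keep_confident(rows: Iterable[Mapping], min_agreement: float = 0.6) -> List[Mapping]:
    """Keep rows whose suggested SDG was accepted by a clear majority.

    The threshold is an illustrative choice, not a value from the card.
    """
    return [
        row
        for row in rows
        if row["labels_positive"] > row["labels_negative"]
        and agreement_score(row["labels_positive"], row["labels_negative"]) >= min_agreement
    ]


# Hypothetical rows for illustration only.
rows = [
    {"text_id": "a", "labels_positive": 8, "labels_negative": 1},
    {"text_id": "b", "labels_positive": 3, "labels_negative": 3},
    {"text_id": "c", "labels_positive": 0, "labels_negative": 5},
]
print([row["text_id"] for row in keep_confident(rows)])
```

Because the score is symmetric, a unanimous rejection also has an agreement of 1.0, which is why the filter checks the direction of the majority as well as the score.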
### Data Splits
The OSDG-CD dataset has one split: _train_.
| Dataset Split | Number of Instances in Split |
| ------------- |----------------------------- |
| Train         | 42,355                       |
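A rough sketch of how the single split might be loaded and inspected with the Hugging Face `datasets` library. It assumes the repository id `Filippo/osdg_cd` from the page title and that the default configuration resolves without extra arguments. Depending on the installed `datasets` version, a configuration name or other options may be needed. The helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def load_train_split():
    """Load the `train` split from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the counting helper below works without the library installed.
    from datasets import load_dataset

    # `split="train"` selects the only split listed in the table above.
    return load_dataset("Filippo/osdg_cd", split="train")


def count_per_label(rows: Iterable[Mapping]) -> Counter:
    """Count how many examples carry each `label` value."""
    return Counter(row["label"] for row in rows)


# Offline illustration with hypothetical rows; real rows come from load_train_split().
sample_rows = [{"label": 5}, {"label": 5}, {"label": 0}]
print(count_per_label(sample_rows))
```

With network access, `count_per_label(load_train_split())` would give the per-class distribution, which is worth checking before training because the classes are unlikely to be balanced.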
## Dataset Creation
### Curation Rationale
The [OSDG Community Dataset (OSDG-CD)](https://zenodo.org/record/8397907) was developed as a benchmark for ...
with the goal of producing a dataset large enough to train models using neural methodologies.
### Source Data
#### Initial Data Collection and Normalization
TBD
#### Who are the source language producers?
TBD
### Annotations
#### Annotation process
TBD
#### Who are the annotators?
TBD
### Personal and Sensitive Information
The dataset does not contain any personal information about the authors or the crowdworkers.
## Considerations for Using the Data
### Social Impact of Dataset
TBD
## Additional Information
TBD
### Dataset Curators
TBD
### Licensing Information
The OSDG Community Dataset (OSDG-CD) is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
### Citation Information
```
@dataset{osdg_2023_8397907,
author = {OSDG and
UNDP IICPSD SDG AI Lab and
PPMI},
title = {OSDG Community Dataset (OSDG-CD)},
month = oct,
year = 2023,
note = {{This CSV file uses UTF-8 character encoding. For
easy access on MS Excel, open the file using Data
→ From Text/CSV. Please split CSV data into
different columns by using a TAB delimiter.}},
publisher = {Zenodo},
version = {2023.10},
doi = {10.5281/zenodo.8397907},
url = {https://doi.org/10.5281/zenodo.8397907}
}
```
### Contributions
TBD
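Provider:
Filippo
Summary of the original information
Dataset overview
- Dataset name: OSDG Community Dataset (OSDG-CD)
- Dataset description: OSDG-CD is a public dataset of thousands of text excerpts, validated with respect to the Sustainable Development Goals (SDGs) by approximately 1,000 OSDG Community Platform (OSDG-CP) citizen scientists from over 110 countries.
- Language: English
- License: Creative Commons Attribution 4.0 International License (cc-by-4.0)
- Multilinguality: monolingual
- Size: 10K<n<100K
- Task category: text classification
- Task ID: natural language inference
Dataset structure
- Data instances: each instance contains the text, the SDG, and the label.
- Data fields:
  - doi: Digital Object Identifier of the original document
  - text_id: unique text identifier
  - text: text excerpt from the document
  - sdg: the SDG the text is validated against
  - label: integer class index (0 to 15) corresponding to the sdg field
  - labels_negative: number of volunteers who rejected the suggested SDG label
  - labels_positive: number of volunteers who accepted the suggested SDG label
  - agreement: agreement score based on the formula
- Data splits: the dataset contains one split, train, with 42,355 instances.
Dataset creation
- Annotation creators: crowdsourced
- Language creators: crowdsourced
- Personal and sensitive information: the dataset does not contain any personal information about the authors or the crowdworkers.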



