
Filippo/osdg_cd

Hugging Face | Updated 2023-10-08 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Filippo/osdg_cd
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- natural-language-inference
pretty_name: OSDG Community Dataset (OSDG-CD)
dataset_info:
  config_name: main_config
  features:
  - name: doi
    dtype: string
  - name: text_id
    dtype: string
  - name: text
    dtype: string
  - name: sdg
    dtype: uint16
  - name: label
    dtype:
      class_label:
        names:
          '0': SDG 1
          '1': SDG 2
          '2': SDG 3
          '3': SDG 4
          '4': SDG 5
          '5': SDG 6
          '6': SDG 7
          '7': SDG 8
          '8': SDG 9
          '9': SDG 10
          '10': SDG 11
          '11': SDG 12
          '12': SDG 13
          '13': SDG 14
          '14': SDG 15
          '15': SDG 16
  - name: labels_negative
    dtype: uint16
  - name: labels_positive
    dtype: uint16
  - name: agreement
    dtype: float32
  splits:
  - name: train
    num_bytes: 30151244
    num_examples: 42355
  download_size: 29770590
  dataset_size: 30151244
---

# Dataset Card for OSDG-CD

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [OSDG-CD homepage](https://zenodo.org/record/8397907)

### Dataset Summary

The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, which were validated with respect to the Sustainable Development Goals (SDGs) by approximately 1,000 OSDG Community Platform (OSDG-CP) citizen scientists from over 110 countries.

> NOTES
>
> * There are currently no examples for SDGs 16 and 17. See [this GitHub issue](https://github.com/osdg-ai/osdg-data/issues/3).
> * As of July 2023, there are also examples for SDG 16.

### Supported Tasks and Leaderboards

TBD

### Languages

The language of the dataset is English.

## Dataset Structure

### Data Instances

For each instance, there is a string for the text, a string for the SDG, and an integer for the label.

```
{'text': 'Each section states the economic principle, reviews international good practice and discusses the situation in Brazil.',
 'label': 5}
```

The average token counts for the premises and hypotheses are given below:

| Feature    | Mean Token Count |
| ---------- | ---------------- |
| Premise    | 14.1             |
| Hypothesis | 8.3              |

### Data Fields

- `doi`: Digital Object Identifier of the original document
- `text_id`: unique text identifier
- `text`: text excerpt from the document
- `sdg`: the SDG the text is validated against
- `label`: an integer from `0` to `15` which corresponds to the `sdg` field
- `labels_negative`: the number of volunteers who rejected the suggested SDG label
- `labels_positive`: the number of volunteers who accepted the suggested SDG label
- `agreement`: agreement score based on the formula

### Data Splits

The OSDG-CD dataset has one split: _train_.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 32,327                       |

## Dataset Creation

### Curation Rationale

The [OSDG Community Dataset (OSDG-CD)](https://zenodo.org/record/8397907) was developed as a benchmark for ... with the goal of producing a dataset large enough to train models using neural methodologies.

### Source Data

#### Initial Data Collection and Normalization

TBD

#### Who are the source language producers?

TBD

### Annotations

#### Annotation process

TBD

#### Who are the annotators?

TBD

### Personal and Sensitive Information

The dataset does not contain any personal information about the authors or the crowdworkers.

## Considerations for Using the Data

### Social Impact of Dataset

TBD

## Additional Information

TBD

### Dataset Curators

TBD

### Licensing Information

The OSDG Community Dataset (OSDG-CD) is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

### Citation Information

```
@dataset{osdg_2023_8397907,
  author    = {OSDG and UNDP IICPSD SDG AI Lab and PPMI},
  title     = {OSDG Community Dataset (OSDG-CD)},
  month     = oct,
  year      = 2023,
  note      = {{This CSV file uses UTF-8 character encoding. For easy access on MS Excel, open the file using Data → From Text/CSV. Please split CSV data into different columns by using a TAB delimiter.}},
  publisher = {Zenodo},
  version   = {2023.10},
  doi       = {10.5281/zenodo.8397907},
  url       = {https://doi.org/10.5281/zenodo.8397907}
}
```

### Contributions

TBD
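
The `class_label` metadata in the card's front matter lists sixteen names, with index `0` standing for "SDG 1" and index `15` for "SDG 16". The following is a minimal sketch of that lookup in plain Python; the helper name `label_to_name` is hypothetical and not part of the dataset or of any library.

```python
# Class names as declared in the card's `class_label` metadata:
# index 0 -> "SDG 1", ..., index 15 -> "SDG 16".
LABEL_NAMES = [f"SDG {i}" for i in range(1, 17)]


def label_to_name(label: int) -> str:
    """Return the class name for an integer label (hypothetical helper)."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label}")
    return LABEL_NAMES[label]


# The data instance shown in the card carries 'label': 5.
instance = {
    "text": "Each section states the economic principle, reviews international "
            "good practice and discusses the situation in Brazil.",
    "label": 5,
}
print(label_to_name(instance["label"]))  # prints "SDG 6"
```

If the dataset is loaded with the Hugging Face `datasets` library, for example via `load_dataset("Filippo/osdg_cd", "main_config", split="train")`, the same mapping should also be available through the `label` feature itself, so a hand-written table like this is mainly useful when working with exported files.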
Provided by:
Filippo
Summary of original information

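Dataset overview

  • Dataset name: OSDG Community Dataset (OSDG-CD)
  • Dataset description: OSDG-CD is a public dataset of thousands of text excerpts, validated with respect to the Sustainable Development Goals (SDGs) by approximately 1,000 OSDG Community Platform (OSDG-CP) citizen scientists from over 110 countries.
  • Language: English
  • License: Creative Commons Attribution 4.0 International License (cc-by-4.0)
  • Multilinguality: monolingual
  • Size: 10K<n<100K
  • Task category: text classification
  • Task ID: natural language inference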

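Dataset structure

  • Data instances: each instance contains a string for the text, a string for the SDG, and an integer for the label.
  • Data fields:
    • doi: Digital Object Identifier of the original document
    • text_id: unique text identifier
    • text: text excerpt from the document
    • sdg: the SDG the text is validated against
    • label: an integer (0 to 15) corresponding to the sdg field
    • labels_negative: number of volunteers who rejected the suggested SDG label
    • labels_positive: number of volunteers who accepted the suggested SDG label
    • agreement: agreement score based on the formula
  • Data splits: the dataset has one split, the training set, which contains 32,327 instances.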
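
The vote-count and agreement fields listed above lend themselves to filtering out weakly validated excerpts. The following is a minimal sketch in plain Python that works on records shaped like the fields above. The function name `filter_by_votes`, the thresholds `min_agreement=0.6` and `min_votes=3`, and the sample records are all assumptions made for illustration; none of them come from the dataset or its documentation, and the stored `agreement` value is used as-is rather than recomputed.

```python
def filter_by_votes(records, min_agreement=0.6, min_votes=3):
    """Keep records with enough votes and a high enough stored agreement.

    Both thresholds are illustrative defaults, not recommended values.
    """
    kept = []
    for rec in records:
        total_votes = rec["labels_positive"] + rec["labels_negative"]
        if total_votes >= min_votes and rec["agreement"] >= min_agreement:
            kept.append(rec)
    return kept


# Made-up records for illustration only; they are not taken from the dataset.
sample = [
    {"text_id": "a", "labels_negative": 0, "labels_positive": 5, "agreement": 1.0},
    {"text_id": "b", "labels_negative": 2, "labels_positive": 3, "agreement": 0.2},
    {"text_id": "c", "labels_negative": 0, "labels_positive": 1, "agreement": 1.0},
]

kept_ids = [rec["text_id"] for rec in filter_by_votes(sample)]
print(kept_ids)  # prints ['a']
```

Record "b" is dropped for low agreement and record "c" for having a single vote. Suitable thresholds depend on the downstream task and are left to the reader.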

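Dataset creation

  • Annotation creators: crowdsourced
  • Language creators: crowdsourced
  • Personal and sensitive information: the dataset does not contain any personal information about the authors or the crowdworkers.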