lukesjordan/worldbank-project-documents
Hugging Face. Updated 2022-10-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/lukesjordan/worldbank-project-documents
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- table-to-text
- question-answering
- summarization
- text-generation
task_ids:
- abstractive-qa
- closed-domain-qa
- extractive-qa
- language-modeling
- named-entity-recognition
- text-simplification
pretty_name: worldbank_project_documents
language_bcp47:
- en-US
tags:
- conditional-text-generation
- structure-prediction
---
# Dataset Card for World Bank Project Documents
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:** https://github.com/luke-grassroot/aid-outcomes-ml
- **Paper:** Forthcoming
- **Point of Contact:** Luke Jordan (lukej at mit)
### Dataset Summary
This dataset contains documents related to World Bank development projects from 1947 to 2020. It includes the documents used to propose or describe projects at launch and those written when the projects are reviewed. Documents are indexed by World Bank project ID, which can be used to obtain features from multiple publicly available tabular datasets.
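Because documents are indexed by project ID, they can be joined to tabular features with an ordinary key lookup. The following is a minimal sketch using only the standard library. The field names (`project_id`, `document_type`, `text`), the sample IDs, and the tabular columns (`country`, `sector`) are all illustrative assumptions, since this card does not specify column names.

```python
# Hypothetical document records; field names and values are assumptions.
documents = [
    {"project_id": "P000001", "document_type": "APPROVAL", "text": "sample appraisal text"},
    {"project_id": "P000001", "document_type": "REVIEW", "text": "sample review text"},
    {"project_id": "P000002", "document_type": "APPROVAL", "text": "another appraisal text"},
]

# Hypothetical rows from a separate tabular dataset keyed by the same project ID.
tabular_rows = [
    {"project_id": "P000001", "country": "Country A", "sector": "Transport"},
    {"project_id": "P000003", "country": "Country B", "sector": "Energy"},
]


def join_on_project_id(docs, rows):
    """Attach tabular features to each document that shares a project ID."""
    features_by_id = {row["project_id"]: row for row in rows}
    joined = []
    for doc in docs:
        features = features_by_id.get(doc["project_id"])
        if features is None:
            continue  # inner join: skip documents without tabular features
        joined.append({**features, **doc})
    return joined


joined = join_on_project_id(documents, tabular_rows)
```

Here `joined` keeps only the two documents for the sample project that also appears in the tabular rows. A left join (keeping unmatched documents) would be an equally reasonable choice, depending on the application.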
### Supported Tasks and Leaderboards
There is no leaderboard yet. The dataset supports a wide range of possible tasks, including varieties of summarization, QA, and language modelling. To date, the documents have been used primarily in conjunction with tabular data (via BERT embeddings) to predict project outcomes.
### Languages
English
## Dataset Structure
### Data Instances
### Data Fields
* World Bank project ID
* Document text
* Document type: "APPROVAL" for documents written at the beginning of a project, when it is approved; and "REVIEW" for documents written at the end of a project
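The three fields above could be modelled as a small record type. This is a sketch only: the attribute names `project_id`, `text`, and `document_type` are assumptions, since the card lists the fields descriptively rather than by column name. Only the two document type values come from the field list.

```python
from dataclasses import dataclass

# The two document type values named in the field list.
DOCUMENT_TYPES = ("APPROVAL", "REVIEW")


@dataclass(frozen=True)
class ProjectDocument:
    project_id: str     # hypothetical attribute name
    text: str           # hypothetical attribute name
    document_type: str  # hypothetical attribute name

    def __post_init__(self):
        # Reject any document type other than the two listed above.
        if self.document_type not in DOCUMENT_TYPES:
            raise ValueError(f"unexpected document type: {self.document_type!r}")


def pair_by_project(docs):
    """Group documents as {project_id: {document_type: [texts]}}."""
    paired = {}
    for doc in docs:
        by_type = paired.setdefault(doc.project_id, {})
        by_type.setdefault(doc.document_type, []).append(doc.text)
    return paired
```

Grouping by project ID and then by document type makes it easy to compare what was written when a project was approved with what was written at its review.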
### Data Splits
To allow for open exploration, and since different applications will want to split the data using different sampling weights, we have not made a train/test split and have left all files in the train branch.
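Since no split is provided, each application has to define its own. One possible approach, sketched below, assigns whole projects to a split by hashing the project ID, so that the approval and review documents of a project never end up on opposite sides. This is an illustrative assumption rather than the curators' method; the `project_id` key and the default `test_fraction` are hypothetical.

```python
import hashlib


def assign_split(project_id, test_fraction=0.2):
    """Deterministically map a project ID to 'train' or 'test'."""
    digest = hashlib.sha256(project_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_documents(docs, test_fraction=0.2):
    """Split document records by project ID (hypothetical 'project_id' key)."""
    splits = {"train": [], "test": []}
    for doc in docs:
        splits[assign_split(doc["project_id"], test_fraction)].append(doc)
    return splits
```

The hash makes the assignment reproducible without storing a random seed. Applications that need weighted or stratified sampling would replace `assign_split` with their own rule.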
## Dataset Creation
### Source Data
Documents were scraped from the World Bank's public project archive, following links through to specific project pages and then collecting the text files made available by the [World Bank](https://projects.worldbank.org/en/projects-operations/projects-home).
### Annotations
This dataset is not annotated.
### Personal and Sensitive Information
None.
## Considerations for Using the Data
### Social Impact of Dataset
Work built on this dataset may affect development projects, which can have large-scale consequences for many millions of people.
### Discussion of Biases
The documents reflect the history of development, which has well-documented and well-studied problems with the imposition of developed-world ideas on developing countries. The documents provide a way to study those in the field of development, but should not be relied on for their descriptions of the recipient countries, since that language reflects a multitude of biases, especially in documents from the earliest projects.
## Additional Information
### Dataset Curators
Luke Jordan, Busani Ndlovu.
### Licensing Information
MIT +no-false-attribs license (MITNFA).
### Citation Information
@dataset{world-bank-project-documents,
author = {Jordan, Luke and Ndlovu, Busani and Shenk, Justin},
title = {World Bank Project Documents Dataset},
year = {2021}
}
### Contributions
Thanks to [@luke-grassroot](https://github.com/luke-grassroot), [@FRTNX](https://github.com/FRTNX/) and [@justinshenk](https://github.com/justinshenk) for adding this dataset.
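Provider:
lukesjordan
Summary of original information
Dataset overview: World Bank Project Documents
Basic information
- Name: World Bank Project Documents
- Language: English (en-US)
- License: MIT +no-false-attribs license (MITNFA)
- Multilinguality: monolingual
- Source datasets: original
- Size: unknown
- Task categories:
  - Table-to-text
  - Question answering
  - Summarization
  - Text generation
  - Language modeling
  - Named entity recognition
  - Text simplification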
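Dataset description
Dataset summary
- Content: Documents related to World Bank development projects from 1947 to 2020, including the documents used to propose or describe projects at launch and project review documents.
- Index: Documents are indexed by World Bank project ID, which can be used to obtain features from multiple publicly available tabular datasets.
Supported tasks and leaderboards
- Tasks: Varieties of summarization, question answering, and language modeling.
- Usage: Used primarily in conjunction with tabular data, via BERT embeddings, to predict project outcomes.
Dataset structure
- Data instances: Each contains a World Bank project ID, the document text, and the document type ("APPROVAL" or "REVIEW").
- Data splits: No train/test split was made; all files are in the train branch to support open exploration.
Dataset creation
Source data
- Collection method: Scraped from the World Bank's public project archive, collecting text files through specific project pages.
Annotations
- Annotation status: Not annotated.
Personal and sensitive information
- Status: No personal or sensitive information.
Considerations for using the data
Social impact
- Scope of impact: Affects development projects, which can have major consequences for the lives of millions of people.
Discussion of biases
- Bias description: The documents reflect the history of development, which has known problems such as the imposition of developed-world ideas on developing countries. The documents can be used to study biases within the field, but should not be used to describe recipient countries, since that language reflects a multitude of biases, especially in the earlier historical projects.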



