GIZ/policy_qa_v0_1

Hugging Face · Updated 2023-07-14 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/GIZ/policy_qa_v0_1

Resource description:

---
license: apache-2.0
task_categories:
- question-answering
- text-classification
language:
- en
- fr
- es
size_categories:
- 10K<n<100K
tags:
- climate
- policy
---

This dataset is curated by [GIZ Data Service Center](https://www.giz.de/expertise/html/63018.html). The source data comes from an internal GIZ team (IKI_Tracs) and from [Climatewatchdata](https://www.climatewatchdata.org/data-explorer/historical-emissions?historical-emissions-data-sources=climate-watch&historical-emissions-gases=all-ghg&historical-emissions-regions=All%20Selected&historical-emissions-sectors=total-including-lucf%2Ctotal-including-lucf&page=1), where Climatewatch has analysed the Intended Nationally Determined Contribution (INDC), NDC, and Revised/Updated NDC of countries to answer important questions related to climate change.

## Specifications

- Dataset size: ~85k
- Language: English, French, Spanish

# Columns

- **index (type: int)**: Unique response ID.
- **ResponseText (type: str)**: Annotated answer/response to a query.
- **Alpha3 (type: str)**: Country alpha-3 code (ISO 3166).
- **Country (type: str)**: Country name.
- **Document (type: str)**: Name of the type of policy document from which the response is taken.
- **IkiInfo (type: list[dict])**: A ResponseText can occur as the answer/response to different kinds of query, so all raw information is preserved for each occurrence. Each dictionary represents one such occurrence and provides all of its raw metadata. None means the entry belongs to the Climatewatch (CW) data and not to the IKI Tracs data.
- **CWInfo (type: list[dict])**: A ResponseText can occur as the answer/response to different kinds of query, so all raw information is preserved for each occurrence. Each dictionary represents one such occurrence and provides all of its raw metadata. None means the entry belongs to the IKI Tracs data and not to CW.
- **Source (type: list[str])**: Name of the source.
- **Target (type: list)**: The value at index 0 is the number of times ResponseText appears as 'Target'; the value at index 1 is the number of times it appears as not-Target.
- **Action (type: list)**: The value at index 0 is the number of times ResponseText appears as 'Action'; the value at index 1 is the number of times it appears as not-Action.
- **Policies_Plans (type: list)**: The value at index 0 is the number of times ResponseText appears as 'Policy/Plan'; the value at index 1 is the number of times it appears as not-Policy/Plan.
- **Mitigation (type: list)**: The value at index 0 is the number of times ResponseText appears in reference to Mitigation; the value at index 1 is the number of times it appears as not-Mitigation.
- **Adaptation (type: list)**: The value at index 0 is the number of times ResponseText appears in reference to Adaptation; the value at index 1 is the number of times it appears as not-Adaptation.
- **language (type: str)**: ISO code of the language of ResponseText.
- **context (type: list[str])**: List of paragraphs/text chunks from the country's document that contain the ResponseText. These results come from an Okapi BM25 retriever and hence do not represent ground truth.
- **context_lang (type: str)**: ISO code of the language of the context. In some cases the languages of context and ResponseText differ, because the annotator provided a translated response rather than the original text from the document.
- **matching_words (type: list[list[words]])**: For each context, the words that match ResponseText (stopwords not considered).
- **response_words (type: list[words])**: Tokens/words from ResponseText (stopwords not considered).
- **context_wordcount (type: list[int])**: Number of tokens/words in each context (context is itself a list of multiple strings; stopwords not considered).
- **strategy (type: str)**: One of *small*, *medium*, *large*. Represents the length of the paragraphs/text chunks considered when finding the right context for ResponseText.
- **match_onresponse (type: list[float])**: Percentage of overlapping words between response and context, relative to the length of ResponseText.
- **candidate (type: list[list[int]])**: Candidate within each context that corresponds (by fuzzy matching/similarity) to ResponseText. The values at index 0 and 1 are the (start, end) of the string within the context.
- **fetched_text (type: list[str])**: Text of the candidate within each context that corresponds (by fuzzy matching/similarity) to ResponseText.
- **response_translated (type: str)**: Translated ResponseText.
- **context_translated (type: str)**: Translated context.
- **candidate_translated (type: str)**: Translated candidate index values (see column 'candidate').
- **fetched_text_translated (type: str)**: Translated candidates (see column 'candidate').
- **QA_data (type: dict)**: Metadata about ResponseText, describing the nature of the query to which ResponseText is the answer/response.
- **match_onanswer (type: list[float])**: Percentage match between response and candidate text. Based on the statistics, it is recommended to keep only values above 0.3% as answers and to treat the remaining contexts as 'No answer' for the SQuAD2 data format.
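The `match_onanswer` recommendation can be sketched as a small conversion to SQuAD2-style records. This is only an illustration under several assumptions that the card does not confirm: `candidate[i]` holds character offsets into `context[i]` with an exclusive end, `match_onanswer[i]` is on the same scale as the threshold (the default of `0.3` mirrors the figure quoted in the card, whose unit is ambiguous), and the question text is passed in by the caller because the layout of `QA_data` is not documented here. The function name `to_squad2` and the sample row are hypothetical.

```python
from __future__ import annotations

from typing import Any


def to_squad2(row: dict[str, Any], question: str, threshold: float = 0.3) -> list[dict[str, Any]]:
    """Build one SQuAD2-style record per retrieved context of a row."""
    records = []
    triples = zip(row["context"], row["candidate"], row["match_onanswer"])
    for i, (context, span, score) in enumerate(triples):
        start, end = span  # assumed: character offsets, end exclusive
        if score > threshold:
            answers = {"text": [context[start:end]], "answer_start": [start]}
        else:
            # SQuAD2 encodes "no answer" as empty answer lists.
            answers = {"text": [], "answer_start": []}
        records.append(
            {
                "id": f"{row['index']}-{i}",
                "question": question,
                "context": context,
                "answers": answers,
            }
        )
    return records


# Hypothetical row, invented for illustration; not taken from the dataset.
answer = "reduce emissions by 30% by 2030"
good = "The country aims to reduce emissions by 30% by 2030."
bad = "Agriculture remains the largest employer."
start = good.index(answer)
row = {
    "index": 7,
    "context": [good, bad],
    "candidate": [[start, start + len(answer)], [0, 11]],
    "match_onanswer": [0.9, 0.1],
}

records = to_squad2(row, question="What is the mitigation target?")
for record in records:
    print(record["id"], record["answers"])
```

Keeping the low-scoring contexts as unanswerable examples, rather than dropping them, follows the card's suggestion to treat such contexts as 'No answer'.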


Provider:

GIZ

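Original information summary

Dataset overview

Basic information

License: Apache-2.0
Task categories:
- Question answering
- Text classification
Supported languages:
- English
- French
- Spanish
Dataset size: approximately 85,000 records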

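Dataset features

Focused on question answering and text classification for climate policy.
Contains multilingual text, supporting international research.
Records the context and metadata of each response text in detail, which facilitates in-depth analysis.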
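For the text-classification use case, the count-pair columns (`Target`, `Action`, `Policies_Plans`, `Mitigation`, `Adaptation`) each store `[times labelled positive, times labelled negative]`. One possible way to reduce them to a single label per column is a majority vote. The rule below is an assumption for illustration, not something the dataset authors prescribe, and the helper name `majority_labels` and the sample row are hypothetical.

```python
from __future__ import annotations

LABEL_COLUMNS = ["Target", "Action", "Policies_Plans", "Mitigation", "Adaptation"]


def majority_labels(row: dict) -> dict[str, bool | None]:
    """Map each count pair [positive, negative] to True, False, or None."""
    labels: dict[str, bool | None] = {}
    for column in LABEL_COLUMNS:
        positive, negative = row[column]
        if positive == negative:
            # Tie, or never annotated for this label: leave it undecided.
            labels[column] = None
        else:
            labels[column] = positive > negative
    return labels


# Hypothetical row, invented for illustration; not taken from the dataset.
example = {
    "Target": [3, 0],
    "Action": [0, 3],
    "Policies_Plans": [1, 1],
    "Mitigation": [2, 1],
    "Adaptation": [0, 0],
}
labels = majority_labels(example)
print(labels)
```

Returning `None` for ties keeps ambiguous rows visible, so they can be filtered out or reviewed instead of being silently assigned a class.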

The content above was collected and summarized by 遇见数据集.
