adriantheuma/raven-data

Name: adriantheuma/raven-data
Creator: adriantheuma
Published: 2024-01-24 10:57:16
License: Apache-2.0

Source: Hugging Face
Updated: 2024-01-24
Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/adriantheuma/raven-data


Description:

---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "ott-qa/train.json"
    - "phrase-bank/train.json"
    - "tat-qa/train.json"
    - "template/train.json"
    - "wiki-sql/train.json"
  - split: test
    path:
    - "ott-qa/test.json"
    - "phrase-bank/test.json"
    - "tat-qa/test.json"
    - "template/test.json"
    - "wiki-sql/test.json"
  - split: val
    path:
    - "ott-qa/dev.json"
    - "phrase-bank/dev.json"
    - "tat-qa/dev.json"
    - "template/dev.json"
    - "wiki-sql/dev.json"
---

## Raven Dataset

The dataset that we use to fine-tune Raven is composed of four distinct question-answering datasets. Two come specifically from the financial domain; the remaining two are generic and incorporate questions over both tables and text.

### TAT-QA

[Table-and-Text Question Answering](https://paperswithcode.com/dataset/tat-qa) consists of 16,552 questions generated by financial experts, associated with 2,757 hybrid contexts drawn from real-world financial reports.

### Financial PhraseBank

[Financial PhraseBank](https://paperswithcode.com/sota/sentiment-analysis-on-financial-phrasebank) consists of 4,846 phrases derived from English financial news on companies listed on OMX Helsinki. The dataset contains phrase-level annotation by financial markets experts that categorises each sample sentence, exclusively from an investor's standpoint, as positive, negative, or neutral.

### Wiki-SQL

[Wiki-SQL](https://paperswithcode.com/dataset/wikisql) consists of 80,654 manually annotated, crowd-sourced examples of natural language questions and corresponding SQL queries over 24,241 tables found on Wikipedia.

### OTT-QA

Similar to TAT-QA, [Open Table-and-Text Question Answering](https://paperswithcode.com/dataset/ott-qa) consists of 43,683 questions over tabular data and unstructured text across diverse domains. The majority of questions necessitate multi-hop inference involving both forms of data.

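The split layout declared in the card's front matter can be rebuilt programmatically, which may help when checking a local copy of the files. The following is a minimal sketch: the directory and file names are copied from the front matter, while the helper name `build_data_files` is hypothetical and not part of the repository.

```python
# Source directories and split-to-filename mapping, copied from the card's
# front matter. Note that the "val" split is stored on disk as dev.json.
SOURCES = ["ott-qa", "phrase-bank", "tat-qa", "template", "wiki-sql"]
SPLIT_FILES = {"train": "train.json", "test": "test.json", "val": "dev.json"}


def build_data_files(sources=SOURCES, split_files=SPLIT_FILES):
    """Return {split: [relative paths]} mirroring the card's `configs` entry.

    Hypothetical helper for illustration; not part of the dataset repository.
    """
    return {
        split: [f"{source}/{filename}" for source in sources]
        for split, filename in split_files.items()
    }


if __name__ == "__main__":
    for split, paths in build_data_files().items():
        print(split, paths)
```

With the Hugging Face `datasets` library installed, `load_dataset("adriantheuma/raven-data")` should resolve the same splits from the card's own configuration, so a mapping like this is mainly useful when working from downloaded files.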
### Data preparation

The datasets described above have diverse formats and are not suited for fine-tuning Raven as-is. We employ a data conversion pipeline to convert these four datasets into a homogeneous dataset suitable for fine-tuning our financial model. In general, we extract up to four key attributes from the original datasets:

1. instruction: describes the task to perform, for example *Determine the sentiment of the following phrase*, or the question *What is the percentage change in revenue after the adoption of ASC 606?*
2. input: provides more context, such as the phrase to classify or a passage
3. data: accompanies the context, in tabular format
4. derivation: produces the answer or expected response

Refer to [templates](https://github.com/adriantheuma/llama2-raven/blob/main/templates/README.md) for examples of the full prompt.

To obtain a balanced dataset, we randomly sub-sample the larger datasets so that samples are distributed uniformly across the different sources. The final dataset contains 47.6K training samples, 5.26K validation samples and 5.81K test samples.
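The balancing step described above can be sketched as a simple random down-sampling of each source. This is an illustration only, not the authors' pipeline: the function name `subsample_balanced`, the `cap` parameter, the fixed seed, and the choice of the smallest source size as the cap are all assumptions, since the card does not state the actual per-source limit.

```python
import random


def subsample_balanced(sources, cap, seed=0):
    """Randomly down-sample every source to at most `cap` records.

    `sources` maps a source name to a list of records. Sources that already
    have `cap` records or fewer are kept whole. Hypothetical helper; the cap
    used for Raven is not given in the card.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = {}
    for name, records in sources.items():
        if len(records) > cap:
            # Sample without replacement.
            balanced[name] = rng.sample(records, cap)
        else:
            balanced[name] = list(records)
    return balanced


if __name__ == "__main__":
    # Toy stand-ins for the real sources; the sizes are invented.
    toy = {
        "tat-qa": list(range(100)),
        "phrase-bank": list(range(30)),
        "wiki-sql": list(range(500)),
    }
    cap = min(len(records) for records in toy.values())
    for name, records in subsample_balanced(toy, cap).items():
        print(name, len(records))
```

Capping at the size of the smallest source gives an exactly uniform mix; a larger cap keeps more data at the cost of some imbalance.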

Provided by:

adriantheuma

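Summary of original information

Dataset overview

Dataset name

Raven Dataset

Dataset composition

TAT-QA: 16,552 questions over 2,757 hybrid contexts, drawn mainly from financial reports.
Financial PhraseBank: 4,846 phrases from financial news, annotated by financial markets experts as positive, negative, or neutral.
Wiki-SQL: 80,654 natural language questions with corresponding SQL queries over 24,241 Wikipedia tables.
OTT-QA: 43,683 questions over tabular data and unstructured text; most require multi-hop inference.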

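Dataset format

The original datasets come in diverse formats and are converted by a data conversion pipeline into a unified format with up to four key attributes:

Task instruction
Input context
Data accompanying the context (in tabular format)
Derivation of the answer or expected response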
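A single converted record with these four attributes might look like the sketch below. The key names follow the attribute names used in the card (instruction, input, data, derivation), but the exact JSON keys and value types in the published files are an assumption, as are the `RavenRecord` type and the `make_sentiment_record` helper. The example phrase is invented for illustration.

```python
from typing import List, Optional, TypedDict


class RavenRecord(TypedDict):
    """Assumed shape of one converted sample (hypothetical type)."""

    instruction: str                  # task description or question
    input: Optional[str]              # extra context, e.g. the phrase or a passage
    data: Optional[List[List[str]]]   # accompanying table, as rows of cells
    derivation: Optional[str]         # how the answer or response is produced


def make_sentiment_record(phrase: str, label: str) -> RavenRecord:
    """Build a record for a sentiment sample (hypothetical helper).

    The instruction text is the example given in the card; storing the label
    under `derivation` is an assumption about the conversion.
    """
    return {
        "instruction": "Determine the sentiment of the following phrase",
        "input": phrase,
        "data": None,  # sentiment samples carry no table
        "derivation": label,
    }


if __name__ == "__main__":
    # Invented phrase, not taken from Financial PhraseBank.
    print(make_sentiment_record("Operating profit rose year on year.", "positive"))
```

Because the card says "up to four" attributes, the sketch marks everything except the instruction as optional.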

Dataset splits

Training set: 47.6K samples
Validation set: 5.26K samples
Test set: 5.81K samples

License

Apache-2.0
