adriantheuma/raven-data
Source: Hugging Face. Updated 2024-01-24; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/adriantheuma/raven-data
Description:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "ott-qa/train.json"
    - "phrase-bank/train.json"
    - "tat-qa/train.json"
    - "template/train.json"
    - "wiki-sql/train.json"
  - split: test
    path:
    - "ott-qa/test.json"
    - "phrase-bank/test.json"
    - "tat-qa/test.json"
    - "template/test.json"
    - "wiki-sql/test.json"
  - split: val
    path:
    - "ott-qa/dev.json"
    - "phrase-bank/dev.json"
    - "tat-qa/dev.json"
    - "template/dev.json"
    - "wiki-sql/dev.json"
---
## Raven Dataset
The dataset that we use to fine-tune Raven is composed of four distinct question-answering datasets. Two are specifically from the financial domain; the remaining two are generic and incorporate questions over both tables and text.
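
As a rough sketch of how the splits declared in the card's front matter map onto files, the snippet below rebuilds that mapping in Python and passes it to `datasets.load_dataset`. The helper names `build_data_files` and `load_raven` are illustrative and not part of the repository; actually loading the data assumes the `datasets` library is installed and the Hub is reachable.

```python
# Source folders and split file names, copied from the card's YAML front matter.
SOURCES = ["ott-qa", "phrase-bank", "tat-qa", "template", "wiki-sql"]
SPLIT_FILES = {"train": "train.json", "test": "test.json", "val": "dev.json"}


def build_data_files():
    """Return {split: [relative JSON paths]} mirroring the `configs` entry."""
    return {
        split: [f"{source}/{filename}" for source in SOURCES]
        for split, filename in SPLIT_FILES.items()
    }


def load_raven(repo_id="adriantheuma/raven-data"):
    """Hypothetical helper: load all three splits from the Hugging Face Hub."""
    # Imported lazily so the mapping above can be inspected without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=build_data_files())


if __name__ == "__main__":
    for split, paths in build_data_files().items():
        print(split, paths)
```

Because the card already declares these files, calling `load_dataset("adriantheuma/raven-data")` without `data_files` should resolve the same splits; the explicit mapping is mainly useful when only some of the sources are wanted.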
### TAT-QA
[Table-and-Text Question Answering](https://paperswithcode.com/dataset/tat-qa) consists of 16,552 questions, generated by financial experts, associated with 2,757 hybrid contexts drawn from real-world financial reports.
### Financial PhraseBank
[Financial PhraseBank](https://paperswithcode.com/sota/sentiment-analysis-on-financial-phrasebank) consists of 4,846 phrases derived from English financial news on companies listed on OMX Helsinki. The dataset contains phrase-level annotations by financial markets experts that categorise each sample sentence, exclusively from an investor's standpoint, as positive, negative, or neutral.
### Wiki-SQL
[Wiki-SQL](https://paperswithcode.com/dataset/wikisql) consists of 80,654 manually annotated, crowd-sourced examples of natural language questions and corresponding SQL queries over 24,241 tables found on Wikipedia.
### OTT-QA
Similar to TAT-QA, [Open Table-and-Text Question Answering](https://paperswithcode.com/dataset/ott-qa) consists of 43,683 questions over tabular data and unstructured text across diverse domains. The majority of questions necessitate multi-hop inference involving both forms of data.
### Data preparation
The datasets described above have diverse formats and are not suited for fine-tuning Raven as-is. We employ a data conversion pipeline to convert these four datasets into a homogeneous dataset suitable for fine-tuning our financial model. In general, we extract up to four key attributes from the original datasets: (1) an instruction that describes the task to perform, for example, *Determine the sentiment of the following phrase*, or the question *What is the percentage change in revenue after the adoption of ASC 606?*; (2) an input that provides more context, such as the phrase to classify or a passage; (3) data that accompanies the context, in tabular format; and (4) a derivation that produces the answer or expected response. Refer to [templates](https://github.com/adriantheuma/llama2-raven/blob/main/templates/README.md) for examples of the full prompt.
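
A minimal sketch of what such a unified record could look like is given below. The attribute names follow the four attributes listed above, but the `RavenRecord` class, the `to_json_dict` helper, the example sentence, and the choice to store the sentiment label in `derivation` are all assumptions made for illustration; the actual field layout is defined by the conversion pipeline and the linked templates.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RavenRecord:
    """Hypothetical container for the (up to) four extracted attributes."""

    instruction: str                  # task description or question
    input: Optional[str] = None       # extra context, e.g. a phrase or passage
    data: Optional[str] = None        # accompanying table, serialised as text
    derivation: Optional[str] = None  # how the answer or response is produced


def to_json_dict(record: RavenRecord) -> dict:
    """Drop attributes a source dataset does not provide ("up to four")."""
    return {key: value for key, value in asdict(record).items() if value is not None}


def from_phrasebank(sentence: str, label: str) -> RavenRecord:
    """Convert one (sentence, label) pair; the label placement is an assumption."""
    return RavenRecord(
        instruction="Determine the sentiment of the following phrase",
        input=sentence,
        derivation=label,
    )


if __name__ == "__main__":
    # Made-up sentence, used only to show the shape of the output.
    record = from_phrasebank("Operating profit rose compared to last year.", "positive")
    print(to_json_dict(record))
```

Keeping absent attributes out of the serialised record, rather than writing empty strings, is one way to let a prompt template decide which sections to render.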
To obtain a balanced dataset, we randomly sub-sample the larger datasets so that samples are uniformly distributed across the different sources. The final dataset contains 47.6K training samples, 5.26K validation samples, and 5.81K test samples.
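
The balancing step can be sketched as capping every source at a common size. The exact cap and random seed used for Raven are not stated here, so `cap` and `seed` below are placeholder parameters, and `subsample_balanced` is a hypothetical helper rather than the pipeline's actual code.

```python
import random


def subsample_balanced(sources, cap, seed=0):
    """Randomly cap each source at `cap` samples; smaller sources are kept whole.

    `sources` maps a source name (e.g. "wiki-sql") to a list of samples.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for name, samples in sources.items():
        if len(samples) > cap:
            balanced[name] = rng.sample(samples, cap)  # sampling without replacement
        else:
            balanced[name] = list(samples)
    return balanced


if __name__ == "__main__":
    # Toy stand-ins for the real sources; the sizes are arbitrary.
    toy = {"large": list(range(1000)), "small": list(range(50))}
    sizes = {name: len(items) for name, items in subsample_balanced(toy, cap=100).items()}
    print(sizes)
```

With this scheme the result is only exactly uniform when every source has at least `cap` samples; sources below the cap stay at their original size.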
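Provided by: adriantheuma
Summary of original information
Dataset overview
Dataset name
Raven Dataset
Dataset composition
- TAT-QA: 16,552 questions over 2,757 hybrid contexts, drawn mainly from financial reports.
- Financial PhraseBank: 4,846 phrases from financial news, annotated by financial markets experts as positive, negative, or neutral.
- Wiki-SQL: 80,654 natural language questions with corresponding SQL queries over 24,241 Wikipedia tables.
- OTT-QA: 43,683 questions over tabular data and unstructured text; most require multi-hop reasoning.
Dataset format
The original datasets come in diverse formats and are converted by a data conversion pipeline into a unified format with four key attributes:
- Task instruction
- Input context
- Data accompanying the context (tabular format)
- Derivation of the answer or expected response
Dataset splits
- Training set: 47.6K samples
- Validation set: 5.26K samples
- Test set: 5.81K samples
License
Apache-2.0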



