
adriantheuma/raven-data

Hugging Face | Updated 2024-01-24 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/adriantheuma/raven-data
Resource description:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "ott-qa/train.json"
    - "phrase-bank/train.json"
    - "tat-qa/train.json"
    - "template/train.json"
    - "wiki-sql/train.json"
  - split: test
    path:
    - "ott-qa/test.json"
    - "phrase-bank/test.json"
    - "tat-qa/test.json"
    - "template/test.json"
    - "wiki-sql/test.json"
  - split: val
    path:
    - "ott-qa/dev.json"
    - "phrase-bank/dev.json"
    - "tat-qa/dev.json"
    - "template/dev.json"
    - "wiki-sql/dev.json"
---

## Raven Dataset

The dataset that we use to fine-tune Raven is composed of four distinct question-answering datasets. Two are specifically from the financial domain; the remaining two are generic and incorporate questions over both tables and text.

### TAT-QA

[Table-and-Text Question Answering](https://paperswithcode.com/dataset/tat-qa) consists of 16,552 questions generated by financial experts, associated with 2,757 hybrid contexts drawn from real-world financial reports.

### Financial PhraseBank

[Financial PhraseBank](https://paperswithcode.com/sota/sentiment-analysis-on-financial-phrasebank) consists of 4,846 phrases derived from English financial news on companies listed on OMX Helsinki. The dataset contains phrase-level annotations by financial markets experts, who categorise each sample sentence exclusively from an investor's standpoint as positive, negative, or neutral.

### Wiki-SQL

[Wiki-SQL](https://paperswithcode.com/dataset/wikisql) consists of 80,654 manually annotated, crowd-sourced examples of natural language questions and corresponding SQL queries over 24,241 tables found on Wikipedia.

### OTT-QA

Similar to TAT-QA, [Open Table-and-Text Question Answering](https://paperswithcode.com/dataset/ott-qa) consists of 43,683 questions over tabular data and unstructured text across diverse domains. The majority of questions necessitate multi-hop inference involving both forms of data.

### Data preparation

The datasets described above have diverse formats and are not suited for fine-tuning Raven as-is. We employ a data conversion pipeline to convert these four datasets into a homogeneous dataset suitable for fine-tuning our financial model. In general, we extract up to four key attributes from the original datasets:

1. instruction: describes the task to perform, for example, *Determine the sentiment of the following phrase*, or the question *What is the percentage change in revenue after the adoption of ASC 606?*
2. input: provides more context, such as the phrase to classify or a passage
3. data: accompanies the context, in tabular format
4. derivation: produces the answer or expected response

Refer to [templates](https://github.com/adriantheuma/llama2-raven/blob/main/templates/README.md) for examples of the full prompt.

To obtain a balanced dataset, we randomly sub-sample the larger datasets so that the final dataset is uniformly distributed across the different sources. The final dataset contains 47.6K training samples, 5.26K validation samples, and 5.81K test samples.
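
The balancing step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `subsample_balanced`, the `max_per_source` cap, and the toy record counts are all assumptions, since the card does not state how the per-source size was chosen.

```python
import random
from collections import Counter
from typing import Dict, List


def subsample_balanced(
    sources: Dict[str, List[dict]],
    max_per_source: int,  # hypothetical cap; the card does not give the value used
    seed: int = 0,
) -> List[dict]:
    """Cap every source at `max_per_source` randomly chosen records, then shuffle."""
    rng = random.Random(seed)
    combined: List[dict] = []
    for name in sorted(sources):  # sorted for a reproducible order
        records = sources[name]
        if len(records) > max_per_source:
            # Sample without replacement from the larger sources only.
            records = rng.sample(records, max_per_source)
        # Tag each record with its origin so the balance can be checked later.
        combined.extend({**record, "source": name} for record in records)
    rng.shuffle(combined)
    return combined


# Toy stand-ins for the converted datasets (sizes are made up).
toy_sources = {
    "wiki-sql": [{"id": i} for i in range(100)],
    "tat-qa": [{"id": i} for i in range(50)],
    "phrase-bank": [{"id": i} for i in range(10)],
}
balanced = subsample_balanced(toy_sources, max_per_source=20)
counts = Counter(record["source"] for record in balanced)
print(counts)
```

Sources smaller than the cap are kept whole, so the result is only as uniform as the smallest source allows.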
Provided by:
adriantheuma
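Summary of original information

Dataset overview

Dataset name

Raven Dataset

Dataset composition

  • TAT-QA: 16,552 questions associated with 2,757 hybrid contexts, drawn mainly from financial reports.
  • Financial PhraseBank: 4,846 phrases from financial news, annotated by financial markets experts as positive, negative, or neutral.
  • Wiki-SQL: 80,654 natural language questions with corresponding SQL queries over 24,241 Wikipedia tables.
  • OTT-QA: 43,683 questions over tabular data and unstructured text; most require multi-hop inference.

Dataset format

The original datasets come in diverse formats and are converted by a data conversion pipeline into a unified format with four key attributes:

  1. Task instruction
  2. Input context
  3. Data accompanying the context (in tabular format)
  4. Derivation of the answer or expected response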
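
As a rough illustration of the four attributes, a converted example could be modelled like this. The class name `RavenRecord`, the `render_prompt` helper, the section labels, and the sample phrase are all hypothetical; the project's actual prompt layout is defined in its templates and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RavenRecord:
    """One converted example; only `instruction` is assumed to be mandatory."""

    instruction: str                  # task description or question
    input: Optional[str] = None       # extra context, e.g. a phrase or passage
    data: Optional[str] = None        # accompanying table, serialised as text
    derivation: Optional[str] = None  # produces the answer or expected response


def render_prompt(record: RavenRecord) -> str:
    """Join the attributes that are present into one prompt string.

    The section labels are placeholders, not the project's real templates.
    """
    sections = [
        ("Instruction", record.instruction),
        ("Input", record.input),
        ("Data", record.data),
    ]
    return "\n\n".join(
        f"### {label}:\n{text}" for label, text in sections if text
    )


# The phrase below is invented for illustration.
example = RavenRecord(
    instruction="Determine the sentiment of the following phrase",
    input="Operating profit rose compared with the previous year.",
    derivation="positive",
)
print(render_prompt(example))
```

Keeping the optional fields as `None` reflects that the pipeline extracts "up to" four attributes, so a sentiment example has no table while a table question may have no passage.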

Dataset splits

  • Training set: 47.6K samples
  • Validation set: 5.26K samples
  • Test set: 5.81K samples
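
The card's configuration assigns one JSON file per source to each split, with the `val` split reading the `dev.json` files. A small sketch that rebuilds that mapping is shown below; the helper name `build_data_files` is an assumption, while the folder and file names come from the configuration. A dictionary of this shape is the kind of value usually passed as `data_files` when loading local JSON files with the Hugging Face `datasets` library, although the paths would first need to be resolved against a local copy of the repository.

```python
from typing import Dict, List

# Source folders and split-to-file names as listed in the card's configuration.
SOURCES = ["ott-qa", "phrase-bank", "tat-qa", "template", "wiki-sql"]
SPLIT_FILES = {"train": "train.json", "test": "test.json", "val": "dev.json"}


def build_data_files(
    sources: List[str] = SOURCES,
    split_files: Dict[str, str] = SPLIT_FILES,
) -> Dict[str, List[str]]:
    """Return a mapping from split name to the relative paths of its JSON files."""
    return {
        split: [f"{source}/{filename}" for source in sources]
        for split, filename in split_files.items()
    }


data_files = build_data_files()
for split, paths in data_files.items():
    print(split, paths)
```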

License

Apache-2.0
