Effyis/Table-Extraction

Hugging Face · Updated 2024-05-31 · Indexed 2024-04-19
Download link:
https://hf-mirror.com/datasets/Effyis/Table-Extraction
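Resource summary:
This dataset is designed to evaluate the ability of large language models (LLMs) to extract tables from text. It provides text snippets that contain tables, together with their structured representations in JSON format. The dataset is based on the Table Fact Dataset (also known as TabFact), which contains 16,573 tables extracted from Wikipedia. Each data point has two parts: a text snippet with an embedded table (context) and a JSON object representing the extracted table structure (answer). The JSON format records each column header and its corresponding row data.
Provider:
Effyis
Summary of original information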

Table Extract Dataset

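Overview

This dataset is designed to evaluate the ability of large language models (LLMs) to extract tables from text. It provides text snippets that contain tables, together with their structured representations in JSON format.

Source

The dataset is based on the Table Fact Dataset (also known as TabFact), which contains 16,573 tables extracted from Wikipedia.

Structure

Each data point contains two elements:

  • context: a text snippet containing an embedded table.
  • answer: a JSON object representing the extracted table structure.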

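The JSON object has the following format:

    {
      "column_1": { "row_id": "val1", "row_id": "val2", ... },
      "column_2": { "row_id": "val1", "row_id": "val2", ... },
      ...
    }

Each key is a column header, and its value is an object that maps each row ID to that row's value in the column.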
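Because the answer object is column-oriented, it can help to view it row by row when inspecting a data point. The following is a minimal sketch, not part of the dataset's own tooling: `columns_to_rows` is a hypothetical helper, and it assumes that row IDs are integer-like strings such as "0" and "1", as in the examples on this page.

```python
import json


def columns_to_rows(answer: dict) -> list[dict]:
    """Turn {column: {row_id: value}} into a list of row dicts.

    Assumption: row IDs are integer-like strings, so they are sorted
    numerically. A cell missing from a column becomes None.
    """
    # Collect every row ID that appears in any column.
    row_ids = sorted(
        {row_id for cells in answer.values() for row_id in cells},
        key=int,
    )
    return [
        {column: cells.get(row_id) for column, cells in answer.items()}
        for row_id in row_ids
    ]


# A two-row excerpt in the same shape as the answer field.
raw = '{"date": {"0": "1st", "1": "3rd"}, "venue": {"0": "home", "1": "away"}}'
rows = columns_to_rows(json.loads(raw))
print(rows)
```

This sketch handles flat column headers only; an answer with grouped headers would need to be flattened first.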

Examples

Example 1

Context:

example1

Answer:

    {
      "date": { "0": "1st", "1": "3rd", "2": "4th", "3": "11th", "4": "17th", "5": "24th", "6": "25th" },
      "opponent": { "0": "bracknell bees", "1": "slough jets", "2": "slough jets", "3": "wightlink raiders", "4": "romford raiders", "5": "swindon wildcats", "6": "swindon wildcats" },
      "venue": { "0": "home", "1": "away", "2": "home", "3": "home", "4": "home", "5": "away", "6": "home" },
      "result": { "0": "won 4 - 1", "1": "won 7 - 3", "2": "lost 5 - 3", "3": "won 7 - 2", "4": "lost 3 - 4", "5": "lost 2 - 4", "6": "won 8 - 2" },
      "attendance": { "0": 1753, "1": 751, "2": 1421, "3": 1552, "4": 1535, "5": 902, "6": 2124 },
      "competition": { "0": "league", "1": "league", "2": "league", "3": "league", "4": "league", "5": "league", "6": "league" },
      "man of the match": { "0": "martin bouz", "1": "joe watkins", "2": "nick cross", "3": "neil liddiard", "4": "stuart potts", "5": "lukas smital", "6": "vaclav zavoral" }
    }
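The dataset description does not specify a scoring metric, so the following is only one possible way to compare a model's extracted table with a reference answer of this shape. `cell_accuracy` is a hypothetical helper that counts exact cell matches, and the `predicted` object is made-up model output used purely for illustration.

```python
def cell_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference cells that the prediction reproduces exactly.

    Assumptions: flat column headers, exact equality per cell, and no
    penalty for extra cells in the prediction. Returns 0.0 when the
    reference has no cells.
    """
    total = 0
    correct = 0
    for column, cells in reference.items():
        predicted_cells = predicted.get(column, {})
        if not isinstance(predicted_cells, dict):
            predicted_cells = {}  # malformed column: every cell counts as wrong
        for row_id, value in cells.items():
            total += 1
            if row_id in predicted_cells and predicted_cells[row_id] == value:
                correct += 1
    return correct / total if total else 0.0


# Two rows taken from the reference answer above.
reference = {
    "venue": {"0": "home", "1": "away"},
    "attendance": {"0": 1753, "1": 751},
}
# Illustrative model output: one wrong venue, one number returned as a string.
predicted = {
    "venue": {"0": "home", "1": "home"},
    "attendance": {"0": 1753, "1": "751"},
}
score = cell_accuracy(predicted, reference)
print(score)  # 0.5
```

Exact equality treats the string "751" and the number 751 as different, which matters here because the attendance column holds numbers rather than strings. A more lenient comparison would need to normalize types first.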

Example 2

Context:

example2

Answer:

    {
      "country": {
        "exonym": { "0": "iceland", "1": "indonesia", "2": "iran", "3": "iraq", "4": "ireland", "5": "isle of man" },
        "endonym": { "0": "ísland", "1": "indonesia", "2": "īrān ایران", "3": "al - iraq العراق îraq", "4": "éire ireland", "5": "isle of man ellan vannin" }
      },
      "capital": {
        "exonym": { "0": "reykjavík", "1": "jakarta", "2": "tehran", "3": "baghdad", "4": "dublin", "5": "douglas" },
        "endonym": { "0": "reykjavík", "1": "jakarta", "2": "tehrān تهران", "3": "baghdad بغداد bexda", "4": "baile átha cliath dublin", "5": "douglas doolish" }
      },
      "official or native language(s) (alphabet/script)": { "0": "icelandic", "1": "bahasa indonesia", "2": "persian ( arabic script )", "3": "arabic ( arabic script ) kurdish", "4": "irish english", "5": "english manx" }
    }
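In this example the "country" and "capital" headers are grouped: each maps sub-headers ("exonym", "endonym") to their own row objects, while the language column maps row IDs to values directly. The sketch below shows one way to flatten such grouped headers into single-level column names. `flatten_columns` and the " / " separator are illustrative choices, not something the dataset defines, and the sketch assumes that a grouped header is any non-empty object whose values are all objects.

```python
def flatten_columns(answer: dict, sep: str = " / ") -> dict:
    """Flatten grouped headers into single-level column names.

    Assumption: a header is treated as a group when its value is a
    non-empty dict whose values are all dicts; otherwise it is a leaf
    column mapping row IDs to cell values.
    """
    flat = {}
    for header, value in answer.items():
        is_group = (
            isinstance(value, dict)
            and bool(value)
            and all(isinstance(v, dict) for v in value.values())
        )
        if is_group:
            # Recurse so that deeper nesting is handled the same way.
            for sub_header, cells in flatten_columns(value, sep).items():
                flat[f"{header}{sep}{sub_header}"] = cells
        else:
            flat[header] = value
    return flat


# First row of the answer above, with a shortened language header.
nested = {
    "country": {"exonym": {"0": "iceland"}, "endonym": {"0": "ísland"}},
    "capital": {"exonym": {"0": "reykjavík"}, "endonym": {"0": "reykjavík"}},
    "language": {"0": "icelandic"},
}
flat = flatten_columns(nested)
print(list(flat))
```

After flattening, every column has the same row-ID-to-value shape, so grouped and ungrouped tables can be processed by the same downstream code.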
