Effyis/Table-Extraction
Table Extract Dataset
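Overview
This dataset is designed to evaluate the ability of large language models (LLMs) to extract tables from text. It provides a collection of text snippets containing tables, each paired with a structured representation of the table in JSON format.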
Source
The dataset is based on the Table Fact Dataset (also known as TabFact), which contains 16,573 tables extracted from Wikipedia.
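Structure
Each data point contains two elements:
- context: a text snippet with a table embedded in it.
- answer: a JSON object representing the structure of the extracted table.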
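The JSON object has the following format:
```json
{
  "column_1": { "row_id": "val1", "row_id": "val2", ... },
  "column_2": { "row_id": "val1", "row_id": "val2", ... },
  ...
}
```
Each key is a column header, and its value is an object that maps each row ID to that row's value in the column.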
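Because the answer is stored column by column, it is often convenient to turn it into a list of rows before inspecting or comparing tables. The following is a minimal sketch using only the Python standard library; the helper name `columns_to_rows` and the small `sample` object are illustrative assumptions, not part of the dataset.

```python
import json

def columns_to_rows(answer: dict) -> list[dict]:
    """Turn {column: {row_id: value}} into a list of row dicts (hypothetical helper)."""
    # Collect every row ID that appears in any column.
    row_ids = {rid for cells in answer.values() for rid in cells}
    # Row IDs are strings such as "0", "1"; sort numeric IDs by value, others after them.
    ordered = sorted(
        row_ids,
        key=lambda r: (not r.isdigit(), int(r) if r.isdigit() else 0, r),
    )
    # Build one dict per row; a column missing that row ID yields None.
    return [{col: cells.get(rid) for col, cells in answer.items()} for rid in ordered]

# Illustrative answer object in the documented format (not a real data point).
sample = json.loads('{"date": {"0": "1st", "1": "3rd"}, "venue": {"0": "home", "1": "away"}}')
rows = columns_to_rows(sample)
```

This sketch assumes a flat table in which every column maps row IDs directly to values.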
Examples
Example 1
Context:

Answer:
```json
{
  "date": { "0": "1st", "1": "3rd", "2": "4th", "3": "11th", "4": "17th", "5": "24th", "6": "25th" },
  "opponent": { "0": "bracknell bees", "1": "slough jets", "2": "slough jets", "3": "wightlink raiders", "4": "romford raiders", "5": "swindon wildcats", "6": "swindon wildcats" },
  "venue": { "0": "home", "1": "away", "2": "home", "3": "home", "4": "home", "5": "away", "6": "home" },
  "result": { "0": "won 4 - 1", "1": "won 7 - 3", "2": "lost 5 - 3", "3": "won 7 - 2", "4": "lost 3 - 4", "5": "lost 2 - 4", "6": "won 8 - 2" },
  "attendance": { "0": 1753, "1": 751, "2": 1421, "3": 1552, "4": 1535, "5": 902, "6": 2124 },
  "competition": { "0": "league", "1": "league", "2": "league", "3": "league", "4": "league", "5": "league", "6": "league" },
  "man of the match": { "0": "martin bouz", "1": "joe watkins", "2": "nick cross", "3": "neil liddiard", "4": "stuart potts", "5": "lukas smital", "6": "vaclav zavoral" }
}
```
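One possible way to score a model's output against a reference answer like the one above is cell-level exact match. The sketch below is an assumption for illustration only: the dataset description does not prescribe a metric, and the function name `cell_accuracy` and the sample values are hypothetical.

```python
def cell_accuracy(reference: dict, predicted: dict) -> float:
    """Fraction of reference cells reproduced exactly in the prediction (hypothetical metric)."""
    total = 0
    correct = 0
    for column, cells in reference.items():
        predicted_cells = predicted.get(column, {})
        # Treat a malformed (non-object) predicted column as empty.
        if not isinstance(predicted_cells, dict):
            predicted_cells = {}
        for row_id, value in cells.items():
            total += 1
            # Exact comparison: the integer 751 and the string "751" do not match.
            if row_id in predicted_cells and predicted_cells[row_id] == value:
                correct += 1
    return correct / total if total else 0.0

# Illustrative reference and prediction in the documented format.
reference = {"venue": {"0": "home", "1": "away"}, "attendance": {"0": 1753, "1": 751}}
predicted = {"venue": {"0": "home", "1": "home"}, "attendance": {"0": 1753, "1": "751"}}
score = cell_accuracy(reference, predicted)
```

Exact matching is strict about types, which matters here because some columns (such as attendance) hold numbers while others hold strings. A more lenient variant could compare values after converting them to strings.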
Example 2
Context:

Answer:
```json
{
  "country": {
    "exonym": { "0": "iceland", "1": "indonesia", "2": "iran", "3": "iraq", "4": "ireland", "5": "isle of man" },
    "endonym": { "0": "ísland", "1": "indonesia", "2": "īrān ایران", "3": "al - iraq العراق îraq", "4": "éire ireland", "5": "isle of man ellan vannin" }
  },
  "capital": {
    "exonym": { "0": "reykjavík", "1": "jakarta", "2": "tehran", "3": "baghdad", "4": "dublin", "5": "douglas" },
    "endonym": { "0": "reykjavík", "1": "jakarta", "2": "tehrān تهران", "3": "baghdad بغداد bexda", "4": "baile átha cliath dublin", "5": "douglas doolish" }
  },
  "official or native language(s) (alphabet/script)": { "0": "icelandic", "1": "bahasa indonesia", "2": "persian ( arabic script )", "3": "arabic ( arabic script ) kurdish", "4": "irish english", "5": "english manx" }
}
```
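In this example some headers span two levels (for instance "country" contains "exonym" and "endonym"), so a column's value can itself be an object of sub-columns. A minimal sketch for flattening such nested headers into tuple keys follows; the helper name `flatten_columns` and the heuristic that a header is nested when all of its values are objects are assumptions for illustration, not something the dataset documents.

```python
def flatten_columns(answer: dict, prefix: tuple = ()) -> dict:
    """Map nested headers to tuple paths, e.g. ("country", "exonym") (hypothetical helper)."""
    flat = {}
    for key, value in answer.items():
        path = prefix + (key,)
        # Assumed heuristic: a non-empty object whose values are all objects is a
        # group of sub-columns; anything else is a leaf column of row_id -> value.
        is_group = (
            isinstance(value, dict)
            and bool(value)
            and all(isinstance(v, dict) for v in value.values())
        )
        if is_group:
            flat.update(flatten_columns(value, path))
        else:
            flat[path] = value
    return flat

# Abbreviated version of the answer above, with one row per column.
nested = {
    "country": {"exonym": {"0": "iceland"}, "endonym": {"0": "ísland"}},
    "official or native language(s) (alphabet/script)": {"0": "icelandic"},
}
flat = flatten_columns(nested)
```

Single-level headers come out as one-element tuples, so flat and nested tables can be handled by the same downstream code.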



