unpredictable/unpredictable_support-google-com

Name: unpredictable/unpredictable_support-google-com
Creator: unpredictable
Published: 2022-08-28 18:25:26
License: 暂无描述

Hugging Face2022-08-28 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/unpredictable/unpredictable_support-google-com

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - no-annotation language_creators: - found language: - en license: - apache-2.0 multilinguality: - monolingual pretty_name: UnpredicTable-support-google-com size_categories: - 100K<n<1M source_datasets: [] task_categories: - multiple-choice - question-answering - zero-shot-classification - text2text-generation - table-question-answering - text-generation - text-classification - tabular-classification task_ids: - multiple-choice-qa - extractive-qa - open-domain-qa - closed-domain-qa - closed-book-qa - open-book-qa - language-modeling - multi-class-classification - natural-language-inference - topic-classification - multi-label-classification - tabular-multi-class-classification - tabular-multi-label-classification --- # Dataset Card for "UnpredicTable-support-google-com" - Dataset of Few-shot Tasks from Tables ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-instances) - [Data Splits](#data-instances) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) ## Dataset Description - **Repository:** https://github.com/AnonCodeShare/few-shot-adaptation - **Paper:** Few-shot Adaptation Works with UnpredicTable Data ### Dataset Summary The UnpredicTable dataset consists of web tables formatted as few-shot tasks for fine-tuning language models to improve their few-shot performance. There are several dataset versions available: * [UnpredicTable-full](https://huggingface.co/datasets/unpredictable/unpredictable_full): Starting from the initial WTC corpus of 50M tables, we apply our tables-to-tasks procedure to produce our resulting dataset, [UnpredicTable-full](https://huggingface.co/datasets/unpredictable/unpredictable_full), which comprises 413,299 tasks from 23,744 unique websites. * [UnpredicTable-unique](https://huggingface.co/datasets/unpredictable/unpredictable_unique): This is the same as [UnpredicTable-full](https://huggingface.co/datasets/unpredictable/unpredictable_full) but filtered to have a maximum of one task per website. [UnpredicTable-unique](https://huggingface.co/datasets/unpredictable/unpredictable_unique) contains exactly 23,744 tasks from 23,744 websites. * [UnpredicTable-5k](https://huggingface.co/datasets/unpredictable/unpredictable_5k): This dataset contains 5k random tables from the full dataset. * UnpredicTable data subsets based on the website of origin: * [UnpredicTable-support-google-com](https://huggingface.co/datasets/unpredictable/unpredictable_support-google-com) ### Supported Tasks and Leaderboards Since the tables come from the web, the distribution of tasks and topics is very broad. The shape of our dataset is very wide, i.e., we have 1000's of tasks, while each task has only a few examples, compared to most current NLP datasets which are very deep, i.e., 10s of tasks with many examples. This implies that our dataset covers a broad range of potential tasks, e.g., multiple-choice, question-answering, table-question-answering, text-classification, etc. The intended use of this dataset is to improve few-shot performance by fine-tuning/pre-training on our dataset. ### Languages English ## Dataset Structure ### Data Instances Each task is represented as a jsonline file and consists of several few-shot examples. Each example is a dictionary containing a field 'task', which identifies the task, followed by an 'input', 'options', and 'output' field. The 'input' field contains several column elements of the same row in the table, while the 'output' field is a target which represents an individual column of the same row. Each task contains several such examples which can be concatenated as a few-shot task. In the case of multiple choice classification, the 'options' field contains the possible classes that a model needs to choose from. There are also additional meta-data fields such as 'pageTitle', 'title', 'outputColName', 'url', 'wdcFile'. ### Data Fields 'task': task identifier 'input': column elements of a specific row in the table. 'options': for multiple choice classification, it provides the options to choose from. 'output': target column element of the same row as input. 'pageTitle': the title of the page containing the table. 'outputColName': output column name 'url': url to the website containing the table 'wdcFile': WDC Web Table Corpus file ### Data Splits The UnpredicTable datasets do not come with additional data splits. ## Dataset Creation ### Curation Rationale Few-shot training on multi-task datasets has been demonstrated to improve language models' few-shot learning (FSL) performance on new tasks, but it is unclear which training tasks lead to effective downstream task adaptation. Few-shot learning datasets are typically produced with expensive human curation, limiting the scale and diversity of the training tasks available to study. As an alternative source of few-shot data, we automatically extract 413,299 tasks from diverse internet tables. We provide this as a research resource to investigate the relationship between training data and few-shot learning. ### Source Data #### Initial Data Collection and Normalization We use internet tables from the English-language Relational Subset of the WDC Web Table Corpus 2015 (WTC). The WTC dataset tables were extracted from the July 2015 Common Crawl web corpus (http://webdatacommons.org/webtables/2015/EnglishStatistics.html). The dataset contains 50,820,165 tables from 323,160 web domains. We then convert the tables into few-shot learning tasks. Please see our publication for more details on the data collection and conversion pipeline. #### Who are the source language producers? The dataset is extracted from [WDC Web Table Corpora](http://webdatacommons.org/webtables/). ### Personal and Sensitive Information The data was extracted from [WDC Web Table Corpora](http://webdatacommons.org/webtables/), which in turn extracted tables from the [Common Crawl](https://commoncrawl.org/). We did not filter the data in any way. Thus any user identities or otherwise sensitive information (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history, etc.) might be contained in our dataset. ## Considerations for Using the Data ### Social Impact of Dataset This dataset is intended for use as a research resource to investigate the relationship between training data and few-shot learning. As such, it contains high- and low-quality data, as well as diverse content that may be untruthful or inappropriate. Without careful investigation, it should not be used for training models that will be deployed for use in decision-critical or user-facing situations. ### Discussion of Biases Since our dataset contains tables that are scraped from the web, it will also contain many toxic, racist, sexist, and otherwise harmful biases and texts. We have not run any analysis on the biases prevalent in our datasets. Neither have we explicitly filtered the content. This implies that a model trained on our dataset may potentially reflect harmful biases and toxic text that exist in our dataset. ### Other Known Limitations No additional known limitations. ## Additional Information ### Licensing Information Apache 2.0

提供机构：

unpredictable

原始信息汇总

数据集概述：UnpredicTable-support-google-com

数据集描述

数据集摘要

UnpredicTable-support-google-com 数据集包含来自网络表格的少量样本任务，用于微调语言模型以提高其少量样本性能。该数据集包含多个版本，包括：

UnpredicTable-full: 从初始的WTC语料库中提取50M表格，通过表格到任务的转换过程生成，包含413,299个任务。
UnpredicTable-unique: 与UnpredicTable-full相同，但每个网站最多包含一个任务。
UnpredicTable-5k: 从完整数据集中随机抽取5,000个表格。
UnpredicTable-support-google-com: 基于网站来源的子集。

支持的任务和排行榜

数据集支持多种任务，包括多项选择、问答、零样本分类、文本到文本生成、表格问答、文本生成、文本分类和表格分类等。

语言

数据集仅包含英语。

数据集结构

数据实例

每个任务以jsonline文件形式表示，包含多个少量样本示例。每个示例为一个字典，包含task（任务标识）、input（输入列元素）、options（选项）和output（输出目标）等字段。

数据字段

task: 任务标识
input: 表格中特定行的列元素
options: 多选项分类的选择项
output: 与输入行对应的输出目标列元素
pageTitle: 包含表格的页面标题
outputColName: 输出列名
url: 包含表格的网站URL
wdcFile: WDC Web Table Corpus文件

数据分割

数据集未提供额外的数据分割。

数据集创建

来源数据

数据集从WDC Web Table Corpus 2015的英语关系子集中自动提取任务，该子集包含50,820,165个表格。

个人和敏感信息

数据集可能包含未经过滤的个人和敏感信息，因为数据直接从WDC Web Table Corpus提取。

使用数据的考虑

社会影响

数据集主要用于研究目的，可能包含高质量和低质量数据，以及可能不真实或不适当的内容。在使用时应谨慎，避免用于决策关键或面向用户的场景。

偏见讨论

数据集可能包含来自网络的偏见内容，如种族主义、性别歧视等，未经分析和过滤。因此，模型训练可能反映这些有害偏见。

5,000+

优质数据集

54 个

任务类型

进入经典数据集