templatic_generation_tasks

Name: templatic_generation_tasks
Creator: maas
Published: 2025-11-27 16:41:43
License: No description available

ModelScope community. Updated: 2025-11-27. Indexed: 2025-07-26.

Download link:

https://modelscope.cn/datasets/microsoft/templatic_generation_tasks


Resource description:

# Dataset Card for Active/Passive/Logical Transforms

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Dataset Subsets (Tasks)](#data-tasks)
  - [Dataset Splits](#data-splits)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** [Roland Fernandez](mailto:rfernand@microsoft.com)

### Dataset Summary

This dataset is a synthetic dataset containing a set of templatic generation tasks using both English and random 2-letter words.

### Supported Tasks and Leaderboards

[TBD]

### Languages

All data is in English or random 2-letter words.

## Dataset Structure

The dataset consists of several subsets, or tasks. Each task contains a train split, a dev split, and a test split, and multiple out-of-distribution splits. Each sample in a split contains a source string, a target string, and an annotation string (describing the sample).
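The three-field sample layout described above can be sketched as a small record type. This is a minimal illustration, not the dataset's own loader: the `Sample` class and `parse_raw` helper are hypothetical names, the tab positions in the sample record are an assumption (the card says the fields are tab separated but renders them as spaces), and the fallback to `ast.literal_eval` reflects the card's note that the annotation may be a Python or a JSON dictionary.

```python
import ast
import json
from dataclasses import dataclass


@dataclass
class Sample:
    source: str       # N-shot examples plus the test cue
    target: str       # desired (gold) output
    annotation: dict  # decoded annotation string


def parse_raw(raw: str) -> Sample:
    """Split one tab-separated raw record into its three fields."""
    source, target, annotation = raw.rstrip("\n").split("\t")
    try:
        decoded = json.loads(annotation)
    except json.JSONDecodeError:
        # Assumption: non-JSON annotations are Python dict literals.
        decoded = ast.literal_eval(annotation)
    return Sample(source, target, decoded)


# Record taken from the 1_shot_eng train example in this card.
# The placement of the tab characters is assumed for illustration.
raw = (
    "Q any mouse ) ; bear A any mouse & . Q road ) ; building A"
    "\troad & ."
    '\t{"cons_count": "Q2A1", "cons_len": "Q21.Q11"}'
)
sample = parse_raw(raw)
print(sample.target)
print(sample.annotation["cons_count"])
```

Decoding the annotation up front keeps later filtering (for example by `cons_count`) free of repeated string parsing.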
### Dataset Subsets (Tasks)

The dataset consists of the following tasks:

```
- 1_shot_rlw (1 example input/output pair, a test input, and the gold output, all using random 2-letter words)
- 1_shot_eng (same as 1_shot_rlw but using English words).
- 1_shot_rlw_10x (same as 1_shot_rlw, but with 10x the training samples)
- 2_shot_rlw (2 example input/output pairs, a test input, and the gold output, all using random 2-letter words)
- 3_shot_rlw (3 example input/output pairs, a test input, and the gold output, all using random 2-letter words)
- 5_shot_rlw (5 example input/output pairs, a test input, and the gold output, all using random 2-letter words)
- 10_shot_rtw (10 example input/output pairs, a test input, and the gold output, all using random 2-letter words)
```

### Data Splits

Most tasks have the following splits:

- train
- dev
- test
- ood_lexical
- ood_cons_count_3
- ood_cons_count_5
- ood_cons_count_7
- ood_cons_count_10
- ood_cons_len_3
- ood_cons_len_5
- ood_cons_len_7
- ood_cons_len_10

Here is a table showing how the number of examples varies by split (for most tasks):

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| train         | 280,000                      |
| dev           | 35,000                       |
| test          | 35,000                       |
| ood_*         | 84,000                       |

### Data Instances

Each sample consists of a source, target, and annotation string (all tab separated). Here is an example from the *train* split of the *1_shot_eng* task:

```
{
  'raw': 'Q any mouse ) ; bear A any mouse & . Q road ) ; building A road & . {"cons_count": "Q2A1", "cons_len": "Q21.Q11"}',
  'source': 'Q any mouse ) ; bear A any mouse & . Q road ) ; building A',
  'target': 'road & .',
  'annotation': '{"cons_count": "Q2A1", "cons_len": "Q21.Q11"}'
}
```

### Data Fields

- `source`: the string containing the N-shot examples and the test cue
- `target`: the string containing the desired (gold) output
- `annotation`: the string describing the example (as a Python or JSON dictionary)

## Dataset Creation

### Curation Rationale

We wanted a dataset that would test in-context (and from-scratch) learning of abstract, semantic-free symbolic transformations, based on a random template for each example. The dataset is designed to test 3 types of out-of-distribution generalization:

- lexical: known words used in new contexts (relative to the train split)
- length: the train split uses constituents of 1, 2, or 4 words; OOD splits use 3, 5, 7, or 10 words
- count: the train split uses 1, 2, or 4 constituents; OOD splits use 3, 5, 7, or 10 constituents

### Source Data

[N/A]

#### Initial Data Collection and Normalization

[N/A]

#### Who are the source language producers?

The dataset was generated from templates designed by Paul Smolensky and Roland Fernandez.

### Annotations

Besides the source and target strings, each sample contains an annotation string that describes the sample.

#### Annotation process

The annotation columns were generated from each sample template.

#### Who are the annotators?

[N/A]

### Personal and Sensitive Information

No names or other sensitive information are included in the data.

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to research how LLMs and from-scratch models can learn to solve templatic generation tasks.

### Discussion of Biases

[TBD]

### Other Known Limitations

[TBD]

## Additional Information

The internal name of this dataset is nc_tgt_v11. Also see the DATASET_INFO.md and GRAMMAR.md files.

### Dataset Curators

The dataset was generated from templates designed by Paul Smolensky and Roland Fernandez.
### Citation Information

[TBD]

### Contributions

Thanks to [The Neurocompositional AI group at Microsoft Research](https://www.microsoft.com/en-us/research/project/neurocompositional-ai/) for creating and adding this dataset.
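As a companion to the Data Splits and Curation Rationale sections, the split naming scheme can be enumerated programmatically. This is only a sketch: `split_names` and `generalization_axis` are hypothetical helpers, not part of any released tooling, and the mapping from `ood_cons_len_*` to the "length" axis and `ood_cons_count_*` to the "count" axis is inferred from the split names and the rationale text.

```python
from typing import List, Optional, Tuple

# OOD sizes listed in the card for both the length and count axes.
OOD_SIZES = (3, 5, 7, 10)


def split_names() -> List[str]:
    """Return the split names that most tasks provide, in card order."""
    names = ["train", "dev", "test", "ood_lexical"]
    names += [f"ood_cons_count_{n}" for n in OOD_SIZES]
    names += [f"ood_cons_len_{n}" for n in OOD_SIZES]
    return names


def generalization_axis(split: str) -> Optional[Tuple[str, Optional[int]]]:
    """Map a split name to (axis, size); None for in-distribution splits."""
    if split == "ood_lexical":
        return ("lexical", None)
    prefixes = (("ood_cons_count_", "count"), ("ood_cons_len_", "length"))
    for prefix, axis in prefixes:
        if split.startswith(prefix):
            return (axis, int(split[len(prefix):]))
    return None


for name in split_names():
    print(name, generalization_axis(name))
```

Grouping evaluation results by the returned axis makes it straightforward to report lexical, length, and count generalization separately.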


Provider:

maas

Created:

2025-07-22
