AlgorithmicResearchGroup/arxiv_nlp_intstruct

Name: AlgorithmicResearchGroup/arxiv_nlp_intstruct
Creator: AlgorithmicResearchGroup
Published: 2024-09-04 06:54:22
License: No description available

Hugging Face · Updated 2024-09-04 · Indexed 2025-04-08

Download link:

https://hf-mirror.com/datasets/AlgorithmicResearchGroup/arxiv_nlp_intstruct

Resource overview:

---
annotations_creators:
- no-annotation
language:
- en
license:
- cc0-1.0
multilinguality:
- monolingual
pretty_name: arxiv_nlp_intstruct
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
- masked-language-modeling
---

# Dataset Card for "arxiv_nlp_intstruct"

### Dataset Summary

The "AlgorithmicResearchGroup/arxiv_nlp_intstruct" dataset consists of question-answer pairs derived from ArXiv abstracts in the cs.CL category. Questions and answers are generated using the [GPT-3.5-turbo model](https://openai.com/chatgpt).

### Languages

English

## Dataset Structure

### Data Instances

#### train

- **Size of downloaded dataset files:** 38.4 MB

An example of 'train' looks as follows.

    {
      "question": "What is the role of the light-weight cross-attention mechanism in MixEncoder?",
      "answer": "In the MixEncoder, the lightweight cross-attention mechanism plays a crucial role in capturing cross-modal dependencies between different modalities (such as text and image). The purpose of cross-attention is to align features from one modality with features from another modality, allowing the model to understand the correlation between them. Specifically, the lightweight cross-attention mechanism in MixEncoder enables the model to attend to relevant information in the other modality when processing a specific modality. This helps the model to exploit important semantic relationships between different modalities and improve the overall representation learning process. By incorporating cross-attention, the MixEncoder provides a way for different modalities to interact and co-adapt during training, enabling the model to effectively understand and build representations that capture the underlying relationships between text and image inputs. This boosts the performance of multimodal models in tasks such as image captioning, visual question answering, and multimodal sentiment analysis, among others."
    }

### Data Fields

The data fields present in the dataset are as follows:

- question: a string feature representing the question.
- answer: a string feature representing the answer.

### Data Splits

train: 36,894 question-answer pairs

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

Question-answer pairs derived from [ArXiv](https://arxiv.org/) abstracts.

#### Initial Data Collection and Normalization

Questions and answers are generated using the [GPT-3.5-turbo model](https://openai.com/chatgpt) from ArXiv abstracts in the cs.CL category.

### Annotations

The dataset doesn't contain annotations.

### Personal and Sensitive Information

None

#### Notice policy

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

And contact us at the following email addresses: matt at artifactai.com and datasets at huggingface.co

#### Take down policy

The original authors will comply with legitimate requests by removing the affected sources from the next release of the corpus. Hugging Face will also update this repository accordingly.

### Citation Information

```
@misc{arxiv_nlp_intstruct,
  title={arxiv_nlp_intstruct},
  author={Matthew Kenney},
  year={2023}
}
```
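
The two-field record layout described in the card (`question`, `answer`) can be exercised with a short Python sketch. This is an illustration under stated assumptions, not part of the original card: the helper names (`validate_record`, `format_example`, `load_train_split`) and the prompt template are hypothetical, and loading assumes the Hugging Face `datasets` package is installed and the repository is reachable. The optional `endpoint` argument assumes the `HF_ENDPOINT` environment variable is honored by the installed `huggingface_hub` version, which is how a mirror such as the one linked above is typically selected.

```python
import os
from typing import Dict, Optional

# Repository id as given in the dataset card.
REPO_ID = "AlgorithmicResearchGroup/arxiv_nlp_intstruct"
# The card documents exactly these two string fields.
REQUIRED_FIELDS = ("question", "answer")


def validate_record(record: Dict[str, str]) -> bool:
    """Return True if both documented fields are present as non-empty strings."""
    return all(
        isinstance(record.get(field), str) and bool(record[field].strip())
        for field in REQUIRED_FIELDS
    )


def format_example(record: Dict[str, str]) -> str:
    """Render one record as a question/answer text block.

    The template is a hypothetical choice for illustration; the card does not
    prescribe any prompt format.
    """
    return (
        "### Question:\n" + record["question"].strip()
        + "\n\n### Answer:\n" + record["answer"].strip()
    )


def load_train_split(endpoint: Optional[str] = None):
    """Load the 'train' split (needs the `datasets` package and network access).

    `endpoint` is an optional mirror base URL (assumption: HF_ENDPOINT is read
    by huggingface_hub when it is first imported, so set it before that).
    """
    if endpoint is not None:
        os.environ["HF_ENDPOINT"] = endpoint
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split="train")


if __name__ == "__main__":
    train = load_train_split()
    first = train[0]
    if validate_record(first):
        print(format_example(first))
```

Keeping validation and formatting separate from loading means the record-level helpers can be checked offline against a hand-written dictionary before any download is attempted.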


Provider:

AlgorithmicResearchGroup
