AlgorithmicResearchGroup/arxiv_nlp_intstruct
Hugging Face, updated 2024-09-04, indexed 2025-04-08
Download link:
https://hf-mirror.com/datasets/AlgorithmicResearchGroup/arxiv_nlp_intstruct
Resource description:
---
annotations_creators:
- no-annotation
language:
- en
license:
- cc0-1.0
multilinguality:
- monolingual
pretty_name: arxiv_nlp_intstruct
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
- masked-language-modeling
---
# Dataset Card for "arxiv_nlp_intstruct"
### Dataset Summary
The "AlgorithmicResearchGroup/arxiv_nlp_intstruct" dataset consists of question-answer pairs derived from ArXiv abstracts in the cs.CL category.
Questions and answers were generated using the [GPT-3.5-turbo model](https://openai.com/chatgpt).
### Languages
English
## Dataset Structure
### Data Instances
#### train
- **Size of downloaded dataset files:** 38.4 MB
An example from the `train` split looks as follows:
{
  "question": "What is the role of the light-weight cross-attention mechanism in MixEncoder?",
  "answer": "In the MixEncoder, the lightweight cross-attention mechanism plays a crucial role in capturing cross-modal dependencies between different modalities (such as text and image). The purpose of cross-attention is to align features from one modality with features from another modality, allowing the model to understand the correlation between them. Specifically, the lightweight cross-attention mechanism in MixEncoder enables the model to attend to relevant information in the other modality when processing a specific modality. This helps the model to exploit important semantic relationships between different modalities and improve the overall representation learning process. By incorporating cross-attention, the MixEncoder provides a way for different modalities to interact and co-adapt during training, enabling the model to effectively understand and build representations that capture the underlying relationships between text and image inputs. This boosts the performance of multimodal models in tasks such as image captioning, visual question answering, and multimodal sentiment analysis, among others."
}
### Data Fields
The data fields present in the dataset are as follows:
- question: a string feature representing the question.
- answer: a string feature representing the answer.
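Because each record carries only these two string fields, it maps naturally onto a prompt/completion pair for text generation. The sketch below shows one possible mapping. The `PROMPT_TEMPLATE` string and the `to_prompt_completion` helper are hypothetical names introduced for illustration; the card itself does not prescribe any prompt format.

```python
from typing import Dict, Tuple

# Hypothetical template (an assumption, not specified by the dataset card).
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"


def to_prompt_completion(record: Dict[str, str]) -> Tuple[str, str]:
    """Turn one record into a (prompt, completion) pair.

    Raises ValueError if either documented field is missing or not a string.
    """
    for field in ("question", "answer"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"record is missing string field {field!r}")
    prompt = PROMPT_TEMPLATE.format(question=record["question"].strip())
    # Leading space so the completion reads naturally after "Answer:".
    completion = " " + record["answer"].strip()
    return prompt, completion


# The answer is shortened here for brevity; real answers are much longer.
example = {
    "question": "What is the role of the light-weight cross-attention mechanism in MixEncoder?",
    "answer": "In the MixEncoder, the lightweight cross-attention mechanism plays a crucial role ...",
}
prompt, completion = to_prompt_completion(example)
print(prompt)
```

Validating both fields up front keeps malformed rows from silently producing empty prompts; whether to strip whitespace or keep it verbatim is a design choice left to the user.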
### Data Splits
train: 36,894 question-answer pairs
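A minimal sketch of loading the single `train` split with the Hugging Face `datasets` library, assuming the library is installed and the Hub is reachable. `load_train_split` and `row_count_matches_card` are hypothetical helper names introduced here; the expected row count is the figure stated above.

```python
REPO_ID = "AlgorithmicResearchGroup/arxiv_nlp_intstruct"
EXPECTED_TRAIN_ROWS = 36_894  # figure stated on this card


def load_train_split():
    """Download (or reuse the cached copy of) the train split."""
    # Imported lazily so the rest of this sketch runs without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


def row_count_matches_card(num_rows: int) -> bool:
    """Check a loaded split's size against the count reported on the card."""
    return num_rows == EXPECTED_TRAIN_ROWS


# Usage (requires network access):
#   train = load_train_split()
#   print(train[0]["question"])
#   print(row_count_matches_card(train.num_rows))
```

If the count check fails, the repository may have been updated since this card was written, so treat a mismatch as a prompt to re-read the card rather than as an error.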
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
Question-answer pairs derived from [ArXiv](https://arxiv.org/) abstracts.
#### Initial Data Collection and Normalization
Questions and answers were generated from ArXiv abstracts in the cs.CL category using the [GPT-3.5-turbo model](https://openai.com/chatgpt).
### Annotations
The dataset doesn't contain annotations.
### Personal and Sensitive Information
None
#### Notice policy
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
Then contact us at the following email addresses: matt at artifactai.com and datasets at huggingface.co
#### Take down policy
The original authors will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Hugging Face will also update this repository accordingly.
### Citation Information
```
@misc{arxiv_nlp_intstruct,
title={arxiv_nlp_intstruct},
author={Matthew Kenney},
year={2023}
}
```
Provider:
AlgorithmicResearchGroup



