lambada_openai
收藏魔搭社区2025-12-05 更新2025-09-13 收录
下载链接:
https://modelscope.cn/datasets/EleutherAI/lambada_openai
下载链接
链接失效反馈官方服务:
资源简介:
## Dataset Description
- **Repository:** [openai/gpt2](https://github.com/openai/gpt-2)
- **Paper:** Radford et al. [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)
### Dataset Summary
This dataset is comprised of the LAMBADA test split as pre-processed by OpenAI (see relevant discussions [here](https://github.com/openai/gpt-2/issues/131#issuecomment-497136199) and [here](https://github.com/huggingface/transformers/issues/491)). It also contains machine translated versions of the split in German, Spanish, French, and Italian.
LAMBADA is used to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative texts sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole text, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse.
### Languages
English, German, Spanish, French, and Italian.
### Source Data
For non-English languages, the data splits were produced by Google Translate. See the [`translation_script.py`](translation_script.py) for more details.
## Additional Information
### Hash Checksums
For data integrity checks we leave the following checksums for the files in this dataset:
| File Name | Checksum (SHA-256) |
|--------------------------------------------------------------------------|------------------------------------------------------------------|
| lambada_test_de.jsonl | 51c6c1795894c46e88e4c104b5667f488efe79081fb34d746b82b8caa663865e |
| [openai/lambada_test.jsonl](https://openaipublic.blob.core.windows.net/gpt-2/data/lambada_test.jsonl) | 4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226 |
| lambada_test_en.jsonl | 4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226 |
| lambada_test_es.jsonl | ffd760026c647fb43c67ce1bc56fd527937304b348712dce33190ea6caba6f9c |
| lambada_test_fr.jsonl | 941ec6a73dba7dc91c860bf493eb66a527cd430148827a4753a4535a046bf362 |
| lambada_test_it.jsonl | 86654237716702ab74f42855ae5a78455c1b0e50054a4593fb9c6fcf7fad0850 |
### Licensing
License: [Modified MIT](https://github.com/openai/gpt-2/blob/master/LICENSE)
### Citation
```bibtex
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}
```
```bibtex
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
```
### Contributions
Thanks to Sid Black ([@sdtblck](https://github.com/sdtblck)) for translating the `lambada_openai` dataset into the non-English languages.
Thanks to Jonathan Tow ([@jon-tow](https://github.com/jon-tow)) for adding this dataset.
## 数据集描述
- **仓库**:[openai/gpt-2](https://github.com/openai/gpt-2)
- **论文**:Radford 等著《[语言模型是无监督多任务学习者(Language Models are Unsupervised Multitask Learners)](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)》
### 数据集摘要
本数据集包含由 OpenAI 预处理的 LAMBADA(LAMBADA)测试集划分,相关讨论可参阅[此处](https://github.com/openai/gpt-2/issues/131#issuecomment-497136199)与[此处](https://github.com/huggingface/transformers/issues/491)。同时本数据集还包含该测试集划分的德语、西班牙语、法语及意大利语机器翻译版本。
LAMBADA 数据集用于通过单词预测任务评估计算模型的文本理解能力。该数据集由叙事文本集合构成,其共同特征为:人类受试者若阅读完整文本,则可猜出目标词的最后一个单词;但若仅查看目标词前的最后一个句子,则无法完成预测。要在 LAMBADA 任务中取得优异表现,计算模型不能仅依赖局部上下文,还需具备追踪更宽泛语篇信息的能力。
### 涉及语言
英语、德语、西班牙语、法语及意大利语。
### 源数据
对于非英语语言,其数据划分由 Google Translate 生成,详细信息可参阅 `translation_script.py` 文件。
## 附加信息
### 哈希校验和
为保障数据完整性,我们提供本数据集各文件的如下 SHA-256 校验和:
| 文件名 | SHA-256 校验值 |
|-----------------------------------------------------------------------|----------------------------------------------------------------|
| lambada_test_de.jsonl | 51c6c1795894c46e88e4c104b5667f488efe79081fb34d746b82b8caa663865e |
| [openai/lambada_test.jsonl](https://openaipublic.blob.core.windows.net/gpt-2/data/lambada_test.jsonl) | 4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226 |
| lambada_test_en.jsonl | 4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226 |
| lambada_test_es.jsonl | ffd760026c647fb43c67ce1bc56fd527937304b348712dce33190ea6caba6f9c |
| lambada_test_fr.jsonl | 941ec6a73dba7dc91c860bf493eb66a527cd430148827a4753a4535a046bf362 |
| lambada_test_it.jsonl | 86654237716702ab74f42855ae5a78455c1b0e50054a4593fb9c6fcf7fad0850 |
### 授权协议
授权协议:[修改版 MIT 协议(Modified MIT)](https://github.com/openai/gpt-2/blob/master/LICENSE)
### 引用
bibtex
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}
bibtex
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
### 贡献致谢
感谢 Sid Black([@sdtblck](https://github.com/sdtblck))将 `lambada_openai` 数据集翻译为非英语版本。
感谢 Jonathan Tow([@jon-tow](https://github.com/jon-tow))为本数据集的添加提供帮助。
提供机构:
maas
创建时间:
2025-08-16



