TucanoBR/lambada-pt
Source: Hugging Face. Updated 2024-11-07; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/TucanoBR/lambada-pt
Resource description:
---
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: last_word
    dtype: string
  splits:
  - name: train
    num_bytes: 1844684
    num_examples: 5153
  download_size: 1241703
  dataset_size: 1844684
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- pt
pretty_name: LAMBADA-PT
size_categories:
- 1K<n<10K
---
# LAMBADA-PT
- **Repository:** [TucanoBR/lambada-pt](https://huggingface.co/datasets/TucanoBR/lambada-pt)
- **Paper:** Radford et al. [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)
## Dataset Summary
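This dataset is a Portuguese translation of the LAMBADA test split as pre-processed by OpenAI. It contains 5,153 examples in a single split named `train`, each with a `sentence` field and a `last_word` field.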
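A minimal loading sketch, assuming the Hugging Face `datasets` library is installed. The repository id, the split name, and the two column names come from the card metadata; the helper names `load_lambada_pt` and `has_expected_schema` are illustrative and not part of any library.

```python
EXPECTED_COLUMNS = ("sentence", "last_word")  # column names listed in the card metadata


def load_lambada_pt():
    """Download the only split, which the metadata names `train`."""
    # Imported lazily so the schema check below works without `datasets`.
    from datasets import load_dataset

    return load_dataset("TucanoBR/lambada-pt", split="train")


def has_expected_schema(row):
    """Return True if a row holds both expected columns as strings."""
    return all(isinstance(row.get(col), str) for col in EXPECTED_COLUMNS)
```

Calling `has_expected_schema(ds[0])` on the result of `load_lambada_pt()` is a quick sanity check before running an evaluation.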
LAMBADA is used to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative texts sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole text, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse.
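The word prediction task can be sketched as exact-match accuracy on the final word. This is an illustrative sketch, not the evaluation protocol of the cited paper. `predict_last_word` is a hypothetical callable wrapping whatever model is under test. The card does not say whether `sentence` already ends with `last_word`, so the code assumes either is possible and strips the target when it is present.

```python
from typing import Callable, Iterable, Mapping


def context_without_target(sentence: str, last_word: str) -> str:
    """Return the passage with the target word removed from the end, if present."""
    stripped = sentence.rstrip()
    if last_word and stripped.endswith(last_word):
        head = stripped[: -len(last_word)]
        # Only strip when the match is a whole word, not the tail of a longer one.
        if not head or head[-1].isspace():
            return head.rstrip()
    return stripped


def last_word_accuracy(
    rows: Iterable[Mapping[str, str]],
    predict_last_word: Callable[[str], str],  # hypothetical model wrapper
) -> float:
    """Fraction of rows where the prediction exactly matches `last_word`."""
    correct = 0
    total = 0
    for row in rows:
        context = context_without_target(row["sentence"], row["last_word"])
        prediction = predict_last_word(context).strip()
        correct += prediction == row["last_word"].strip()
        total += 1
    return correct / total if total else 0.0
```

Exact string match is a deliberately strict choice; an evaluation that normalizes case or punctuation would report different numbers.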
## Languages
Portuguese
## Licensing
License: [Modified MIT](https://github.com/openai/gpt-2/blob/master/LICENSE)
## Citation
```bibtex
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}
```
Provided by:
TucanoBR



