lama
Source: ModelScope community (updated 2025-12-05, indexed 2025-05-24)
Download link: https://modelscope.cn/datasets/facebook/lama
# Dataset Card for LAMA: LAnguage Model Analysis - a dataset for probing and analyzing the factual and commonsense knowledge contained in pretrained language models.
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/facebookresearch/LAMA
- **Repository:** https://github.com/facebookresearch/LAMA
- **Paper:**
@inproceedings{petroni2019language,
title={Language Models as Knowledge Bases?},
author={F. Petroni and T. Rockt{\"{a}}schel and A. H. Miller and P. Lewis and A. Bakhtin and Y. Wu and S. Riedel},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2019}
}
@inproceedings{petroni2020how,
title={How Context Affects Language Models' Factual Predictions},
author={Fabio Petroni and Patrick Lewis and Aleksandra Piktus and Tim Rockt{\"a}schel and Yuxiang Wu and Alexander H. Miller and Sebastian Riedel},
booktitle={Automated Knowledge Base Construction},
year={2020},
url={https://openreview.net/forum?id=025X0zPfn}
}
### Dataset Summary
This dataset provides the data for LAMA. It includes subsets of
Google_RE
(https://code.google.com/archive/p/relation-extraction-corpus/), T-REx
(a subset of Wikidata triples), ConceptNet
(https://github.com/commonsense/conceptnet5/wiki), and SQuAD. Each
source has its own config: "google_re", "trex", "conceptnet", and
"squad", respectively.
The data has been cleaned up, and each record adds a masked sentence
together with the answer for its [MASK] token. The accuracy with which
a language model predicts the [MASK] token indicates how well it knows
facts and commonsense information. Only the "object" slots are masked.
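As a rough illustration of the accuracy measure described above, the following minimal sketch scores a predictor by exact match between its top answer and `obj_label`. The `predict` callable and `toy_predict` are hypothetical stand-ins for a real masked language model, and exact string matching is an assumption, not necessarily the metric used by the LAMA authors.

```python
from typing import Callable, Iterable, Mapping


def precision_at_1(
    records: Iterable[Mapping[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of records whose top prediction for [MASK] equals obj_label."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if predict(r["masked_sentence"]).strip() == r["obj_label"].strip()
    )
    return hits / len(records)


def toy_predict(masked_sentence: str) -> str:
    # Hypothetical stand-in for a masked language model: always answers "gold".
    return "gold"


# Two records taken from the squad and conceptnet instances in this card.
records = [
    {
        "masked_sentence": "To emphasize the 50th anniversary of the Super Bowl the [MASK] color was used.",
        "obj_label": "gold",
    },
    {
        "masked_sentence": "One of the things you do when you are alive is [MASK].",
        "obj_label": "think",
    },
]

score = precision_at_1(records, toy_predict)
print(score)
```

In practice `predict` would wrap a model's fill-mask call and return its highest-ranked token for the [MASK] position.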
This version of the dataset includes "negated" sentences alongside the
masked sentences. Some configs also include "template" and
"template_negated" fields of the form "[X] some text [Y]", where [X]
and [Y] are the subject and object slots, respectively, of a given
relation.
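The "[X] some text [Y]" templates can be turned into cloze-style probes by substituting the subject label for [X] and a mask token for [Y]. The sketch below shows one way to do this; `fill_template` is a hypothetical helper, not part of the LAMA code base, and the sample values come from the google_re instance in this card.

```python
from typing import Optional


def fill_template(template: str, sub_label: str, obj: str = "[MASK]") -> Optional[str]:
    """Fill [X] with the subject label and [Y] with `obj` (the mask by default)."""
    if not template:
        # Templates may be missing and stored as an empty string.
        return None
    return template.replace("[X]", sub_label).replace("[Y]", obj)


record = {
    "sub_label": "Peter F. Martin",
    "template": "[X] (born [Y]).",
    "template_negated": "[X] (not born [Y]).",
}

probe = fill_template(record["template"], record["sub_label"])
negated_probe = fill_template(record["template_negated"], record["sub_label"])
print(probe)
print(negated_probe)
```

Passing the object label as `obj` instead of the default mask yields the full statement, which can be handy for sanity-checking a template.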
See the paper for more details. For more information, also see:
https://github.com/facebookresearch/LAMA
### Languages
en
## Dataset Structure
### Data Instances
An example instance from the trex config:
```
{
 'description': 'the item (an institution, law, public office ...) or statement belongs to or has power over or applies to the value (a territorial jurisdiction: a country, state, municipality, ...)',
 'label': 'applies to jurisdiction',
 'masked_sentence': 'It is known as a principality as it is a monarchy headed by two Co-Princes – the Spanish/Roman Catholic Bishop of Urgell and the President of [MASK].',
 'obj_label': 'France',
 'obj_surface': 'France',
 'obj_uri': 'Q142',
 'predicate_id': 'P1001',
 'sub_label': 'president of the French Republic',
 'sub_surface': 'President',
 'sub_uri': 'Q191954',
 'template': '[X] is a legal term in [Y] .',
 'template_negated': '[X] is not a legal term in [Y] .',
 'type': 'N-M',
 'uuid': '3fe3d4da-9df9-45ba-8109-784ce5fba38a'
}
```
An example instance from the conceptnet config:
```
{
 'masked_sentence': 'One of the things you do when you are alive is [MASK].',
 'negated': '',
 'obj': 'think',
 'obj_label': 'think',
 'pred': 'HasSubevent',
 'sub': 'alive',
 'uuid': 'd4f11631dde8a43beda613ec845ff7d1'
}
```
An example instance from the squad config:
```
{
 'id': '56be4db0acb8001400a502f0_0',
 'masked_sentence': 'To emphasize the 50th anniversary of the Super Bowl the [MASK] color was used.',
 'negated': "['To emphasize the 50th anniversary of the Super Bowl the [MASK] color was not used.']",
 'obj_label': 'gold',
 'sub_label': 'Squad'
}
```
An example instance from the google_re config:
```
{
 'evidences': '[{\'url\': \'http://en.wikipedia.org/wiki/Peter_F._Martin\', \'snippet\': "Peter F. Martin (born 1941) is an American politician who is a Democratic member of the Rhode Island House of Representatives. He has represented the 75th District Newport since 6 January 2009. He is currently serves on the House Committees on Judiciary, Municipal Government, and Veteran\'s Affairs. During his first term of office he served on the House Committees on Small Business and Separation of Powers & Government Oversight. In August 2010, Representative Martin was appointed as a Commissioner on the Atlantic States Marine Fisheries Commission", \'considered_sentences\': [\'Peter F Martin (born 1941) is an American politician who is a Democratic member of the Rhode Island House of Representatives .\']}]',
 'judgments': "[{'rater': '18349444711114572460', 'judgment': 'yes'}, {'rater': '17595829233063766365', 'judgment': 'yes'}, {'rater': '4593294093459651288', 'judgment': 'yes'}, {'rater': '7387074196865291426', 'judgment': 'yes'}, {'rater': '17154471385681223613', 'judgment': 'yes'}]",
 'masked_sentence': 'Peter F Martin (born [MASK]) is an American politician who is a Democratic member of the Rhode Island House of Representatives .',
 'obj': '1941',
 'obj_aliases': '[]',
 'obj_label': '1941',
 'obj_w': 'None',
 'pred': '/people/person/date_of_birth',
 'sub': '/m/09gb0bw',
 'sub_aliases': '[]',
 'sub_label': 'Peter F. Martin',
 'sub_w': 'None',
 'template': '[X] (born [Y]).',
 'template_negated': '[X] (not born [Y]).',
 'uuid': '18af2dac-21d3-4c42-aff5-c247f245e203'
}
```
### Data Fields
The trex config has the following fields:
* uuid: the record id
* obj_uri: a URI for the object slot
* obj_label: a label for the object slot
* sub_uri: a URI for the subject slot
* sub_label: a label for the subject slot
* predicate_id: the predicate/relationship
* sub_surface: the surface text for the subject
* obj_surface: the surface text for the object. This is the word the model should predict for the [MASK] token.
* masked_sentence: the masked sentence used to probe, with the object word replaced with [MASK]
* template: a text pattern expressing the relationship between subject and object, of the form "[X] some text [Y]", where [X] and [Y] are the subject and object slots respectively. template may be missing and replaced with an empty string.
* template_negated: the negated counterpart of template (for example, "[X] is not a legal term in [Y] ."). template_negated may be missing and replaced with an empty string.
* label: the label for the relationship/predicate. label may be missing and replaced with an empty string.
* description: a description of the relationship/predicate. description may be missing and replaced with an empty string.
* type: a type id for the relationship/predicate (for example, "N-M"). type may be missing and replaced with an empty string.
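Because several trex fields use an empty string to signal a missing value, it can help to normalize them before analysis. The following is a minimal sketch under that assumption; `normalize_optional` and `OPTIONAL_TREX_FIELDS` are hypothetical names introduced here for illustration.

```python
from typing import Any, Dict, Mapping, Sequence

# Fields this card describes as possibly missing in the trex config.
OPTIONAL_TREX_FIELDS = ("template", "template_negated", "label", "description", "type")


def normalize_optional(
    record: Mapping[str, Any],
    optional: Sequence[str] = OPTIONAL_TREX_FIELDS,
) -> Dict[str, Any]:
    """Return a copy in which empty or absent optional fields become None."""
    out = dict(record)
    for key in optional:
        # Absent keys are treated like empty strings and also set to None.
        if out.get(key, "") == "":
            out[key] = None
    return out


raw = {
    "uuid": "3fe3d4da-9df9-45ba-8109-784ce5fba38a",
    "predicate_id": "P1001",
    "label": "applies to jurisdiction",
    "template": "",
    "type": "N-M",
}
clean = normalize_optional(raw)
print(clean["template"], clean["label"])
```

Using `None` rather than `""` makes it harder to accidentally probe a model with an empty template.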
The conceptnet config has the following fields:
* uuid: the record id
* sub: the subject. sub may be missing and replaced with an empty string.
* obj: the object to be predicted. obj may be missing and replaced with an empty string.
* pred: the predicate/relationship
* obj_label: the object label
* masked_sentence: the masked sentence used to probe, with the object word replaced with [MASK]
* negated: like masked_sentence, but negated, so that [MASK] should not be filled by the object word. negated may be missing and replaced with an empty string.
The squad config has the following fields:
* id: the record id
* sub_label: the subject label
* obj_label: the object label that is being predicted
* masked_sentence: the masked sentence used to probe, with the object word replaced with [MASK]
* negated: like masked_sentence, but negated, so that [MASK] should not be filled by the object word. In the instance shown above it is stored as a stringified list of sentences. negated may be missing and replaced with an empty string.
The google_re config has the following fields:
* uuid: the record id
* pred: the predicate
* sub: the subject. sub may be missing and replaced with an empty string.
* obj: the object. obj may be missing and replaced with an empty string.
* evidences: a flattened (stringified) list that provides evidence for the predicate. Parse this string to get the 'url', 'snippet', and 'considered_sentences' information.
* judgments: a flattened (stringified) list of rater judgments, each with a 'rater' id and a 'judgment'
* sub_w: undocumented; 'None' in the instance shown above
* sub_label: the label for the subject
* sub_aliases: undocumented; a stringified list in the instance shown above
* obj_w: undocumented; 'None' in the instance shown above
* obj_label: the label for the object
* obj_aliases: undocumented; a stringified list in the instance shown above
* masked_sentence: the masked sentence used to probe, with the object word replaced with [MASK]
* template: a text pattern expressing the relationship between subject and object, of the form "[X] some text [Y]", where [X] and [Y] are the subject and object slots respectively.
* template_negated: the negated counterpart of template (for example, "[X] (not born [Y]).").
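The flattened string fields in the google_re instance use single-quoted Python-style literals, which strict JSON parsers reject. The sketch below tries `json.loads` first and falls back to `ast.literal_eval`; `parse_flattened` is a hypothetical helper, and the `evidences` sample is abbreviated from the instance in this card (the snippet is omitted).

```python
import ast
import json
from typing import Any, List


def parse_flattened(value: str) -> List[Any]:
    """Parse a stringified list stored in fields such as evidences or judgments."""
    if not value:
        return []
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # Single-quoted reprs are not valid JSON; literal_eval evaluates
        # Python literals only, never arbitrary code.
        return ast.literal_eval(value)


judgments = (
    "[{'rater': '18349444711114572460', 'judgment': 'yes'}, "
    "{'rater': '17595829233063766365', 'judgment': 'yes'}]"
)
evidences = (
    "[{'url': 'http://en.wikipedia.org/wiki/Peter_F._Martin', "
    "'considered_sentences': ['Peter F Martin (born 1941) is an American politician "
    "who is a Democratic member of the Rhode Island House of Representatives .']}]"
)

parsed_judgments = parse_flattened(judgments)
parsed_evidences = parse_flattened(evidences)
yes_votes = sum(1 for j in parsed_judgments if j["judgment"] == "yes")
print(yes_votes, parsed_evidences[0]["url"])
```

The same helper applies to `sub_aliases` and `obj_aliases`, which appear as `'[]'` in the instance shown above.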
### Data Splits
There are no data splits.
## Dataset Creation
### Curation Rationale
This dataset was gathered and created to probe what language models understand.
### Source Data
#### Initial Data Collection and Normalization
See the research paper and website for more detail. The dataset was
gathered from various other datasets, with cleanups for probing.
#### Who are the source language producers?
The LAMA authors and the original authors of the various configs.
### Annotations
#### Annotation process
Human annotations from the original datasets (ConceptNet), plus various machine annotations.
#### Who are the annotators?
Human annotators and automated (machine) annotation.
### Personal and Sensitive Information
Unknown, but the data likely includes names of famous people.
## Considerations for Using the Data
### Social Impact of Dataset
The goal for the work is to probe the understanding of language models.
### Discussion of Biases
Since the data comes from human annotators, it is likely to contain biases.
[More Information Needed]
### Other Known Limitations
The original documentation for the data fields is limited.
## Additional Information
### Dataset Curators
The authors of LAMA at Facebook and the authors of the original datasets.
### Licensing Information
The Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See https://github.com/facebookresearch/LAMA/blob/master/LICENSE
### Citation Information
@inproceedings{petroni2019language,
title={Language Models as Knowledge Bases?},
author={F. Petroni and T. Rockt{\"{a}}schel and A. H. Miller and P. Lewis and A. Bakhtin and Y. Wu and S. Riedel},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2019}
}
@inproceedings{petroni2020how,
title={How Context Affects Language Models' Factual Predictions},
author={Fabio Petroni and Patrick Lewis and Aleksandra Piktus and Tim Rockt{\"a}schel and Yuxiang Wu and Alexander H. Miller and Sebastian Riedel},
booktitle={Automated Knowledge Base Construction},
year={2020},
url={https://openreview.net/forum?id=025X0zPfn}
}
### Contributions
Thanks to [@ontocord](https://github.com/ontocord) for adding this dataset.
Provider: maas
Created: 2025-05-20



