embedding-data/SPECTER
Source: Hugging Face | Updated: 2022-08-02 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/embedding-data/SPECTER
Resource description:
---
license: mit
language:
- en
paperswithcode_id: embedding-data/SPECTER
pretty_name: SPECTER
task_categories:
- sentence-similarity
- paraphrase-mining
task_ids:
- semantic-similarity-classification
---
# Dataset Card for "SPECTER"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/allenai/specter](https://github.com/allenai/specter)
- **Repository:** [More Information Needed](https://github.com/allenai/specter/blob/master/README.md)
- **Paper:** [More Information Needed](https://arxiv.org/pdf/2004.07180.pdf)
- **Point of Contact:** [@armancohan](https://github.com/armancohan), [@sergeyf](https://github.com/sergeyf), [@haroldrubio](https://github.com/haroldrubio), [@jinamshah](https://github.com/jinamshah)
### Dataset Summary
The dataset consists of triplets of three sentences each: an anchor, a positive, and a negative. It contains titles of papers.
Disclaimer: The team releasing SPECTER did not upload the dataset to the Hub and did not write a dataset card.
These steps were done by the Hugging Face team.
## Dataset Structure
Each example is a dictionary with a single key, "set", whose value is a list of three sentences (anchor, positive, and negative):
```
{"set": [anchor, positive, negative]}
{"set": [anchor, positive, negative]}
...
{"set": [anchor, positive, negative]}
```
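As a minimal sketch of how one such record could be read, the snippet below parses a single JSON object and unpacks it into named fields. It assumes the records are serialized one JSON object per line, as the listing above suggests; the `Triplet` type, the `parse_triplet` helper, and the sample titles are hypothetical names introduced only for illustration.

```python
import json
from typing import NamedTuple


class Triplet(NamedTuple):
    """Named view of one record (hypothetical helper type, not part of the dataset)."""
    anchor: str
    positive: str
    negative: str


def parse_triplet(line: str) -> Triplet:
    """Parse one JSON record of the form {"set": [anchor, positive, negative]}."""
    record = json.loads(line)
    sentences = record["set"]
    if len(sentences) != 3:
        raise ValueError(f"expected 3 sentences, got {len(sentences)}")
    return Triplet(*sentences)


# Hypothetical record: the titles are placeholders, not rows from the dataset.
sample_line = '{"set": ["anchor title", "related title", "unrelated title"]}'
triplet = parse_triplet(sample_line)
print(triplet.anchor, "|", triplet.positive, "|", triplet.negative)
```

Unpacking into named fields avoids relying on positional indices such as `example["set"][2]` later in a pipeline.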
This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using triplets.
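To illustrate why the anchor/positive/negative layout suits this kind of training, here is a rough sketch of a triplet margin objective computed on toy vectors. This card does not say which loss or distance is used, so the Euclidean distance, the margin of `1.0`, the function names, and the 2-D vectors standing in for sentence embeddings are all assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, chosen only for this sketch
) -> float:
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D vectors standing in for sentence embeddings (hypothetical values).
anchor_vec = [0.0, 0.0]
positive_vec = [0.0, 1.0]   # close to the anchor
negative_vec = [3.0, 4.0]   # far from the anchor

loss = triplet_margin_loss(anchor_vec, positive_vec, negative_vec)
swapped = triplet_margin_loss(anchor_vec, negative_vec, positive_vec)
print(loss, swapped)
```

With these toy values the well-separated triplet incurs no loss, while swapping the positive and negative roles is penalized. An objective of this shape pulls the positive toward the anchor and pushes the negative away.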
### Usage Example
Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:
```python
from datasets import load_dataset
dataset = load_dataset("embedding-data/SPECTER")
```
The dataset is loaded as a `DatasetDict` and has the format:
```python
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 684100
    })
})
```
Review an example `i` with:
```python
dataset["train"][i]["set"]
```
### Curation Rationale
[More Information Needed](https://github.com/allenai/specter)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/allenai/specter)
#### Who are the source language producers?
[More Information Needed](https://github.com/allenai/specter)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/allenai/specter)
#### Who are the annotators?
[More Information Needed](https://github.com/allenai/specter)
### Personal and Sensitive Information
[More Information Needed](https://github.com/allenai/specter)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/allenai/specter)
### Discussion of Biases
[More Information Needed](https://github.com/allenai/specter)
### Other Known Limitations
[More Information Needed](https://github.com/allenai/specter)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/allenai/specter)
### Licensing Information
[More Information Needed](https://github.com/allenai/specter)
### Citation Information
### Contributions
Provider: embedding-data
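## Summary of Original Information
### Dataset Overview: SPECTER
#### Dataset Summary
- Content: triplets of three sentences (anchor, positive, and negative); contains paper titles.
- Format: each example is a dictionary whose key is "set" and whose value is a list of three sentences.
#### Supported Tasks and Leaderboards
- Task categories: sentence similarity, paraphrase mining
- Task ID: semantic similarity classification
#### Languages
- Supported language: English
#### Data Instances
- Structure: each data instance is a dictionary with the key "set" and a value that is a list of three sentences.
- Example: `{"set": [anchor, positive, negative]}`, repeated once per record.
#### Data Fields
- Field: "set", a list of three sentences (anchor, positive, negative).
#### Data Splits
- The dataset loads as a `DatasetDict` with a single `train` split, with features `['set']` and `num_rows: 684100`.
#### Usage Example
- Load the dataset: `from datasets import load_dataset`, then `dataset = load_dataset("embedding-data/SPECTER")`.
- View an example: `dataset["train"][i]["set"]`.
#### Licensing Information
- License: MIT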
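## Collected Summary
### Background Overview
SPECTER is a dataset of English sentence triplets for training Sentence Transformers models. Each triplet contains an anchor, a positive, and a negative sentence, and the dataset is mainly used for sentence similarity and paraphrase mining tasks. It is stored in dictionary format, and the training set contains 684,100 triplet examples.
The above content was collected and summarized by 遇见数据集.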



