embedding-data/SPECTER

Name: embedding-data/SPECTER
Creator: embedding-data
Published: 2022-08-02 03:45:52
License: MIT

Source: Hugging Face. Updated: 2022-08-02. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/embedding-data/SPECTER

Resource description:

---
license: mit
language:
- en
paperswithcode_id: embedding-data/SPECTER
pretty_name: SPECTER
task_categories:
- sentence-similarity
- paraphrase-mining
task_ids:
- semantic-similarity-classification
---

# Dataset Card for "SPECTER"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/allenai/specter](https://github.com/allenai/specter)
- **Repository:** [More Information Needed](https://github.com/allenai/specter/blob/master/README.md)
- **Paper:** [More Information Needed](https://arxiv.org/pdf/2004.07180.pdf)
- **Point of Contact:** [@armancohan](https://github.com/armancohan), [@sergeyf](https://github.com/sergeyf), [@haroldrubio](https://github.com/haroldrubio), [@jinamshah](https://github.com/jinamshah)

### Dataset Summary

The dataset contains triplets of three sentences: an anchor, a positive, and a negative. The sentences are titles of papers.

Disclaimer: The team releasing SPECTER did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

## Dataset Structure

Each example is a dictionary with a single key, "set", whose value is a list of three sentences (anchor, positive, and negative):

```
{"set": [anchor, positive, negative]}
{"set": [anchor, positive, negative]}
...
{"set": [anchor, positive, negative]}
```

This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using triplets.

### Usage Example

Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:

```python
from datasets import load_dataset

dataset = load_dataset("embedding-data/SPECTER")
```

The dataset is loaded as a `DatasetDict` and has the format:

```python
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 684100
    })
})
```

Review an example `i` with:

```python
dataset["train"][i]["set"]
```

### Curation Rationale

[More Information Needed](https://github.com/allenai/specter)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/allenai/specter)

#### Who are the source language producers?

[More Information Needed](https://github.com/allenai/specter)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/allenai/specter)

#### Who are the annotators?

[More Information Needed](https://github.com/allenai/specter)

### Personal and Sensitive Information

[More Information Needed](https://github.com/allenai/specter)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/allenai/specter)

### Discussion of Biases

[More Information Needed](https://github.com/allenai/specter)

### Other Known Limitations

[More Information Needed](https://github.com/allenai/specter)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/allenai/specter)

### Licensing Information

[More Information Needed](https://github.com/allenai/specter)

### Citation Information

### Contributions

Provider:

embedding-data

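Summary of original information

Dataset overview: SPECTER

Dataset description

Dataset summary

Content: Triplets of three sentences (anchor, positive, and negative), where the sentences are titles of papers.
Format: Each example is a dictionary whose key is "set" and whose value is a list of three sentences.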

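Supported tasks and leaderboards

Task categories: sentence similarity, paraphrase mining
Task ID: semantic similarity classification

Languages

Supported language: English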

Dataset structure

Data instances

Structure: Each data instance is a dictionary with the key "set", whose value is a list of three sentences.
Example:

{"set": [anchor, positive, negative]}
{"set": [anchor, positive, negative]}
...
{"set": [anchor, positive, negative]}

Data fields

Field: "set", a list of three sentences (anchor, positive, negative).
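
As a minimal sketch of working with this field, the following Python unpacks records of the form shown above into named parts. It assumes the records are available as JSON Lines text (one object per line), which the example display suggests but the card does not state. The `Triplet` type, the `parse_triplets` helper, and the sample sentences are hypothetical and are not part of the dataset or of any library.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical container for the three sentences of one record."""
    anchor: str
    positive: str
    negative: str


def parse_triplets(lines: Iterable[str]) -> Iterator[Triplet]:
    """Yield one Triplet per non-empty JSON line shaped like {"set": [a, p, n]}."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sentences = json.loads(line)["set"]
        if len(sentences) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 sentences, got {len(sentences)}"
            )
        yield Triplet(*sentences)


# Made-up sample records for illustration only; not taken from the dataset.
sample_lines = [
    '{"set": ["anchor title", "related title", "unrelated title"]}',
    '{"set": ["another anchor", "another related", "another unrelated"]}',
]

triplets = list(parse_triplets(sample_lines))
print(triplets[0].anchor)
```

Named fields avoid relying on positional indices such as `example["set"][2]` when passing triplets on to training code.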

Data splits

Example:

DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 684100
    })
})

Usage example

Load the dataset:

from datasets import load_dataset
dataset = load_dataset("embedding-data/SPECTER")

View example i:

dataset["train"][i]["set"]

License information

License: MIT

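Collected summary

Background and challenges

Background overview

SPECTER is an English dataset of sentence triplets for training Sentence Transformers models. Each triplet contains an anchor, a positive, and a negative sentence, and the dataset is mainly used for sentence similarity and paraphrase mining tasks. It is stored in dictionary format, and the training split contains 684,100 triplet examples.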
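
To illustrate how anchor/positive/negative triplets are commonly used in training, here is a minimal sketch of a triplet margin loss in plain Python. It is not the SPECTER or Sentence Transformers implementation. The function names, the choice of Euclidean distance, the margin of 1.0, and the toy 2-D vectors standing in for sentence embeddings are all assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # illustrative value, not taken from the source
) -> float:
    """Return max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = euclidean_distance(anchor, positive) - euclidean_distance(anchor, negative)
    return max(gap + margin, 0.0)


# Toy vectors standing in for sentence embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# The positive is already closer than the negative by more than the margin.
satisfied = triplet_margin_loss(anchor, positive, negative)

# Swapping the roles makes the "positive" farther away, so the loss is positive.
violated = triplet_margin_loss(anchor, negative, positive)

print(satisfied, violated)
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, so only triplets that violate this ordering contribute to training.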

The content above was collected and summarized by 遇见数据集.
