
embedding-data/SPECTER

Source: Hugging Face. Updated: 2022-08-02. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/embedding-data/SPECTER
Resource description:
---
license: mit
language:
- en
paperswithcode_id: embedding-data/SPECTER
pretty_name: SPECTER
task_categories:
- sentence-similarity
- paraphrase-mining
task_ids:
- semantic-similarity-classification
---

# Dataset Card for "SPECTER"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/allenai/specter](https://github.com/allenai/specter)
- **Repository:** [More Information Needed](https://github.com/allenai/specter/blob/master/README.md)
- **Paper:** [More Information Needed](https://arxiv.org/pdf/2004.07180.pdf)
- **Point of Contact:** [@armancohan](https://github.com/armancohan), [@sergeyf](https://github.com/sergeyf), [@haroldrubio](https://github.com/haroldrubio), [@jinamshah](https://github.com/jinamshah)

### Dataset Summary

The dataset contains triplets of three sentences: anchor, positive, and negative. The sentences are titles of papers.

Disclaimer: The team releasing SPECTER did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

## Dataset Structure

Each example is a dictionary with a single key, "set", whose value is a list of three sentences (anchor, positive, and negative):

```
{"set": [anchor, positive, negative]}
{"set": [anchor, positive, negative]}
...
{"set": [anchor, positive, negative]}
```

This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using triplets.

### Usage Example

Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:

```python
from datasets import load_dataset
dataset = load_dataset("embedding-data/SPECTER")
```

The dataset is loaded as a `DatasetDict` and has the format:

```python
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 684100
    })
})
```

Review an example `i` with:

```python
dataset["train"][i]["set"]
```

### Curation Rationale

[More Information Needed](https://github.com/allenai/specter)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/allenai/specter)

#### Who are the source language producers?

[More Information Needed](https://github.com/allenai/specter)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/allenai/specter)

#### Who are the annotators?

[More Information Needed](https://github.com/allenai/specter)

### Personal and Sensitive Information

[More Information Needed](https://github.com/allenai/specter)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/allenai/specter)

### Discussion of Biases

[More Information Needed](https://github.com/allenai/specter)

### Other Known Limitations

[More Information Needed](https://github.com/allenai/specter)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/allenai/specter)

### Licensing Information

[More Information Needed](https://github.com/allenai/specter)

### Citation Information

### Contributions
Provided by:
embedding-data
Summary of original information

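Dataset overview: SPECTER

Dataset description

Dataset summary

  • Content: triplets of three sentences (anchor, positive, and negative); the sentences are titles of papers.
  • Format: each example is a dictionary with the key "set" whose value is a list of three sentences.

Supported tasks and leaderboards

  • Task categories: sentence similarity, paraphrase mining
  • Task ID: semantic similarity classification

Languages

  • Supported language: English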

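Dataset structure

Data instances

  • Structure: each data instance is a dictionary with the key "set" whose value is a list of three sentences.

  • Example:

    {"set": [anchor, positive, negative]}
    {"set": [anchor, positive, negative]}
    ...
    {"set": [anchor, positive, negative]}

Data fields

  • Field: "set", a list of three sentences (anchor, positive, negative).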
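The field layout above can be checked and given named accessors with a small helper. This is a minimal sketch, not part of the dataset or its tooling: the `Triplet` class and `parse_triplet` function are hypothetical names, and the sample record uses made-up placeholder text.

```python
from typing import Any, Mapping, NamedTuple


class Triplet(NamedTuple):
    """Named view of one record (hypothetical helper type)."""
    anchor: str
    positive: str
    negative: str


def parse_triplet(example: Mapping[str, Any]) -> Triplet:
    """Convert {"set": [anchor, positive, negative]} into a Triplet."""
    sentences = example["set"]
    if len(sentences) != 3:
        raise ValueError(f"expected 3 sentences, got {len(sentences)}")
    if not all(isinstance(s, str) for s in sentences):
        raise TypeError("every entry in 'set' must be a string")
    anchor, positive, negative = sentences
    return Triplet(anchor, positive, negative)


# Placeholder record with invented text, not taken from the dataset.
record = {"set": ["anchor title", "related title", "unrelated title"]}
triplet = parse_triplet(record)
print(triplet.anchor, "|", triplet.positive, "|", triplet.negative)
```

Accessing `triplet.anchor` rather than `example["set"][0]` makes the role of each position explicit and fails early on malformed rows.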

Data splits

  • Example:

    DatasetDict({
        train: Dataset({
            features: ['set'],
            num_rows: 684100
        })
    })

Usage example

  • Load the dataset:

    from datasets import load_dataset
    dataset = load_dataset("embedding-data/SPECTER")

  • View example i:

    dataset["train"][i]["set"]
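Building on the loading and indexing calls above, the following sketch splits the first few rows into three parallel lists. The function names `load_specter_train` and `triplet_columns` are hypothetical; loading needs network access and the `datasets` package, so the demonstration at the bottom runs on placeholder rows of the same shape instead.

```python
from itertools import islice
from typing import Iterable, List, Mapping, Tuple


def load_specter_train():
    """Load the train split from the Hub (needs network and `pip install datasets`)."""
    # Imported lazily so the helper below can be used without the library.
    from datasets import load_dataset
    return load_dataset("embedding-data/SPECTER", split="train")


def triplet_columns(
    examples: Iterable[Mapping[str, List[str]]], limit: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split the first `limit` rows into anchor, positive, and negative lists."""
    anchors, positives, negatives = [], [], []
    for example in islice(examples, limit):
        anchor, positive, negative = example["set"]
        anchors.append(anchor)
        positives.append(positive)
        negatives.append(negative)
    return anchors, positives, negatives


# Offline demonstration with placeholder rows shaped like the dataset's rows.
rows = [
    {"set": ["a1", "p1", "n1"]},
    {"set": ["a2", "p2", "n2"]},
    {"set": ["a3", "p3", "n3"]},
]
anchors, positives, negatives = triplet_columns(rows, limit=2)
print(anchors, positives, negatives)
```

With network access, `triplet_columns(load_specter_train(), limit=5)` should work the same way, since iterating a `Dataset` yields one dictionary per row.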

License information

  • License: MIT
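Collected summary
Background and challenges
Background overview
SPECTER is a dataset of English sentence triplets for training Sentence Transformers models. Each triplet contains an anchor, a positive, and a negative sentence, and the dataset mainly targets sentence-similarity and paraphrase-mining tasks. Examples are stored as dictionaries, and the train split contains 684,100 triplets.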
The summary above was collected and generated by 遇见数据集.
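The source does not state which training objective is used with these triplets. One common choice for anchor/positive/negative data is a triplet margin loss, which pushes the anchor closer to the positive than to the negative by at least a margin. The sketch below is an assumption for illustration only: it works on toy 2-D vectors standing in for sentence embeddings, and the margin value of 1.0 is arbitrary.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(gap, 0.0)


# Toy vectors standing in for embeddings of the three sentences.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # distance 5 from the anchor
negative = [6.0, 8.0]   # distance 10 from the anchor

# The positive is already closer than the negative by more than the margin.
loss = triplet_loss(anchor, positive, negative)
print(loss)
```

The loss is zero when the triplet is already well separated and grows when the negative sits closer to the anchor than the positive does.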