
nikhilchigali/wikianswers_embeddings_512

Source: Hugging Face. Updated 2024-03-30; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/nikhilchigali/wikianswers_embeddings_512
Resource description:
---
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: cluster
    dtype: int64
  - name: embedding_512
    sequence: float32
  splits:
  - name: train
    num_bytes: 2091790625
    num_examples: 990526
  download_size: 2669870589
  dataset_size: 2091790625
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- sentence-similarity
language:
- en
size_categories:
- 100K<n<1M
---

# Dataset Card for "wikianswers_embeddings_512"

## Dataset Summary

`nikhilchigali/wikianswers_embeddings_512` is a subset of the `embedding-data/WikiAnswers` dataset ([Link](https://huggingface.co/datasets/embedding-data/WikiAnswers)). Whereas the original dataset has 3,386,256 rows, this dataset contains only 0.13% of the total rows (sets). The sets of sentences have been unraveled into individual items, each carrying a cluster ID that identifies the sentences from the same set. Each sentence has an associated cluster ID and an embedding of dimension 512.

## Languages

English.

## Dataset Structure

Each example in the dataset contains a sentence, the ID of the cluster of equivalent sentences it belongs to, and its embedding. Sentences in the same cluster are paraphrases of each other. The embeddings were created with the `distiluse-base-multilingual-cased-v1` model.

```
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
...
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
```

### Usage Example

Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:

```python
from datasets import load_dataset

dataset = load_dataset("nikhilchigali/wikianswers_embeddings_512")
```

The dataset is loaded as a `DatasetDict` and, for N examples, has the following format:

```python
DatasetDict({
    train: Dataset({
        features: ['sentence', 'cluster', 'embedding_512'],
        num_rows: N
    })
})
```

Inspect example `i` with:

```python
dataset["train"][i]
```

## Source Data

`embedding-data/WikiAnswers` on Hugging Face ([Link](https://huggingface.co/datasets/embedding-data/WikiAnswers))

### Note

This dataset is for the owner's personal use and claims no rights whatsoever.
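The card states that sentences sharing a cluster ID are paraphrases of each other. As a minimal sketch of how the unraveled rows could be regrouped into paraphrase sets, the snippet below collects sentences by their `cluster` value. The helper name `group_by_cluster` and the toy rows are hypothetical (the sentences are made up and the embeddings are shortened to three values); only the three field names come from the card.

```python
from collections import defaultdict


def group_by_cluster(rows):
    """Collect sentences that share a cluster ID into one list per cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row["sentence"])
    return dict(groups)


# Toy rows in the card's schema. The sentences are invented for illustration
# and the embeddings are truncated to 3 values instead of 512.
toy_rows = [
    {"sentence": "How tall is Mount Everest?", "cluster": 0, "embedding_512": [0.1, 0.2, 0.3]},
    {"sentence": "Who wrote Hamlet?", "cluster": 1, "embedding_512": [0.9, 0.1, 0.0]},
    {"sentence": "What is the height of Mount Everest?", "cluster": 0, "embedding_512": [0.1, 0.2, 0.4]},
]

paraphrase_sets = group_by_cluster(toy_rows)
```

With the real dataset, passing `dataset["train"]` instead of `toy_rows` should work the same way, since iterating a `Dataset` yields one dictionary per row, though a full pass over roughly a million rows will take a while.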
Provider:
nikhilchigali
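Summary of original information

Dataset overview

Dataset name

nikhilchigali/wikianswers_embeddings_512

Dataset source

This dataset is a subset of embedding-data/WikiAnswers. The original dataset contains 3,386,256 rows; this dataset contains only 0.13% of them.

Dataset structure

  • Features:
    • sentence: string
    • cluster: integer (int64)
    • embedding_512: sequence of floats (float32), dimension 512
  • Splits:
    • train: 990,526 examples, 2,091,790,625 bytes in total
  • Sizes:
    • Download size: 2,669,870,589 bytes
    • Dataset size: 2,091,790,625 bytes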
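The figures above can be sanity-checked with a little arithmetic. The sketch below is a rough estimate only: it assumes each embedding is stored as 512 four-byte float32 values and each cluster ID as one eight-byte int64, and it ignores any per-row offsets or other storage overhead, so the "remaining" figure is an upper bound on the sentence text rather than an exact count.

```python
# Figures taken from the dataset metadata.
NUM_ROWS = 990_526
DATASET_SIZE_BYTES = 2_091_790_625

# Assumed storage widths: float32 = 4 bytes, int64 = 8 bytes.
EMBEDDING_DIM = 512
FLOAT32_BYTES = 4
INT64_BYTES = 8

# Raw payload of the embedding column and the cluster column.
embedding_bytes = NUM_ROWS * EMBEDDING_DIM * FLOAT32_BYTES
cluster_bytes = NUM_ROWS * INT64_BYTES

# Whatever is left covers the sentence strings plus storage overhead.
remaining_bytes = DATASET_SIZE_BYTES - embedding_bytes - cluster_bytes
embedding_share = embedding_bytes / DATASET_SIZE_BYTES

print(f"embeddings: {embedding_bytes:,} bytes ({embedding_share:.1%} of the dataset)")
print(f"cluster IDs: {cluster_bytes:,} bytes")
print(f"sentences and overhead: {remaining_bytes:,} bytes")
```

Under these assumptions the embedding column accounts for about 2.03 GB, which is nearly all of the reported dataset size; the sentences themselves are a small fraction.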

Language

English

Task category

Sentence similarity

Dataset usage example

```python
from datasets import load_dataset

dataset = load_dataset("nikhilchigali/wikianswers_embeddings_512")
```

Loaded dataset format

```python
DatasetDict({
    train: Dataset({
        features: ['sentence', 'cluster', 'embedding_512'],
        num_rows: N
    })
})
```
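Because the listed task category is sentence similarity, a natural next step after loading is to compare two `embedding_512` vectors. The following is a minimal sketch using cosine similarity in plain Python; the choice of cosine similarity is an assumption (the card does not prescribe a metric), and the function name and three-value toy vectors are hypothetical stand-ins for real 512-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for 512-dimensional embeddings.
toy_a = [1.0, 0.0, 1.0]
toy_b = [1.0, 0.0, 1.0]
toy_c = [0.0, 1.0, 0.0]

same = cosine_similarity(toy_a, toy_b)        # identical direction
orthogonal = cosine_similarity(toy_a, toy_c)  # no shared components
```

On the loaded dataset, the same function could be called as `cosine_similarity(dataset["train"][i]["embedding_512"], dataset["train"][j]["embedding_512"])`; for comparisons across many rows a vectorized library would likely be a better fit than this pure-Python loop.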
