nikhilchigali/wikianswers_embeddings_512
Source: Hugging Face. Updated 2024-03-30; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/nikhilchigali/wikianswers_embeddings_512
Resource overview:
---
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: cluster
    dtype: int64
  - name: embedding_512
    sequence: float32
  splits:
  - name: train
    num_bytes: 2091790625
    num_examples: 990526
  download_size: 2669870589
  dataset_size: 2091790625
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- sentence-similarity
language:
- en
size_categories:
- 100K<n<1M
---
# Dataset Card for "wikianswers_embeddings_512"
## Dataset Summary
`nikhilchigali/wikianswers_embeddings_512` is a subset of the `embedding-data/WikiAnswers` dataset ([Link](https://huggingface.co/datasets/embedding-data/WikiAnswers)).
Whereas the original dataset has 3,386,256 rows, this dataset contains only 0.13% of those rows (sets). The sets of sentences have been flattened into individual items. Each sentence is stored with a cluster ID, which identifies sentences from the same set, and an embedding of dimension 512.
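As a rough sanity check on the size figures in the card metadata, the arithmetic below compares the average row size with the raw payload of one 512-dimensional `float32` embedding. It is only a sketch: the constants are copied from the metadata, and the split of the remaining bytes between the sentence text, the cluster ID, and storage overhead is not stated in the card.

```python
# Figures copied from the dataset card metadata.
NUM_BYTES = 2_091_790_625   # dataset_size, in bytes
NUM_EXAMPLES = 990_526      # rows in the train split
EMBEDDING_DIM = 512         # length of each embedding_512 vector
FLOAT32_BYTES = 4           # size of one float32 value

# Raw payload of one embedding vector.
embedding_bytes = EMBEDDING_DIM * FLOAT32_BYTES

# Average size of one row, and what is left once the embedding is subtracted.
bytes_per_example = NUM_BYTES / NUM_EXAMPLES
other_bytes = bytes_per_example - embedding_bytes

print(f"embedding payload per row: {embedding_bytes} bytes")
print(f"average row size: {bytes_per_example:.1f} bytes")
print(f"remaining (sentence, cluster ID, overhead): {other_bytes:.1f} bytes")
```

The embedding accounts for most of each row, so dropping the `embedding_512` column is the main way to shrink the in-memory footprint when only the text and cluster IDs are needed.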
## Languages
English.
## Dataset Structure
Each example contains a sentence, the ID of the cluster it belongs to, and a 512-dimensional embedding. Sentences that share a cluster ID are paraphrases of each other. The embeddings were created with the `distiluse-base-multilingual-cased-v1` model. Each record has the form:
```
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
...
{"sentence": [sentence], "cluster": [cluster_id], "embedding_512": [embeddings]}
```
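Because the card lists sentence similarity as the task, one natural use of the `embedding_512` field is to compare two rows by cosine similarity. The card does not say which metric is intended, so treat the following as a minimal sketch under that assumption. The helper `cosine_similarity` and the toy 3-dimensional rows are hypothetical stand-ins for real 512-dimensional records.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy rows: real rows carry 512 floats in "embedding_512".
row_a = {"sentence": "example sentence a", "cluster": 0, "embedding_512": [0.1, 0.3, 0.5]}
row_b = {"sentence": "example sentence b", "cluster": 0, "embedding_512": [0.2, 0.6, 1.0]}

similarity = cosine_similarity(row_a["embedding_512"], row_b["embedding_512"])
print(f"cosine similarity: {similarity:.3f}")
```

For more than a handful of comparisons, a vectorized library such as NumPy would be a better fit than this pure-Python loop.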
### Usage Example
Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:
```python
from datasets import load_dataset
dataset = load_dataset("nikhilchigali/wikianswers_embeddings_512")
```
The dataset loads as a `DatasetDict`; for N examples it has the form:
```python
DatasetDict({
    train: Dataset({
        features: ['sentence', 'cluster', 'embedding_512'],
        num_rows: N
    })
})
```
Inspect example `i` with:
```python
dataset["train"][i]
```
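Since the original paraphrase sets were flattened into one row per sentence, the sets can be rebuilt by grouping rows on the `cluster` field. The sketch below assumes only that each row behaves like a dict with `sentence` and `cluster` keys, as the structure above describes. The helper name `group_by_cluster` and the toy rows are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(rows):
    """Map each cluster ID to the list of sentences that carry it."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["cluster"]].append(row["sentence"])
    return dict(clusters)


# Hypothetical toy rows standing in for dataset["train"]; embeddings omitted.
toy_rows = [
    {"sentence": "how tall is mount everest", "cluster": 0},
    {"sentence": "what is the height of mount everest", "cluster": 0},
    {"sentence": "who wrote hamlet", "cluster": 1},
]

groups = group_by_cluster(toy_rows)
for cluster_id, sentences in groups.items():
    print(cluster_id, sentences)
```

On the full dataset, passing `dataset["train"]` as `rows` should work the same way. Removing the `embedding_512` column first, for example with `remove_columns`, would likely make the pass cheaper because the vectors are not needed for grouping.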
## Source Data
`embedding-data/WikiAnswers` on Hugging Face ([Link](https://huggingface.co/datasets/embedding-data/WikiAnswers))
### Note: This dataset is for the owner's personal use and claims no rights whatsoever.
Provider:
nikhilchigali
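Summary of original information
Dataset overview
Dataset name
nikhilchigali/wikianswers_embeddings_512
Dataset source
This dataset is a subset of embedding-data/WikiAnswers. The original dataset contains 3,386,256 rows; this dataset contains only 0.13% of them.
Dataset structure
- Features:
  - sentence: string
  - cluster: integer
  - embedding_512: sequence of floats, dimension 512
- Splits:
  - train: 990526 examples, total size 2091790625 bytes
- Dataset size:
  - Download size: 2669870589 bytes
  - Dataset size: 2091790625 bytes
Language
English
Task category
Sentence similarity
Dataset usage example
Run `from datasets import load_dataset`, then `dataset = load_dataset("nikhilchigali/wikianswers_embeddings_512")`.
Dataset load format
A `DatasetDict` with a single `train` split, whose `Dataset` has the features `sentence`, `cluster`, and `embedding_512` and `num_rows: N`.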



