aisuko/quora_questions
Source: Hugging Face. Updated 2024-02-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/aisuko/quora_questions
Resource description:
---
license: apache-2.0
---
# Overview
Originally from the sentence-transformers library.
For research purposes only.
Adapted by Aisuko.
# Installation
```python
!pip install sentence-transformers==2.3.1
```
# Computing Embeddings for a large set of sentences
```python
import os
import csv
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import http_get

if __name__ == '__main__':
    url = 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
    dataset_path = 'quora_duplicate_questions.tsv'
    # max_corpus_size = 50000  # optional cap on the number of sentences

    # Download the dataset on first use
    if not os.path.exists(dataset_path):
        http_get(url, dataset_path)

    # Collect all unique sentences from the file
    corpus_sentences = set()
    with open(dataset_path, encoding='utf8') as fIn:
        reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            corpus_sentences.add(row['question1'])
            corpus_sentences.add(row['question2'])
            # if len(corpus_sentences) >= max_corpus_size:
            #     break
    corpus_sentences = list(corpus_sentences)

    model = SentenceTransformer('all-MiniLM-L6-v2').to('cuda')
    model.max_seq_length = 256

    # Start one worker process per available CUDA device
    pool = model.start_multi_process_pool()

    # Compute the embeddings using the multi-process pool
    emb = model.encode_multi_process(corpus_sentences, pool, batch_size=128, chunk_size=1024, normalize_embeddings=True)
    print('Embeddings computed. Shape:', emb.shape)

    # Optional: stop the processes in the pool
    model.stop_multi_process_pool(pool)
```
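One caveat with the script above: a `set` of strings has no stable order between runs, and the CSV saved in the next section stores only the vectors, so rows cannot be mapped back to questions afterwards. A minimal sketch of one way around this, assuming you want row `i` of the embeddings to correspond to a known sentence: deduplicate in first-seen order with `dict.fromkeys` and write the sentences to a sidecar file. The helper names `unique_in_order` and `save_sentences`, the file name `quora_sentences.tsv`, and the toy rows are hypothetical and not part of the original script.

```python
import csv
import os
import tempfile


def unique_in_order(pairs):
    """Return the unique questions from (question1, question2) pairs, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(q for pair in pairs for q in pair))


def save_sentences(sentences, path):
    """Write one sentence per row so that row i matches embedding row i."""
    with open(path, "w", encoding="utf8", newline="") as f_out:
        writer = csv.writer(f_out, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
        writer.writerow(["sentence"])
        writer.writerows([s] for s in sentences)


# Toy stand-in for the (question1, question2) rows of the Quora file
rows = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("How do I learn Python?", "Is Rust hard to learn?"),
]
sentences = unique_in_order(rows)

# Hypothetical sidecar file, written to a temporary directory for this demo
out_path = os.path.join(tempfile.gettempdir(), "quora_sentences.tsv")
save_sentences(sentences, out_path)
print(len(sentences), "unique sentences written to", out_path)
```

In the script above, the same idea would mean replacing the `set` with an order-preserving container and passing the resulting list to `encode_multi_process`.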
# Save the CSV file
```python
import pandas as pd
corpus_embedding=pd.DataFrame(emb)
corpus_embedding.to_csv('quora_questions.csv',index=False)
```
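To use the saved file later, the CSV can be read back into a NumPy array. The following is a minimal sketch, not part of the original card: it uses a small random array as a stand-in for `emb` and a temporary file name (`toy_embeddings.csv`, hypothetical) so that it runs without a GPU or the full dataset. Because the embeddings were computed with `normalize_embeddings=True`, each row has unit length, so a plain dot product gives the cosine similarity.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for `emb`: 3 unit-length vectors of dimension 4
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Save the same way as above, but to a temporary path
path = os.path.join(tempfile.gettempdir(), "toy_embeddings.csv")
pd.DataFrame(emb).to_csv(path, index=False)

# Load: the header row holds the column indices 0..dim-1
loaded = pd.read_csv(path).to_numpy(dtype=np.float32)

# Rows are unit length, so the dot product equals the cosine similarity
scores = loaded @ loaded[0]
best = int(np.argmax(scores))
print("Loaded shape:", loaded.shape, "| best match for row 0:", best)
```

For the full Quora corpus, a text CSV of float vectors is large and slow to parse; a binary format such as NumPy's `.npy` may be a better fit if the file is only read by Python code.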
Provider:
aisuko



