
aisuko/quora_questions

Hugging Face · Updated 2024-02-12 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/aisuko/quora_questions
Resource description:
---
license: apache-2.0
---

# Overview

Originally from the sentence-transformers library. For research purposes only. Adapted by Aisuko.

# Installation

```python
# Notebook cell syntax; drop the leading "!" when running in a shell
!pip install sentence-transformers==2.3.1
```

# Computing Embeddings for a large set of sentences

```python
import os
import csv

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import http_get

if __name__ == '__main__':
    url = 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
    dataset_path = 'quora_duplicate_questions.tsv'
    # max_corpus_size = 50000  # max number of sentences to deal with

    # download the dataset once; later runs reuse the local copy
    if not os.path.exists(dataset_path):
        http_get(url, dataset_path)

    # get all unique sentences from the file
    corpus_sentences = set()
    with open(dataset_path, encoding='utf8') as fIn:
        reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            corpus_sentences.add(row['question1'])
            corpus_sentences.add(row['question2'])
            # if len(corpus_sentences) >= max_corpus_size:
            #     break

    corpus_sentences = list(corpus_sentences)

    # load the model onto the GPU and cap the input length at 256 tokens
    model = SentenceTransformer('all-MiniLM-L6-v2').to('cuda')
    model.max_seq_length = 256

    # start one worker process per available device
    pool = model.start_multi_process_pool()

    # computing the embeddings using the multi-process pool
    emb = model.encode_multi_process(
        corpus_sentences,
        pool,
        batch_size=128,
        chunk_size=1024,
        normalize_embeddings=True,
    )
    print('Embeddings computed. Shape:', emb.shape)

    # optional: stop the processes in the pool
    model.stop_multi_process_pool(pool)
```

# Save the csv file

```python
import pandas as pd

# one row per sentence, one column per embedding dimension
corpus_embedding = pd.DataFrame(emb)
corpus_embedding.to_csv('quora_questions.csv', index=False)
```
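The CSV written in the last step holds only the embedding vectors. Because `corpus_sentences` is built from a `set`, its order is not guaranteed to be the same in a later run, so the rows cannot be reliably matched back to their questions afterwards. The following is a minimal sketch, not part of the original card, of one way to keep each sentence next to its vector and to run a small similarity lookup. The helper names (`save_with_sentences`, `load_with_sentences`, `top_k`), the output file name, and the toy two-dimensional vectors are all assumptions made for illustration. It uses only the standard library, so it suits small subsets; for the full corpus a NumPy-based version would be the more usual choice.

```python
import csv
import os
import tempfile


def save_with_sentences(path, sentences, embeddings):
    """Write one row per sentence: the text first, then its vector components."""
    with open(path, "w", encoding="utf8", newline="") as f_out:
        writer = csv.writer(f_out)
        dim = len(embeddings[0])
        writer.writerow(["sentence"] + [str(i) for i in range(dim)])
        for sentence, vector in zip(sentences, embeddings):
            writer.writerow([sentence] + [float(x) for x in vector])


def load_with_sentences(path):
    """Read the file back into parallel lists of sentences and vectors."""
    sentences, vectors = [], []
    with open(path, encoding="utf8", newline="") as f_in:
        reader = csv.reader(f_in)
        next(reader)  # skip the header row
        for row in reader:
            sentences.append(row[0])
            vectors.append([float(x) for x in row[1:]])
    return sentences, vectors


def top_k(query_vector, sentences, vectors, k=3):
    """Rank by dot product, which equals cosine similarity for unit-length vectors."""
    scored = [
        (sum(q * v for q, v in zip(query_vector, vec)), sent)
        for sent, vec in zip(sentences, vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins (hypothetical) for corpus_sentences and emb; all vectors are unit length.
demo_sentences = [
    "How do I learn Python?",
    "What is the capital of France?",
    "How can I learn Python quickly?",
]
demo_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]

demo_path = os.path.join(tempfile.mkdtemp(), "quora_questions_with_text.csv")
save_with_sentences(demo_path, demo_sentences, demo_embeddings)
loaded_sentences, loaded_vectors = load_with_sentences(demo_path)

# Query with a vector close to the "learn Python" questions.
results = top_k([1.0, 0.0], loaded_sentences, loaded_vectors, k=2)
for score, sentence in results:
    print(f"{score:.2f}  {sentence}")
```

With real data, `demo_sentences` and `demo_embeddings` would be replaced by `corpus_sentences` and `emb` from the embedding step, and the query vector would come from encoding a question with the same model. The dot-product ranking relies on the vectors being unit length, which is what `normalize_embeddings=True` requests in the script above.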
Provider:
aisuko
