laion/wikipedia_de_retival_BGE-m3
Hugging Face. Updated 2024-04-24, indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/laion/wikipedia_de_retival_BGE-m3
Description:
import pandas as pd
import retriv

# Point retriv at the directory that holds (or will hold) the indexes
retriv.set_base_path("./retriv_wiki_de")
from retriv import DenseRetriever

"""
# Remove the surrounding triple quotes to build your own index
dr = DenseRetriever(
    index_name="wiki_de-index_sentence_transf-BAAI/bge-m3_title_only_fullarticles",
    model="BAAI/bge-m3",
    normalize=True,
    max_length=512,
    use_ann=True,
)
dr = dr.index_file(
    path="./wikipedia_de_filtered_fullarticles.csv",  # File kind is automatically inferred
    embeddings_path=None,  # Default value
    use_gpu=True,  # Encode on the GPU
    batch_size=32,  # Encoding batch size
    show_progress=True,  # Default value
    callback=lambda doc: {  # Callback defaults to None; here only the title is embedded
        "id": doc["id"],
        "text": doc["title"],
    },
)
"""
# Load the German Wikipedia text data
file_path = "./wikipedia_de_filtered_fullarticles.csv"  # CSV with the full article texts
df = pd.read_csv(file_path)
file_path = "./wikipedia_de_filtered_300wordchunks.csv"  # CSV with the 300-word chunks
df2 = pd.read_csv(file_path)
# Load the retrievers
# The embeddings in this index are made from the titles of the Wikipedia pages,
# but the IDs can be matched to the full texts in wikipedia_de_filtered_fullarticles.csv
dr = DenseRetriever.load("wiki_de-index_sentence_transf-BAAI/bge-m3_title_only_fullarticles")
result = dr.search(
    query="was is der doppelspaltversuch?",  # What to search for
    return_docs=True,  # Default value, return the text of the documents
    cutoff=3,  # Number of results to return
)
print(df)
for res in result:
    # The "id" values start at 1, not 0, so subtract 1 to get the row position
    id_query = int(res["id"]) - 1
    row = df.iloc[id_query]
    print(row)
    # Extract 'text' and 'url' from the resulting row
    result_text = row['text']
    result_url = row['url']
    print(result_url, result_text[:1000])
    print("###################")
print("+++++++++++++++++++")
# The embeddings in this index are made from 300-word segments of the articles.
# The IDs point to wikipedia_de_filtered_300wordchunks.csv
dr2 = DenseRetriever.load("wiki_de-index_sentence_transf-BAAI/bge-m3")
result = dr2.search(
    query="was is der doppelspaltversuch?",  # What to search for
    return_docs=True,  # Default value, return the text of the documents
    cutoff=3,  # Number of results to return
)
for res in result:
    # The "id" values start at 1, not 0, so subtract 1 to get the row position
    id_query = int(res["id"]) - 1
    row = df2.iloc[id_query]
    print(row)
    # Extract 'text' and 'url' from the resulting row
    result_text = row['text']
    result_url = row['url']
    print(result_url, result_text)
    print("########")
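The ID handling in the script above can be isolated into a small helper. The following is a minimal sketch using only the standard library, not part of the dataset card: the sample rows, the column names (id, title, url, text) and the helper names `load_rows` and `row_for_result` are assumptions made for illustration, and the result dict only loosely imitates a retriv hit.

```python
import csv
import io

# Hypothetical stand-in for a few rows of wikipedia_de_filtered_fullarticles.csv.
# The column names are assumed from the fields the script above reads.
SAMPLE_CSV = """id,title,url,text
1,Alan Turing,https://de.wikipedia.org/wiki/Alan_Turing,Alan Turing war ein britischer Mathematiker.
2,Doppelspaltexperiment,https://de.wikipedia.org/wiki/Doppelspaltexperiment,Beim Doppelspaltexperiment treten Wellen durch zwei Spalte.
3,Berlin,https://de.wikipedia.org/wiki/Berlin,Berlin ist die Hauptstadt Deutschlands.
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row, in file order."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def row_for_result(rows, result):
    """Map a search hit with a 1-based "id" to the matching 0-based row."""
    position = int(result["id"]) - 1
    if not 0 <= position < len(rows):
        raise IndexError(f"result id {result['id']} is outside the table")
    row = rows[position]
    # Guard against a CSV whose row order no longer matches its id column.
    if int(row["id"]) != int(result["id"]):
        raise ValueError("row order does not match the id column")
    return row


rows = load_rows(SAMPLE_CSV)
fake_result = {"id": "2", "score": 0.87}  # made-up hit, for illustration only
row = row_for_result(rows, fake_result)
print(row["url"], row["text"][:1000])
```

The explicit bounds and consistency checks are a design choice for this sketch: a bare `id - 1` lookup silently returns the wrong article if the CSV has been filtered or re-sorted since the index was built.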
Provider:
laion
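Summary of original information
Dataset overview
Dataset files
- wikipedia_de_filtered_fullarticles.csv: CSV file containing the full article texts.
- wikipedia_de_filtered_300wordchunks.csv: CSV file containing the articles split into 300-word chunks.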
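The dataset card does not document how the 300-word chunks were produced. As a rough illustration only, the sketch below assumes whitespace tokenization and non-overlapping windows; the function name `chunk_words` and both of those choices are assumptions, not the authors' method.

```python
def chunk_words(text, chunk_size=300):
    """Split text into consecutive, non-overlapping chunks of at most chunk_size words."""
    words = text.split()  # assumption: words are whitespace-separated tokens
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Synthetic 650-word "article" so the chunk sizes are easy to check.
article = " ".join(f"wort{i}" for i in range(650))
chunks = chunk_words(article)
print([len(chunk.split()) for chunk in chunks])  # [300, 300, 50]
```

The last chunk is shorter than the others whenever the article length is not a multiple of the chunk size.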
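Dataset processing
- DenseRetriever is used to index and load the dataset.
  - Index name: wiki_de-index_sentence_transf-BAAI/bge-m3_title_only_fullarticles
  - Model: BAAI/bge-m3
  - Parameter settings:
    - Normalize: True
    - Max length: 512
    - Use GPU: True
    - Batch size: 32
    - Show progress: True
- Searching with a query returns the document text; the example sets cutoff=3 and so returns the top 3 results.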
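The Normalize setting matters for scoring: assuming it L2-normalizes the embeddings, a plain dot product between a query vector and a document vector equals their cosine similarity. The sketch below shows this with toy 3-dimensional vectors; the vectors and helper names are made up, and real BAAI/bge-m3 embeddings have far more dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [3.0, 4.0, 0.0]  # toy values, not real embeddings
doc_vec = [1.0, 2.0, 2.0]

score_normalized = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
score_cosine = cosine(query_vec, doc_vec)
print(math.isclose(score_normalized, score_cosine))  # True
```

Normalizing once at indexing time lets the search step use the cheaper dot product without changing the ranking that cosine similarity would give.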
Dataset usage example
- Search query: was is der doppelspaltversuch?
- Result handling:
  - Extract the text and URL from wikipedia_de_filtered_fullarticles.csv and wikipedia_de_filtered_300wordchunks.csv.
  - Print the text and URL of each search result.



