
laion/wikipedia_de_retival_BGE-m3

Hugging Face · Updated 2024-04-24 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/laion/wikipedia_de_retival_BGE-m3
Resource description:
import pandas as pd
import retriv
from retriv import DenseRetriever

retriv.set_base_path("./retriv_wiki_de")

"""
# Uncomment if you want to build your own index
dr = DenseRetriever(
    index_name="wiki_de-index_sentence_transf-BAAI/bge-m3_title_only_fullarticles",
    model="BAAI/bge-m3",
    normalize=True,
    max_length=512,
    use_ann=True,
)
dr = dr.index_file(
    path="./wikipedia_de_filtered_fullarticles.csv",  # File kind is automatically inferred
    embeddings_path=None,  # Default value
    use_gpu=True,  # Encode on the GPU
    batch_size=32,  # Documents encoded per batch
    show_progress=True,  # Default value
    callback=lambda doc: {  # Callback defaults to None.
        "id": doc["id"],
        "text": doc["title"],  # Only the title is embedded
    },
)
"""

# Load the German Wikipedia text data
file_path = "./wikipedia_de_filtered_fullarticles.csv"  # CSV with full article texts
df = pd.read_csv(file_path)
file_path = "./wikipedia_de_filtered_300wordchunks.csv"  # CSV with 300-word chunks
df2 = pd.read_csv(file_path)

# Load the retriever built from article titles.
# The embeddings are made from the titles of the Wikipedia pages, but the IDs
# can be matched to the full texts in wikipedia_de_filtered_fullarticles.csv.
dr = DenseRetriever.load("wiki_de-index_sentence_transf-BAAI/bge-m3_title_only_fullarticles")
result = dr.search(
    query="was is der doppelspaltversuch?",  # What to search for
    return_docs=True,  # Default value, return the text of the documents
    cutoff=3,  # Number of results to return
)
print(df)
for res in result:
    id_query = int(res["id"]) - 1  # The "id" values start at 1, not 0
    row = df.iloc[id_query]
    print(row)
    # Extract 'text' and 'url' from the resulting row
    result_text = row["text"]
    result_url = row["url"]
    print(result_url, result_text[:1000])
    print("###################")

print("+++++++++++++++++++")

# Load the retriever built from 300-word segments of the articles.
# The IDs point to wikipedia_de_filtered_300wordchunks.csv.
dr2 = DenseRetriever.load("wiki_de-index_sentence_transf-BAAI/bge-m3")
result = dr2.search(
    query="was is der doppelspaltversuch?",  # What to search for
    return_docs=True,  # Default value, return the text of the documents
    cutoff=3,  # Number of results to return
)
for res in result:
    id_query = int(res["id"]) - 1  # The "id" values start at 1, not 0, so subtract 1
    row = df2.iloc[id_query]
    print(row)
    # Extract 'text' and 'url' from the resulting row
    result_text = row["text"]
    result_url = row["url"]
    print(result_url, result_text)
    print("########")
Provider:
laion
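Summary of original information

Dataset overview

Dataset files

  • wikipedia_de_filtered_fullarticles.csv: CSV file containing the full article texts.
  • wikipedia_de_filtered_300wordchunks.csv: CSV file containing the articles split into 300-word chunks.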
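The source does not say how the 300-word chunks were produced. As a rough illustration only, the sketch below splits a text on whitespace into consecutive, non-overlapping chunks of at most 300 words. The function name chunk_words and the whitespace-based splitting are assumptions, not the dataset authors' method.

```python
def chunk_words(text, chunk_size=300):
    """Split text into consecutive, non-overlapping chunks of at most chunk_size words.

    Hypothetical helper: the real chunking used for the dataset is not documented.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()  # whitespace tokenization (an assumption)
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    # Synthetic 650-word "article" used only for demonstration
    article = " ".join(f"w{i}" for i in range(650))
    chunks = chunk_words(article)
    print([len(c.split()) for c in chunks])  # [300, 300, 50]
```

The last chunk is simply whatever remains, so it can be much shorter than 300 words.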

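Dataset processing

  • The dataset is indexed and loaded with DenseRetriever.
    • Index name: wiki_de-index_sentence_transf-BAAI/bge-m3_title_only_fullarticles
    • Model: BAAI/bge-m3
    • Parameters:
      • Normalize: True
      • Maximum length: 512
      • Use GPU: True
      • Batch size: 32
      • Show progress: True
  • Search by query and return the document text; the example returns the top 3 results (cutoff=3).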
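With normalized embeddings, the dot product of two vectors equals their cosine similarity, so ranking reduces to sorting dot products. The sketch below shows that idea with exact brute-force search over toy 2-dimensional vectors. It is not how retriv works internally (the index above uses use_ann=True, that is, approximate search), and the vectors, l2_normalize, and top_k are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, cutoff=3):
    """Return the cutoff best (doc_id, cosine score) pairs, highest score first."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        # Dot product of unit vectors equals cosine similarity
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:cutoff]


if __name__ == "__main__":
    # Toy embeddings, not real BGE-m3 output
    docs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [3.0, 3.0], 4: [-1.0, 0.0]}
    for doc_id, score in top_k([2.0, 0.0], docs, cutoff=3):
        print(doc_id, round(score, 4))
```

Because vector length is normalized away, document 3 is ranked by its direction only, not by its larger magnitude.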

Dataset usage example

  • Search query: was is der doppelspaltversuch?
  • Result handling:
    • Extract the text and URL from wikipedia_de_filtered_fullarticles.csv and wikipedia_de_filtered_300wordchunks.csv.
    • Print the text and URL of the search results.
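The original code maps each result to a CSV row by position, subtracting 1 from the 1-based id. A possible alternative is to look rows up by the id column itself, which does not depend on row order. The sketch below uses only the standard library and a tiny made-up CSV. The column names id, url, and text follow the original code, while load_rows_by_id, attach_rows, and the sample rows are hypothetical, and it assumes the ids stored in the index match the CSV id column.

```python
import csv
import io


def load_rows_by_id(csv_file):
    """Read a CSV with an 'id' column into a dict keyed by integer id."""
    return {int(row["id"]): row for row in csv.DictReader(csv_file)}


def attach_rows(results, rows_by_id):
    """Pair each search result with the url and text of the matching CSV row."""
    matched = []
    for res in results:
        doc_id = int(res["id"])  # ids may arrive as strings
        row = rows_by_id.get(doc_id)
        if row is not None:  # skip ids that are missing from the CSV
            matched.append({"id": doc_id, "url": row["url"], "text": row["text"]})
    return matched


if __name__ == "__main__":
    # Made-up sample standing in for the real CSV files
    sample = io.StringIO(
        "id,title,url,text\n"
        "1,A,https://example.org/a,Text A\n"
        "2,B,https://example.org/b,Text B\n"
    )
    rows = load_rows_by_id(sample)
    # Results shaped like the search output used above: dicts with an "id" key
    for hit in attach_rows([{"id": "2"}, {"id": "1"}], rows):
        print(hit["url"], hit["text"])
```

With pandas, the positional lookup in the original code (df.iloc[id - 1]) gives the same rows as long as the CSV is ordered by id starting at 1.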