karmiq/wikipedia-embeddings-cs-e5-base

Source: Hugging Face. Updated 2024-01-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/karmiq/wikipedia-embeddings-cs-e5-base
Description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: chunks
    sequence: string
  - name: embeddings
    sequence:
      sequence: float32
  splits:
  - name: train
    num_bytes: 5021489124
    num_examples: 534044
  download_size: 4750515911
  dataset_size: 5021489124
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- cs
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
license:
- cc-by-sa-3.0
- gfdl
---

This dataset contains the Czech subset of the [`wikimedia/wikipedia`](https://huggingface.co/datasets/wikimedia/wikipedia) dataset. Each page is divided into paragraphs, stored as a list in the `chunks` column. For every paragraph, an embedding is created using the [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) model and stored at the same position in the `embeddings` column.

## Usage

Load the dataset:

```python
from datasets import load_dataset

ds = load_dataset("karmiq/wikipedia-embeddings-cs-e5-base", split="train")
ds[1]
```

```
{
  'id': '1',
  'url': 'https://cs.wikipedia.org/wiki/Astronomie',
  'title': 'Astronomie',
  'chunks': [
    'Astronomie, řecky αστρονομία z άστρον ( astron ) hvězda a νόμος ( nomos )...',
    'Myšlenky Aristotelovy rozvinul ve 2. století našeho letopočtu Klaudios Ptolemaios...',
    ...,
  ],
  'embeddings': [
    [0.09006806463003159, -0.009814552962779999, ...],
    [0.10767366737127304, ...],
    ...
  ]
}
```

This structure makes it easy to use the dataset for implementing semantic search.

<details>
<summary>Load the data in Elasticsearch</summary>

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
from tqdm import tqdm

# Assumption: an Elasticsearch node is reachable at this address; adjust as needed.
es = Elasticsearch("http://localhost:9200")

def doc_generator(data, batch_size=1000):
    # Yield one document per page, with its chunks and embeddings paired up.
    for batch in data.with_format("numpy").iter(batch_size):
        for i, id in enumerate(batch["id"]):
            output = {"id": id}
            output["title"] = batch["title"][i]
            output["url"] = batch["url"][i]
            output["parts"] = [
                {"chunk": chunk, "embedding": embedding}
                for chunk, embedding in zip(batch["chunks"][i], batch["embeddings"][i])
            ]
            yield output

num_indexed, num_failed = 0, 0
progress = tqdm(total=ds.num_rows, unit="doc", desc="Indexing")

for ok, info in parallel_bulk(
    es,
    index="wikipedia-search",
    actions=doc_generator(ds),
    raise_on_error=False,
):
    if ok:
        num_indexed += 1
    else:
        num_failed += 1
        print(f"ERROR {info['index']['status']}: "
              f"{info['index']['error']['type']}: {info['index']['error']['caused_by']['type']}: "
              f"{info['index']['error']['caused_by']['reason'][:250]}")
    progress.update(1)
```

</details>

<details>
<summary>Use <code>sentence_transformers.util.semantic_search</code></summary>

```python
import os
import textwrap

import sentence_transformers

model = sentence_transformers.SentenceTransformer("intfloat/multilingual-e5-base")

ds.set_format(type="torch", columns=["embeddings"], output_all_columns=True)

# Flatten the dataset: one row per chunk instead of one row per page
def explode_sequence(batch):
    output = {"id": [], "url": [], "title": [], "chunk": [], "embedding": []}

    for id, url, title, chunks, embeddings in zip(
        batch["id"], batch["url"], batch["title"], batch["chunks"], batch["embeddings"]
    ):
        # Repeat the page-level fields once per chunk
        output["id"].extend([id for _ in range(len(chunks))])
        output["url"].extend([url for _ in range(len(chunks))])
        output["title"].extend([title for _ in range(len(chunks))])
        output["chunk"].extend(chunks)
        output["embedding"].extend(embeddings)

    return output

ds_flat = ds.map(
    explode_sequence,
    batched=True,
    remove_columns=ds.column_names,
    num_proc=min(os.cpu_count(), 32),
    desc="Flatten")
ds_flat

query = "Čím se zabývá fyzika?"

hits = sentence_transformers.util.semantic_search(
    query_embeddings=model.encode(query),
    corpus_embeddings=ds_flat["embedding"],
    top_k=10)

for hit in hits[0]:
    title = ds_flat[hit['corpus_id']]['title']
    chunk = ds_flat[hit['corpus_id']]['chunk']
    print(f"[{hit['score']:0.2f}] {textwrap.shorten(chunk, width=100, placeholder='…')} [{title}]")

# [0.90] Fyzika částic ( též částicová fyzika ) je oblast fyziky, která se zabývá částicemi. V širším smyslu… [Fyzika částic]
# [0.89] Fyzika ( z řeckého φυσικός ( fysikos ): přírodní, ze základu φύσις ( fysis ): příroda, archaicky… [Fyzika]
# ...
```

</details>

Generating the embeddings took about 2 hours on an NVIDIA A100 80GB GPU.

## License

See the license of the original dataset: <https://huggingface.co/datasets/wikimedia/wikipedia>.

Provider:
karmiq
Summary of original information

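Dataset overview

Dataset information

  • Features:
    • id: string
    • url: string
    • title: string
    • chunks: sequence of strings
    • embeddings: sequence of sequences of float32 values
  • Splits:
    • train: 534044 examples, 5021489124 bytes in total
  • Download size: 4750515911 bytes
  • Dataset size: 5021489124 bytes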

Configuration

  • Default configuration:
    • Data file path for the train split: data/train-*

Language

  • Czech (cs)

Size category

  • 100K < n < 1M

Task categories

  • Text generation
  • Fill-mask

Licenses

  • CC BY-SA 3.0
  • GFDL

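Dataset description

  • The dataset contains the Czech subset of Wikipedia. Each page is divided into paragraphs, which are stored in the chunks column. An embedding is generated for each paragraph with the intfloat/multilingual-e5-base model and stored in the embeddings column.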
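
The chunk and embedding pairing described above lends itself to brute-force semantic search within a page. The following is a minimal sketch, not the dataset author's method: it runs on a hand-made toy record with the same `chunks`/`embeddings` layout and 3-dimensional placeholder vectors (real `intfloat/multilingual-e5-base` vectors have 768 dimensions), and the helper name `rank_chunks` is hypothetical.

```python
import numpy as np

# Toy record mirroring the dataset schema; the text and vectors are made-up placeholders.
record = {
    "title": "Astronomie",
    "chunks": ["first paragraph", "second paragraph", "third paragraph"],
    "embeddings": [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.6, 0.8, 0.0],
    ],
}

def rank_chunks(query_vec, record, top_k=2):
    """Return (score, chunk) pairs sorted by cosine similarity, best first."""
    emb = np.asarray(record["embeddings"], dtype=np.float32)
    q = np.asarray(query_vec, dtype=np.float32)
    # Normalize rows and the query so the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = emb @ q
    order = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), record["chunks"][i]) for i in order]

# The query vector stands in for an encoded query; here it is a placeholder.
results = rank_chunks([0.0, 1.0, 0.0], record)
for score, chunk in results:
    print(f"[{score:0.2f}] {chunk}")
```

For the full dataset, a vector index or the Elasticsearch and `sentence_transformers` approaches from the dataset card are likely a better fit than this in-memory loop.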

Usage example

  • Example code for loading the dataset:

        from datasets import load_dataset

        ds = load_dataset("karmiq/wikipedia-embeddings-cs-e5-base", split="train")
        ds[1]

    Example output (abridged):

        {
          "id": "1",
          "url": "https://cs.wikipedia.org/wiki/Astronomie",
          "title": "Astronomie",
          "chunks": [
            "Astronomie, řecky αστρονομία z άστρον ( astron ) hvězda a νόμος ( nomos )...",
            "Myšlenky Aristotelovy rozvinul ve 2. století našeho letopočtu Klaudios Ptolemaios...",
            ...
          ],
          "embeddings": [
            [0.09006806463003159, -0.009814552962779999, ...],
            [0.10767366737127304, ...],
            ...
          ]
        }
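
Since the example output pairs each entry in `chunks` with an entry in `embeddings`, a quick consistency check before indexing can catch malformed rows. The sketch below is an illustration under assumptions: `check_record` is a hypothetical helper, the one-to-one alignment is inferred from the example rather than guaranteed, and the sample records use shortened placeholder values instead of real data.

```python
def check_record(record):
    """Return a list of problems found in one dataset-style record."""
    problems = []
    for key in ("id", "url", "title", "chunks", "embeddings"):
        if key not in record:
            problems.append(f"missing field: {key}")
    if problems:
        return problems

    chunks, embeddings = record["chunks"], record["embeddings"]
    # Assumed invariant: one embedding vector per chunk.
    if len(chunks) != len(embeddings):
        problems.append(f"{len(chunks)} chunks but {len(embeddings)} embeddings")
    # All vectors in a record should share the same dimensionality.
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        problems.append(f"inconsistent embedding sizes: {sorted(dims)}")
    return problems

# Placeholder records: the vectors are truncated to two values for brevity.
good = {
    "id": "1",
    "url": "https://cs.wikipedia.org/wiki/Astronomie",
    "title": "Astronomie",
    "chunks": ["chunk one", "chunk two"],
    "embeddings": [[0.09, -0.01], [0.11, 0.02]],
}
bad = dict(good, embeddings=[[0.09, -0.01]])

print(check_record(good))
print(check_record(bad))
```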

Embedding generation time

  • Generating the embeddings took about 2 hours on an NVIDIA A100 80GB GPU.
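Background overview
The dataset contains paragraphs from Czech Wikipedia pages together with their embeddings and is suited to semantic search tasks. It has 534,044 rows, with a download size of about 4.75 GB and a dataset size of about 5.02 GB; the embeddings were generated with the `intfloat/multilingual-e5-base` model.
The content above was collected and summarized by 遇见数据集.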