
karmiq/wikipedia-embeddings-cs-seznam-mpnet

Hugging Face; updated 2024-02-29; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/karmiq/wikipedia-embeddings-cs-seznam-mpnet
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: chunks
    sequence: string
  - name: embeddings
    sequence:
      sequence: float32
  splits:
  - name: train
    num_bytes: 2580729273
    num_examples: 534044
  download_size: 2307703671
  dataset_size: 2580729273
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- cs
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
license:
- cc-by-sa-3.0
- gfdl
---

This dataset contains the Czech subset of the [`wikimedia/wikipedia`](https://huggingface.co/datasets/wikimedia/wikipedia) dataset. Each page is divided into paragraphs, stored as a list in the `chunks` column. For every paragraph, embeddings are created using the [`Seznam/simcse-dist-mpnet-paracrawl-cs-en`](https://huggingface.co/Seznam/simcse-dist-mpnet-paracrawl-cs-en) model.

## Usage

Load the dataset:

```python
from datasets import load_dataset

ds = load_dataset("karmiq/wikipedia-embeddings-cs-seznam-mpnet", split="train")
ds[1]
```

```
{
  'id': '1',
  'url': 'https://cs.wikipedia.org/wiki/Astronomie',
  'title': 'Astronomie',
  'chunks': [
    'Astronomie, řecky αστρονομία z άστρον ( astron ) hvězda a νόμος ( nomos ) ...',
    'Novověk Roku 1514 navrhl Mikuláš Koperník nový model, ve kterém bylo ...',
    ...,
  ],
  'embeddings': [
    [ 0.653917670249939, -0.879465639591217, 0.3993946313858032, ... ],
    [ 0.0035442777443677187, -1.0201066732406616, -0.06573136150836945, ... ]
  ]
}
```

The structure makes it easy to use the dataset for implementing semantic search.

<details>
<summary>Load the data in Elasticsearch</summary>

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
from tqdm import tqdm

# Assumption: a local Elasticsearch node; adjust the URL and credentials for your cluster
es = Elasticsearch("http://localhost:9200")

def doc_generator(data, batch_size=1000):
    # Yield one document per page, pairing every chunk with its embedding
    for batch in data.with_format("numpy").iter(batch_size):
        for i, id in enumerate(batch["id"]):
            output = {"id": id}
            output["title"] = batch["title"][i]
            output["url"] = batch["url"][i]
            output["parts"] = [
                {"chunk": chunk, "embedding": embedding}
                for chunk, embedding in zip(batch["chunks"][i], batch["embeddings"][i])
            ]
            yield output

num_indexed, num_failed = 0, 0
progress = tqdm(total=ds.num_rows, unit="doc", desc="Indexing")

for ok, info in parallel_bulk(
    es,
    index="wikipedia-search",
    actions=doc_generator(ds),
    raise_on_error=False,
):
    if ok:
        num_indexed += 1
    else:
        num_failed += 1
        print(f"ERROR {info['index']['status']}: {info['index']['error']}")
    progress.update(1)
```

</details>

<details>
<summary>Use <code>sentence_transformers.util.semantic_search</code></summary>

```python
import os
import textwrap

import sentence_transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling

embedding_model = Transformer("Seznam/simcse-dist-mpnet-paracrawl-cs-en")
pooling = Pooling(word_embedding_dimension=embedding_model.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[embedding_model, pooling])

ds.set_format(type="torch", columns=["embeddings"], output_all_columns=True)

# Flatten the dataset: one row per chunk instead of one row per page
def explode_sequence(batch):
    output = { "id": [], "url": [], "title": [], "chunk": [], "embedding": [] }

    for id, url, title, chunks, embeddings in zip(
        batch["id"], batch["url"], batch["title"], batch["chunks"], batch["embeddings"]
    ):
        output["id"].extend([id for _ in range(len(chunks))])
        output["url"].extend([url for _ in range(len(chunks))])
        output["title"].extend([title for _ in range(len(chunks))])
        output["chunk"].extend(chunks)
        output["embedding"].extend(embeddings)

    return output

ds_flat = ds.map(
    explode_sequence,
    batched=True,
    remove_columns=ds.column_names,
    num_proc=min(os.cpu_count() or 1, 32),  # cpu_count() can return None
    desc="Flatten")
ds_flat

query = "Čím se zabývá fyzika?"

hits = sentence_transformers.util.semantic_search(
    query_embeddings=model.encode(query),
    corpus_embeddings=ds_flat["embedding"],
    top_k=10)

for hit in hits[0]:
    title = ds_flat[hit['corpus_id']]['title']
    chunk = ds_flat[hit['corpus_id']]['chunk']
    print(f"[{hit['score']:0.2f}] {textwrap.shorten(chunk, width=100, placeholder='…')} [{title}]")

# [0.72] Molekulová fyzika ( též molekulární fyzika ) je část fyziky, která zkoumá látky na úrovni atomů a… [Molekulová fyzika]
# [0.70] Fyzika ( z řeckého φυσικός ( fysikos ): přírodní, ze základu φύσις ( fysis ): příroda, archaicky… [Fyzika]
# ...
```

</details>

Generating the embeddings took about 35 minutes on an NVIDIA A100 80GB.

## License

See the license of the original dataset: <https://huggingface.co/datasets/wikimedia/wikipedia>.
Provided by:
karmiq
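Summary of original information

Dataset overview

Dataset information

  • Features:
    • id: string
    • url: string
    • title: string
    • chunks: sequence of strings
    • embeddings: sequence of sequences of float32
  • Splits:
    • train: 534044 examples, 2580729273 bytes in total
  • Download size: 2307703671 bytes
  • Dataset size: 2580729273 bytes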
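
To put the raw byte counts above in perspective, the following short sketch converts them to GiB and derives the average stored size per page. The constants are copied from the metadata; the helper name `to_gib` is only illustrative.

```python
# Figures copied from the dataset metadata listed above.
NUM_EXAMPLES = 534_044
DATASET_SIZE_BYTES = 2_580_729_273
DOWNLOAD_SIZE_BYTES = 2_307_703_671

GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (illustrative helper)."""
    return num_bytes / GIB


# Average stored size of one page, including its chunks and embeddings.
avg_bytes_per_page = DATASET_SIZE_BYTES / NUM_EXAMPLES
# Fraction of the in-memory dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"dataset size:   {to_gib(DATASET_SIZE_BYTES):.2f} GiB")
print(f"download size:  {to_gib(DOWNLOAD_SIZE_BYTES):.2f} GiB")
print(f"avg per page:   {avg_bytes_per_page:.0f} bytes")
print(f"download ratio: {download_ratio:.2f}")
```

The download is therefore a little over 2 GiB, which is worth knowing before calling `load_dataset` on a machine with limited disk space.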

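Configuration

  • Default configuration:
    • Data file path: data/train-*

Language

  • Czech

Size category

  • 100K < n < 1M

Task categories

  • Text generation
  • Fill-mask

License

  • CC BY-SA 3.0
  • GFDL

Description

This dataset contains the Czech subset of the wikimedia/wikipedia dataset. Each page is divided into paragraphs, stored in the chunks column. An embedding is generated for each paragraph using the Seznam/simcse-dist-mpnet-paracrawl-cs-en model.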
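
The relationship between `chunks` and `embeddings` can be sketched on a toy record. This is a minimal illustration, assuming (as the dataset card's own examples do when they zip the two columns) that the i-th embedding belongs to the i-th chunk. The record contents, the three-dimensional vectors, and the helper `iter_chunks` are made up for the example and are not taken from the dataset.

```python
# Toy record that mimics the dataset schema; text and vectors are invented.
record = {
    "id": "0",
    "url": "https://example.org/page",
    "title": "Example page",
    "chunks": ["First paragraph.", "Second paragraph."],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}


def iter_chunks(record):
    """Yield one flat row per paragraph, pairing each chunk with its vector."""
    if len(record["chunks"]) != len(record["embeddings"]):
        raise ValueError("chunks and embeddings must have the same length")
    for position, (chunk, embedding) in enumerate(
        zip(record["chunks"], record["embeddings"])
    ):
        yield {
            "id": record["id"],
            "title": record["title"],
            "position": position,  # index of the paragraph within the page
            "chunk": chunk,
            "embedding": embedding,
        }


rows = list(iter_chunks(record))
for row in rows:
    print(row["position"], row["chunk"], row["embedding"])
```

Keeping the page-level fields on every flat row makes it possible to show the source title and URL next to each matching paragraph.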

Usage example

```python
from datasets import load_dataset

ds = load_dataset("karmiq/wikipedia-embeddings-cs-seznam-mpnet", split="train")
ds[1]
```

Example output:

```json
{
  "id": "1",
  "url": "https://cs.wikipedia.org/wiki/Astronomie",
  "title": "Astronomie",
  "chunks": [
    "Astronomie, řecky αστρονομία z άστρον ( astron ) hvězda a νόμος ( nomos ) ...",
    "Novověk Roku 1514 navrhl Mikuláš Koperník nový model, ve kterém bylo ...",
    ...
  ],
  "embeddings": [
    [ 0.653917670249939, -0.879465639591217, 0.3993946313858032, ... ],
    [ 0.0035442777443677187, -1.0201066732406616, -0.06573136150836945, ... ]
  ]
}
```

Structure

The dataset structure makes it straightforward to implement semantic search.
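
As a rough illustration of why this layout suits semantic search, the sketch below ranks paragraphs by cosine similarity to a query vector using only the standard library. It is not the dataset author's method: the records, the two-dimensional vectors, and the function names are hypothetical, and in real use the query vector would have to come from the same model that produced the stored embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_embedding, records, top_k=3):
    """Score every chunk of every record and return the top_k best hits."""
    hits = []
    for record in records:
        for chunk, embedding in zip(record["chunks"], record["embeddings"]):
            score = cosine_similarity(query_embedding, embedding)
            hits.append((score, record["title"], chunk))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_k]


# Hypothetical records in the dataset's shape, with made-up 2-D vectors.
toy_records = [
    {"title": "Fyzika", "chunks": ["chunk A", "chunk B"],
     "embeddings": [[1.0, 0.0], [0.6, 0.8]]},
    {"title": "Astronomie", "chunks": ["chunk C"],
     "embeddings": [[0.0, 1.0]]},
]
toy_query = [1.0, 0.1]  # stands in for an encoded query

results = search(toy_query, toy_records, top_k=2)
for score, title, chunk in results:
    print(f"[{score:0.2f}] {chunk} [{title}]")
```

A brute-force scan like this is only practical for small samples; for the full dataset, a vector index or a batched tensor search is the more realistic choice.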

Embedding generation time

Generating the embeddings took about 35 minutes on an NVIDIA A100 80GB.
