GlobalCampus/openalex-multilingual-embeddings

Name: GlobalCampus/openalex-multilingual-embeddings
Creator: GlobalCampus
Published: 2024-01-26 08:34:09
License: cc0-1.0

Source: Hugging Face. Updated 2024-01-26. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/GlobalCampus/openalex-multilingual-embeddings

Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: embedding
    sequence: float64
  splits:
  - name: train
    num_bytes: 751739666430
    num_examples: 243212198
  download_size: 640572858900
  dataset_size: 751739666430
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc0-1.0
tags:
- openalex
- embeddings
pretty_name: OpenAlex Multilingual Embeddings
source_dataset:
- openalex
---

# OpenAlex Multilingual Embeddings

This dataset contains multilingual text embeddings of all records in [OpenAlex](https://openalex.org/) that have a title or an abstract, taken from the snapshot of 2023-10-20. The dataset was created for the [FORAS project](https://asreview.nl/project/foras/) to investigate the efficacy of different methods of searching in databases of academic publications. All scripts will be available in a [GitHub repository](https://github.com/IDfuse/foras). The project is supported by a grant from the Dutch Research Council (grant no. 406.22.GO.048).

## Description of the data

- The dataset has two columns, `id` and `embedding`. The `id` column contains the OpenAlex identifier of the record. The `embedding` column contains the text embedding, which is a vector of 384 floats.
- The multilingual embedding model [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) was used to generate the embeddings. For every record with a title or abstract, we generated an embedding of `'query: '` + `title` + `' '` + `abstract`. The model has a maximum input length of 512 tokens.
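
Because each row pairs an `id` with an embedding vector, searching the dataset comes down to ranking rows by their similarity to a query vector. The following is a minimal sketch of that idea using cosine similarity, not the FORAS project's own code. The function names `cosine_similarity` and `top_k` are hypothetical, and the rows are 3-dimensional toy vectors with made-up ids standing in for real 384-dimensional records.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Iterable[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every (id, embedding) row against the query and keep the k best."""
    scored = [(row_id, cosine_similarity(query, emb)) for row_id, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data only: real rows would hold an OpenAlex id and a 384-float vector.
toy_rows = [
    ("toy-1", [1.0, 0.0, 0.0]),
    ("toy-2", [0.0, 1.0, 0.0]),
    ("toy-3", [0.7, 0.7, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], toy_rows, k=2)
for row_id, score in results:
    print(row_id, round(score, 3))
```

A linear scan like this is only meant to illustrate the ranking step. With roughly 243 million rows, a practical search would more likely rely on an approximate nearest-neighbour index.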

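This dataset contains multilingual text embeddings of OpenAlex records that have a title or an abstract, taken from the snapshot of 2023-10-20. It was created mainly for the FORAS project, which studies the effectiveness of different search methods in databases of academic publications. The dataset has two columns, `id` and `embedding`: `id` is the OpenAlex identifier of the record, and `embedding` is a 384-dimensional vector of floats generated by the intfloat/multilingual-e5-small model.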

Provider:

GlobalCampus

Summary of original information

OpenAlex Multilingual Embeddings: dataset overview

Dataset information

Features:
- id: string; the OpenAlex identifier of the record.
- embedding: sequence; the text embedding, a vector of 384 floats.
Splits:
- train: 243212198 examples, 751739666430 bytes in total.
Download size: 640572858900 bytes.
Dataset size: 751739666430 bytes.
Configs:
- default: data files at data/train-*.
License: cc0-1.0.
Tags:
- openalex
- embeddings
Pretty name: OpenAlex Multilingual Embeddings
Source dataset: openalex
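
The size figures above can be sanity-checked against the declared vector shape. The sketch below uses only the numbers given in the dataset metadata (the example count, the byte sizes, 384 floats per embedding, and the `float64` dtype) and assumes 8 bytes per `float64` value. The variable names are illustrative.

```python
# Figures copied from the dataset metadata.
NUM_EXAMPLES = 243_212_198
DATASET_SIZE_BYTES = 751_739_666_430
DOWNLOAD_SIZE_BYTES = 640_572_858_900
EMBEDDING_DIM = 384
BYTES_PER_FLOAT64 = 8  # a float64 value occupies 8 bytes

# Raw bytes needed for one embedding vector: 384 * 8 = 3072.
embedding_bytes_per_row = EMBEDDING_DIM * BYTES_PER_FLOAT64

# Average bytes per row according to the declared dataset size.
bytes_per_row = DATASET_SIZE_BYTES / NUM_EXAMPLES

# Share of the dataset size accounted for by the raw float values alone.
embedding_share = embedding_bytes_per_row * NUM_EXAMPLES / DATASET_SIZE_BYTES

# Ratio of the download size to the in-memory dataset size.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"embedding bytes per row: {embedding_bytes_per_row}")
print(f"average bytes per row:   {bytes_per_row:.1f}")
print(f"embedding share of size: {embedding_share:.2%}")
print(f"download / dataset size: {download_ratio:.2%}")
```

The average row is only slightly larger than the 3072 bytes of raw vector data, so the embeddings account for nearly all of the roughly 752 GB dataset size; the remainder is left for the `id` strings and storage overhead.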

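Data description

The dataset has two columns: id and embedding.
The text embeddings in the embedding column were generated with the intfloat/multilingual-e5-small model. For each record with a title or an abstract, one embedding was generated from the input 'query: ' + title + ' ' + abstract. The model has a maximum input length of 512 tokens.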
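
The input format described above can be sketched as a small helper. This is an illustration under stated assumptions, not the authors' script: `build_e5_input` is a hypothetical name, and the source does not say how a missing title or abstract was handled, so the sketch treats a missing field as an empty string and skips records that have neither.

```python
from typing import Optional


def build_e5_input(title: Optional[str], abstract: Optional[str]) -> Optional[str]:
    """Build the model input as 'query: ' + title + ' ' + abstract.

    Assumption: a missing field is replaced by an empty string, and a record
    with neither a title nor an abstract yields None (no embedding).
    """
    if not title and not abstract:
        return None
    return "query: " + (title or "") + " " + (abstract or "")


# Made-up example record, for illustration only.
example = build_e5_input("A study of search methods", "We compare several approaches.")
print(example)
```

The 512-token limit applies after tokenization, so it cannot be checked on the raw string alone. How inputs longer than the limit were handled is not stated in the source and is not shown here.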

The content above was collected and summarized by 遇见数据集.
