pietrolesci/eurlex-57k
Source: Hugging Face · Updated 2023-09-11 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/pietrolesci/eurlex-57k
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: embedding_all-MiniLM-L12-v2
  data_files:
  - split: train
    path: embedding_all-MiniLM-L12-v2/train-*
  - split: validation
    path: embedding_all-MiniLM-L12-v2/validation-*
  - split: test
    path: embedding_all-MiniLM-L12-v2/test-*
- config_name: embedding_all-mpnet-base-v2
  data_files:
  - split: train
    path: embedding_all-mpnet-base-v2/train-*
  - split: validation
    path: embedding_all-mpnet-base-v2/validation-*
  - split: test
    path: embedding_all-mpnet-base-v2/test-*
- config_name: embedding_multi-qa-mpnet-base-dot-v1
  data_files:
  - split: train
    path: embedding_multi-qa-mpnet-base-dot-v1/train-*
  - split: validation
    path: embedding_multi-qa-mpnet-base-dot-v1/validation-*
  - split: test
    path: embedding_multi-qa-mpnet-base-dot-v1/test-*
- config_name: eurovoc_concepts
  data_files:
  - split: train
    path: eurovoc_concepts/train-*
dataset_info:
- config_name: default
  features:
  - name: celex_id
    dtype: string
  - name: document_type
    dtype: string
  - name: title
    dtype: string
  - name: header
    dtype: string
  - name: recitals
    dtype: string
  - name: main_body
    sequence: string
  - name: eurovoc_concepts
    sequence: string
  - name: text
    dtype: string
  - name: uid
    dtype: int64
  splits:
  - name: train
    num_bytes: 269684150
    num_examples: 45000
  - name: validation
    num_bytes: 35266624
    num_examples: 6000
  - name: test
    num_bytes: 35621361
    num_examples: 6000
  download_size: 0
  dataset_size: 340572135
- config_name: embedding_all-MiniLM-L12-v2
  features:
  - name: uid
    dtype: int64
  - name: embedding_all-MiniLM-L12-v2
    sequence: float32
  splits:
  - name: train
    num_bytes: 69660000
    num_examples: 45000
  - name: validation
    num_bytes: 9288000
    num_examples: 6000
  - name: test
    num_bytes: 9288000
    num_examples: 6000
  download_size: 123441408
  dataset_size: 88236000
- config_name: embedding_all-mpnet-base-v2
  features:
  - name: uid
    dtype: int64
  - name: embedding_all-mpnet-base-v2
    sequence: float32
  splits:
  - name: train
    num_bytes: 138780000
    num_examples: 45000
  - name: validation
    num_bytes: 18504000
    num_examples: 6000
  - name: test
    num_bytes: 18504000
    num_examples: 6000
  download_size: 211031101
  dataset_size: 175788000
- config_name: embedding_multi-qa-mpnet-base-dot-v1
  features:
  - name: uid
    dtype: int64
  - name: embedding_multi-qa-mpnet-base-dot-v1
    sequence: float32
  splits:
  - name: train
    num_bytes: 138780000
    num_examples: 45000
  - name: validation
    num_bytes: 18504000
    num_examples: 6000
  - name: test
    num_bytes: 18504000
    num_examples: 6000
  download_size: 211029593
  dataset_size: 175788000
- config_name: eurovoc_concepts
  features:
  - name: concept_id
    dtype: string
  - name: title
    dtype: string
  splits:
  - name: train
    num_bytes: 205049
    num_examples: 7201
  download_size: 157326
  dataset_size: 205049
---
# Dataset Card for "eurlex-57k"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
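
The card gives no usage notes. As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable, the configurations declared in the front matter could be loaded as shown here. `CONFIG_SPLITS` and `load_config` are hypothetical helper names introduced for illustration; they are not part of the dataset or the library.

```python
REPO_ID = "pietrolesci/eurlex-57k"

# Config names and their splits, copied from the card's front matter.
CONFIG_SPLITS = {
    "default": ("train", "validation", "test"),
    "embedding_all-MiniLM-L12-v2": ("train", "validation", "test"),
    "embedding_all-mpnet-base-v2": ("train", "validation", "test"),
    "embedding_multi-qa-mpnet-base-dot-v1": ("train", "validation", "test"),
    "eurovoc_concepts": ("train",),
}


def load_config(config_name, split="train"):
    """Load one split of one config (hypothetical helper).

    Validates the names locally first, so typos fail before any download.
    """
    if config_name not in CONFIG_SPLITS:
        raise ValueError(f"unknown config: {config_name!r}")
    if split not in CONFIG_SPLITS[config_name]:
        raise ValueError(f"config {config_name!r} has no split {split!r}")
    # Imported lazily so the module can be inspected without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)
```

Calling `load_config("default", "validation")` would then fetch the validation split of the main documents, while `load_config("eurovoc_concepts", "test")` raises `ValueError` because that configuration only declares a `train` split.
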
Provider:
pietrolesci
Summary of original information
Dataset overview
Configuration details
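Default configuration (`default`):
- Data file paths
  - Train: `data/train-*`
  - Validation: `data/validation-*`
  - Test: `data/test-*`
- Features
  - `celex_id`: string
  - `document_type`: string
  - `title`: string
  - `header`: string
  - `recitals`: string
  - `main_body`: sequence of strings
  - `eurovoc_concepts`: sequence of strings
  - `text`: string
  - `uid`: 64-bit integer
- Splits
  - Train: 45000 examples, 269684150 bytes
  - Validation: 6000 examples, 35266624 bytes
  - Test: 6000 examples, 35621361 bytes
- Dataset size
  - Download size: 0 bytes
  - Dataset size: 340572135 bytes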
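
The split figures for the `default` configuration can be cross-checked with simple arithmetic. The sketch below only re-adds the numbers listed above; the variable names are illustrative.

```python
# Split statistics for the `default` config, copied from the card.
DEFAULT_SPLITS = {
    "train": {"num_examples": 45000, "num_bytes": 269684150},
    "validation": {"num_examples": 6000, "num_bytes": 35266624},
    "test": {"num_examples": 6000, "num_bytes": 35621361},
}
DECLARED_DATASET_SIZE = 340572135

total_examples = sum(s["num_examples"] for s in DEFAULT_SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in DEFAULT_SPLITS.values())

print(total_examples)                        # 57000
print(total_bytes == DECLARED_DATASET_SIZE)  # True
```

The 57000 total is consistent with the "57k" in the dataset name, and the per-split byte counts add up exactly to the declared dataset size.
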
Embedding configuration (`embedding_all-MiniLM-L12-v2`):
- Data file paths
  - Train: `embedding_all-MiniLM-L12-v2/train-*`
  - Validation: `embedding_all-MiniLM-L12-v2/validation-*`
  - Test: `embedding_all-MiniLM-L12-v2/test-*`
- Features
  - `uid`: 64-bit integer
  - `embedding_all-MiniLM-L12-v2`: sequence of 32-bit floats
- Splits
  - Train: 45000 examples, 69660000 bytes
  - Validation: 6000 examples, 9288000 bytes
  - Test: 6000 examples, 9288000 bytes
- Dataset size
  - Download size: 123441408 bytes
  - Dataset size: 88236000 bytes
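
Each embedding configuration carries only a `uid` and a vector, so recovering the document for a given vector presumably means matching on `uid` against the `default` configuration. The card does not state this explicitly; treating `uid` as the join key is an assumption. A minimal sketch with made-up toy rows, where `join_on_uid` is a hypothetical helper:

```python
def join_on_uid(documents, embeddings, embedding_column):
    """Attach each document's vector by matching on `uid`.

    Documents without a matching embedding row are skipped.
    """
    vectors_by_uid = {row["uid"]: row[embedding_column] for row in embeddings}
    joined = []
    for doc in documents:
        vector = vectors_by_uid.get(doc["uid"])
        if vector is not None:
            joined.append({**doc, embedding_column: vector})
    return joined


# Toy rows for illustration only; real rows have more fields and longer vectors.
docs = [
    {"uid": 0, "title": "toy title A"},
    {"uid": 1, "title": "toy title B"},
]
embs = [{"uid": 1, "embedding_all-MiniLM-L12-v2": [0.1, 0.2]}]

rows = join_on_uid(docs, embs, "embedding_all-MiniLM-L12-v2")
```

Building the lookup dictionary once keeps the join linear in the number of rows, which matters at 45000 training examples.
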
Embedding configuration (`embedding_all-mpnet-base-v2`):
- Data file paths
  - Train: `embedding_all-mpnet-base-v2/train-*`
  - Validation: `embedding_all-mpnet-base-v2/validation-*`
  - Test: `embedding_all-mpnet-base-v2/test-*`
- Features
  - `uid`: 64-bit integer
  - `embedding_all-mpnet-base-v2`: sequence of 32-bit floats
- Splits
  - Train: 45000 examples, 138780000 bytes
  - Validation: 6000 examples, 18504000 bytes
  - Test: 6000 examples, 18504000 bytes
- Dataset size
  - Download size: 211031101 bytes
  - Dataset size: 175788000 bytes
Embedding configuration (`embedding_multi-qa-mpnet-base-dot-v1`):
- Data file paths
  - Train: `embedding_multi-qa-mpnet-base-dot-v1/train-*`
  - Validation: `embedding_multi-qa-mpnet-base-dot-v1/validation-*`
  - Test: `embedding_multi-qa-mpnet-base-dot-v1/test-*`
- Features
  - `uid`: 64-bit integer
  - `embedding_multi-qa-mpnet-base-dot-v1`: sequence of 32-bit floats
- Splits
  - Train: 45000 examples, 138780000 bytes
  - Validation: 6000 examples, 18504000 bytes
  - Test: 6000 examples, 18504000 bytes
- Dataset size
  - Download size: 211029593 bytes
  - Dataset size: 175788000 bytes
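
The byte counts of the three embedding configurations hint at the vector width, which the card does not state. As a back-of-the-envelope sketch, assume each row stores an 8-byte `uid`, a 4-byte list offset, and 4 bytes per `float32` value. The 4-byte offset is an assumption about how the sizes were measured, not something the card documents.

```python
UID_BYTES = 8      # int64 uid
OFFSET_BYTES = 4   # assumed per-row list offset (not stated in the card)
FLOAT32_BYTES = 4

# (num_bytes, num_examples) of the train split, copied from the card.
TRAIN_SPLITS = {
    "embedding_all-MiniLM-L12-v2": (69660000, 45000),
    "embedding_all-mpnet-base-v2": (138780000, 45000),
    "embedding_multi-qa-mpnet-base-dot-v1": (138780000, 45000),
}


def implied_dimension(num_bytes, num_examples):
    """Vector length implied by the byte count under the assumptions above."""
    bytes_per_row, remainder = divmod(num_bytes, num_examples)
    if remainder:
        raise ValueError("rows are not of uniform size")
    payload = bytes_per_row - UID_BYTES - OFFSET_BYTES
    dimension, leftover = divmod(payload, FLOAT32_BYTES)
    if leftover:
        raise ValueError("payload is not a whole number of float32 values")
    return dimension


dimensions = {name: implied_dimension(*stats) for name, stats in TRAIN_SPLITS.items()}
```

Under these assumptions the result is 384 for the MiniLM configuration and 768 for both MPNet configurations, which agrees with the output sizes documented for those sentence-transformers models.
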
EuroVoc concepts configuration (`eurovoc_concepts`):
- Data file paths
  - Train: `eurovoc_concepts/train-*`
- Features
  - `concept_id`: string
  - `title`: string
- Splits
  - Train: 7201 examples, 205049 bytes
- Dataset size
  - Download size: 157326 bytes
  - Dataset size: 205049 bytes
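
The `eurovoc_concepts` configuration looks like a lookup table for the `eurovoc_concepts` label column of the `default` configuration. The card does not confirm that the labels there are `concept_id` values, so the sketch below treats that as an assumption. The ids and titles are invented placeholders and `titles_for` is a hypothetical helper.

```python
def titles_for(concept_ids, concept_rows):
    """Map concept ids to titles, leaving unknown ids unchanged."""
    title_by_id = {row["concept_id"]: row["title"] for row in concept_rows}
    return [title_by_id.get(cid, cid) for cid in concept_ids]


# Placeholder rows shaped like the `eurovoc_concepts` config (values invented).
concept_rows = [
    {"concept_id": "c1", "title": "toy concept one"},
    {"concept_id": "c2", "title": "toy concept two"},
]

# Placeholder label list shaped like one document's `eurovoc_concepts` field.
labels = titles_for(["c2", "c9"], concept_rows)
```

Falling back to the raw id for unknown concepts keeps the output aligned with the input list, so a missing lookup entry stays visible rather than being dropped.
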



