jealk/dk_retrieval_benchmark
Source: Hugging Face. Updated 2024-02-11. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jealk/dk_retrieval_benchmark
Resource description:
---
language:
- da
size_categories:
- 10K<n<100K
pretty_name: Retsinformation DK Retrieval Benchmark
dataset_info:
- config_name: generated_questions
  features:
  - name: title_vejledning
    dtype: string
  - name: chunk_text
    dtype: string
  - name: url
    dtype: string
  - name: generated_question
    dtype: string
  splits:
  - name: train
    num_bytes: 263556
    num_examples: 200
  download_size: 48578
  dataset_size: 263556
- config_name: retsinformation
  features:
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: html_content
    dtype: string
  - name: text_content
    dtype: string
  splits:
  - name: train
    num_bytes: 62646653
    num_examples: 433
  download_size: 20333540
  dataset_size: 62646653
configs:
- config_name: generated_questions
  data_files:
  - split: train
    path: generated_questions/train-*
- config_name: retsinformation
  data_files:
  - split: train
    path: retsinformation/train-*
---
# Retsinformation retrieval benchmark
This repository holds the datasets used to build a question-and-chunk dataset from the guides (vejledninger) published on retsinformation.dk, intended for use as a retrieval benchmark.
vejledninger_tekst.csv contains a dict of all vejledninger from retsinformation.dk (scraped 8/11/23).
chunks_id_text.csv contains text chunks of at most 512 tokens each, produced by splitting all the text in vejledninger_tekst.csv, along with a unique id for each chunk.
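The chunking step can be illustrated with a minimal sketch. The tokenizer and the id scheme behind chunks_id_text.csv are not stated here, so the whitespace tokenization, the running integer ids, and the function names `chunk_text` and `build_chunk_rows` below are all assumptions made for illustration.

```python
def chunk_text(text, max_tokens=512):
    """Split text into consecutive chunks of at most max_tokens tokens.

    Assumption: tokens are whitespace-separated words. The original
    dataset may have used a model tokenizer instead.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def build_chunk_rows(documents, max_tokens=512):
    """Return (id, chunk) rows, numbering chunks across all documents.

    Assumption: the unique id is a running integer starting at 0.
    """
    rows = []
    for doc in documents:
        for chunk in chunk_text(doc, max_tokens):
            rows.append((len(rows), chunk))
    return rows
```

With a real tokenizer, only `chunk_text` would need to change; the id assignment stays the same.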
chunks_questions_100_samples.csv contains a sample of 200 auto-generated questions based on the first 100 text chunks in chunks_id_text.csv, each paired with its matching text chunk.
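One way to use the question and chunk pairs as a retrieval benchmark is to rank all chunks for each question and check whether the matching chunk lands in the top k. The sketch below uses a plain token-overlap scorer as a stand-in retriever; that scorer, the recall@k metric, and the function names are assumptions for illustration, not the evaluation protocol used by the dataset author.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def overlap_score(question, chunk):
    """Count tokens shared by question and chunk, respecting multiplicity."""
    shared = Counter(tokenize(question)) & Counter(tokenize(chunk))
    return sum(shared.values())


def recall_at_k(pairs, corpus, k=5):
    """Fraction of questions whose gold chunk is among the top-k ranked chunks.

    pairs:  list of (question, gold_chunk) tuples
    corpus: list of all candidate chunk strings
    """
    if not pairs:
        return 0.0
    hits = 0
    for question, gold in pairs:
        ranked = sorted(
            corpus,
            key=lambda chunk: overlap_score(question, chunk),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(pairs)
```

To evaluate an embedding model instead, `overlap_score` can be swapped for a similarity between the question embedding and the chunk embedding, leaving `recall_at_k` unchanged.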
Provided by:
jealk
Original information summary
Retsinformation DK Retrieval Benchmark dataset overview
Dataset information
Configuration generated_questions
- Features:
  - title_vejledning: string
  - chunk_text: string
  - url: string
  - generated_question: string
- Splits:
  - train:
    - Bytes: 263556
    - Examples: 200
- Download size: 48578 bytes
- Dataset size: 263556 bytes
Configuration retsinformation
- Features:
  - url: string
  - title: string
  - html_content: string
  - text_content: string
- Splits:
  - train:
    - Bytes: 62646653
    - Examples: 433
- Download size: 20333540 bytes
- Dataset size: 62646653 bytes
Configuration information
- generated_questions:
  - Data file path (train split): generated_questions/train-*
- retsinformation:
  - Data file path (train split): retsinformation/train-*
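As a rough sketch, the two configurations listed above could be loaded with the Hugging Face `datasets` library as shown below. The helper name `load_config` is hypothetical, and the example assumes the `datasets` package is installed and the repository is reachable from your environment.

```python
# Configuration names taken from the dataset card metadata.
CONFIG_NAMES = ("generated_questions", "retsinformation")


def load_config(config_name, repo_id="jealk/dk_retrieval_benchmark"):
    """Load the train split of one configuration (hypothetical helper)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of {CONFIG_NAMES}"
        )
    # Imported lazily so the module can be imported without the
    # optional `datasets` dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split="train")
```

For example, `load_config("generated_questions")` should return the 200 question rows with the columns `title_vejledning`, `chunk_text`, `url`, and `generated_question`, while `load_config("retsinformation")` should return the 433 scraped documents.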



