jealk/dk_retrieval_benchmark

Name: jealk/dk_retrieval_benchmark
Creator: jealk
Published: 2024-02-11 17:55:13
License: No description available

Hugging Face · Updated 2024-02-11 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/jealk/dk_retrieval_benchmark


Resource description:

---
language:
- da
size_categories:
- 10K<n<100K
pretty_name: Retsinformation DK Retrieval Benchmark
dataset_info:
- config_name: generated_questions
  features:
  - name: title_vejledning
    dtype: string
  - name: chunk_text
    dtype: string
  - name: url
    dtype: string
  - name: generated_question
    dtype: string
  splits:
  - name: train
    num_bytes: 263556
    num_examples: 200
  download_size: 48578
  dataset_size: 263556
- config_name: retsinformation
  features:
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: html_content
    dtype: string
  - name: text_content
    dtype: string
  splits:
  - name: train
    num_bytes: 62646653
    num_examples: 433
  download_size: 20333540
  dataset_size: 62646653
configs:
- config_name: generated_questions
  data_files:
  - split: train
    path: generated_questions/train-*
- config_name: retsinformation
  data_files:
  - split: train
    path: retsinformation/train-*
---

# Retsinformation retrieval benchmark

Datasets related to generating a question-and-chunk dataset based on guides (vejledninger) from retsinformation.dk, to be used as a retrieval benchmark.

- vejledninger_tekst.csv contains a dict with all vejledninger (scraped 8/11/23) from retsinformation.dk.
- chunks_id_text.csv contains text chunks of at most 512 tokens each, produced by splitting all the text from vejledninger_tekst.csv, along with a unique id for each chunk.
- chunks_questions_100_samples.csv contains a sample of 200 auto-generated questions, based on the first 100 text chunks from chunks_id_text.csv, along with the matching text chunk.
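The card says the guide text was split into chunks of at most 512 tokens, each with a unique id, but it does not say which tokenizer or splitting strategy was used. The following is a minimal sketch of that step, assuming whitespace-separated words as tokens and consecutive integers as ids. The function name `chunk_text` and the `id`/`text` keys are illustrative, not taken from the dataset.

```python
from typing import Dict, List


def chunk_text(text: str, max_tokens: int = 512) -> List[Dict[str, object]]:
    """Split text into consecutive chunks of at most max_tokens tokens.

    Assumption: a token is a whitespace-separated word. The dataset card
    does not name the tokenizer that was actually used.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    chunks: List[Dict[str, object]] = []
    for start in range(0, len(tokens), max_tokens):
        piece = tokens[start:start + max_tokens]
        # Assumption: ids are consecutive integers in document order.
        chunks.append({"id": len(chunks), "text": " ".join(piece)})
    return chunks


# 1200 placeholder words split into chunks of at most 512 words each.
example_chunks = chunk_text("ord " * 1200, max_tokens=512)
chunk_lengths = [len(c["text"].split()) for c in example_chunks]
print(chunk_lengths)
```

A subword tokenizer would give different chunk boundaries, so chunks produced this way should not be expected to match chunks_id_text.csv exactly.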

Provider:

jealk

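Summary of original information

Retsinformation DK Retrieval Benchmark dataset overview

Dataset information

Configuration `generated_questions`

Features:
- title_vejledning: string
- chunk_text: string
- url: string
- generated_question: string
Splits:
- train:
  - Size in bytes: 263556
  - Number of examples: 200
Download size: 48578 bytes
Dataset size: 263556 bytes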
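Because each row in `generated_questions` pairs a `generated_question` with the `chunk_text` it was generated from, one plausible way to use the configuration as a retrieval benchmark is to measure recall@k: the share of questions for which a retriever returns the source chunk among its top k results. The sketch below is not the author's evaluation code. The helper names, the toy word-overlap retriever, and the two example rows are all made up for illustration.

```python
from typing import Callable, Dict, List, Sequence

SearchFn = Callable[[str, int], Sequence[str]]


def recall_at_k(rows: Sequence[Dict[str, str]], search: SearchFn, k: int = 5) -> float:
    """Fraction of questions whose source chunk is among the top-k results."""
    if not rows:
        return 0.0
    hits = sum(
        1 for row in rows
        if row["chunk_text"] in search(row["generated_question"], k)
    )
    return hits / len(rows)


def make_overlap_search(corpus: List[str]) -> SearchFn:
    """Toy retriever (an assumption): rank chunks by shared lowercase words."""
    def search(query: str, k: int) -> List[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            corpus,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return search


# Made-up rows using the column names of the generated_questions config.
rows = [
    {"generated_question": "hvad er en vejledning",
     "chunk_text": "en vejledning er en tekst"},
    {"generated_question": "hvornår gælder loven",
     "chunk_text": "loven gælder fra januar"},
]
corpus = [row["chunk_text"] for row in rows]
score = recall_at_k(rows, make_overlap_search(corpus), k=1)
print(score)
```

In practice the `search` callable would wrap the retriever under test, such as an embedding index, and the corpus would be the full set of chunks rather than only those that have questions.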

Configuration `retsinformation`

Features:
- url: string
- title: string
- html_content: string
- text_content: string
Splits:
- train:
  - Size in bytes: 62646653
  - Number of examples: 433
Download size: 20333540 bytes
Dataset size: 62646653 bytes

Configuration details

Configuration generated_questions:
- Data file path: generated_questions/train-*
Configuration retsinformation:
- Data file path: retsinformation/train-*
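The two configurations listed above can be selected by name when loading the dataset. The following is a hedged sketch that assumes the Hugging Face `datasets` library is installed and the repository is reachable. `load_config` and `missing_columns` are hypothetical helpers, and the expected column lists are copied from the feature lists on this page.

```python
from typing import Dict, List, Sequence

REPO_ID = "jealk/dk_retrieval_benchmark"

# Column names as listed for each configuration on this page.
EXPECTED_COLUMNS: Dict[str, List[str]] = {
    "generated_questions": [
        "title_vejledning", "chunk_text", "url", "generated_question",
    ],
    "retsinformation": ["url", "title", "html_content", "text_content"],
}


def load_config(config_name: str, split: str = "train"):
    """Load one configuration of the dataset (requires network access)."""
    if config_name not in EXPECTED_COLUMNS:
        raise ValueError(f"unknown configuration: {config_name!r}")
    # Imported lazily so the helpers below work without the library installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config_name, split=split)


def missing_columns(config_name: str, column_names: Sequence[str]) -> List[str]:
    """Return the expected columns that are absent from column_names."""
    return [c for c in EXPECTED_COLUMNS[config_name] if c not in column_names]
```

For example, `ds = load_config("generated_questions")` followed by `missing_columns("generated_questions", ds.column_names)` should return an empty list if the hosted data still matches the schema described here.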
