sorenmulli/citizenship-test-da
Source: Hugging Face. Updated 2024-01-15; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sorenmulli/citizenship-test-da
Resource description:
---
dataset_info:
- config_name: default
  features:
  - name: question
    dtype: string
  - name: index
    dtype: int64
  - name: option-A
    dtype: string
  - name: option-B
    dtype: string
  - name: option-C
    dtype: string
  - name: correct
    dtype: string
  - name: origin
    dtype: string
  splits:
  - name: train
    num_bytes: 103251.0
    num_examples: 605
  download_size: 43667
  dataset_size: 103251.0
- config_name: raw
  features:
  - name: question
    dtype: string
  - name: index
    dtype: int64
  - name: option-A
    dtype: string
  - name: option-B
    dtype: string
  - name: option-C
    dtype: string
  - name: correct
    dtype: string
  - name: origin
    dtype: string
  splits:
  - name: train
    num_bytes: 103906
    num_examples: 605
  download_size: 45297
  dataset_size: 103906
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: raw
  data_files:
  - split: train
    path: raw/train-*
---
# [WIP] Dataset Card for "citizenship-test-da"
*Please note that both this dataset and its dataset card are works in progress. For now, refer to the related [thesis](https://sorenmulli.github.io/thesis/thesis.pdf) for all details.*
This dataset contains scraped questions and answers from Danish citizenship tests (Danish: *indfødsretsprøver* and *medborgerskabsprøver*) from June 2019 to May 2023, taken from PDFs produced by "Styrelsen for International Rekruttering og Integration" (SIRI).
The dataset is released as an appendix to the thesis ["Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish"](https://sorenmulli.github.io/thesis/thesis.pdf), with permission from SIRI for this specific purpose.
The PDFs are available on [SIRI's website](https://siri.dk/nyheder/?categorizations=9115).
The `default` configuration has been semi-automatically cleaned to remove PDF artifacts using the [Alvenir 3gram DSL language model](https://github.com/danspeech/danspeech/releases/tag/v0.02-alpha).
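The two configurations can be loaded separately to see what the cleaning changed. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The helper names `load_config` and `changed_fields` are hypothetical, and pairing a cleaned record with its raw counterpart (by position or by `index`) is an assumption that the card does not confirm.

```python
def load_config(config_name: str = "default"):
    """Load one configuration ("default" or "raw") from the Hugging Face Hub."""
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # Both configurations expose a single "train" split.
    return load_dataset("sorenmulli/citizenship-test-da", config_name, split="train")


def changed_fields(cleaned: dict, raw: dict,
                   fields=("question", "option-A", "option-B", "option-C")) -> list:
    """Return the names of the text fields that differ between two records."""
    # Assumption: the two records describe the same question in both configs.
    return [name for name in fields if cleaned.get(name) != raw.get(name)]
```

Calling `load_config("raw")` returns the uncleaned text, and `changed_fields(cleaned_row, raw_row)` lists the fields in which the two versions of a row differ.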
The examples were not deduplicated.
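Because duplicates remain in the data, users who need unique questions may want to filter them first. The sketch below is one possible approach and not the author's method. The helper name `deduplicate` is hypothetical, and treating two records as duplicates when their question and three options match after whitespace and case normalisation is an assumption.

```python
def deduplicate(records, key_fields=("question", "option-A", "option-B", "option-C")):
    """Keep the first occurrence of each question/options combination."""
    seen = set()
    unique = []
    for record in records:
        # Collapse whitespace and lowercase so trivially different copies collide.
        key = tuple(" ".join(str(record[name]).split()).lower() for name in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

The function accepts any iterable of dict-like records, so it works on a loaded `datasets` split as well as on a plain list of dictionaries.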
Provider:
sorenmulli
Summary of original information
Dataset overview
Dataset configurations
- default
  - Features:
    - question: string
    - index: int64
    - option-A: string
    - option-B: string
    - option-C: string
    - correct: string
    - origin: string
  - Splits:
    - train
      - Size in bytes: 103251.0
      - Number of examples: 605
  - Download size: 43667
  - Dataset size: 103251.0
- raw
  - Features:
    - question: string
    - index: int64
    - option-A: string
    - option-B: string
    - option-C: string
    - correct: string
    - origin: string
  - Splits:
    - train
      - Size in bytes: 103906
      - Number of examples: 605
  - Download size: 45297
  - Dataset size: 103906
Data files
- default
  - Splits:
    - train: data/train-*
- raw
  - Splits:
    - train: raw/train-*
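
Given the feature schema listed above, a record can be rendered as a three-option multiple-choice prompt. This is a minimal sketch: the helper names `format_prompt` and `correct_letter` are hypothetical, and the encoding of the `correct` field is an assumption, since the card only says it is a string. The code therefore accepts either an option letter or the full text of the correct option.

```python
OPTION_FIELDS = {"A": "option-A", "B": "option-B", "C": "option-C"}


def format_prompt(record: dict) -> str:
    """Render the question followed by its three lettered options."""
    lines = [record["question"]]
    for letter, field in OPTION_FIELDS.items():
        lines.append(f"{letter}. {record[field]}")
    return "\n".join(lines)


def correct_letter(record: dict):
    """Return 'A', 'B' or 'C' for the correct option, or None if undetermined."""
    answer = str(record["correct"]).strip()
    # Assumption 1: `correct` holds the option letter itself.
    if answer.upper() in OPTION_FIELDS:
        return answer.upper()
    # Assumption 2: `correct` holds the text of the correct option.
    for letter, field in OPTION_FIELDS.items():
        if str(record[field]).strip() == answer:
            return letter
    return None
```

Returning `None` instead of raising keeps a malformed row from stopping a full pass over the 605 examples; callers can count or inspect such rows separately.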



