TiLt-HS
Indexed by NIAID Data Ecosystem, 2026-05-02
Download link:
https://zenodo.org/record/13625625
Description:
TiLt-HS (Tests in Lithuanian, High School) is a dataset of multiple-choice question tests that are used to assess the knowledge of high school students in several academic areas.
Dataset Details
The dataset was collected in August 2024.
Dataset Description
Point of contact: Jekaterina Novikova
Language: Lithuanian (lt)
License: CC-BY-NC-SA-4.0
Uses
Direct Use
The dataset is intended to be used as a subset of training data for the development of multilingual language models.
It can be used together with the TiLt-Pro dataset, as both follow the exact same structure.
How to Use the Dataset
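Hugging Face Datasets:
# Load the dataset from the Hugging Face Hub; it has a single "train" split.
from datasets import load_dataset
ds = load_dataset("Jekaterina/tilt-hs")
Pandas:
# Read one topic file straight from the Hub (the hf:// protocol needs huggingface_hub installed).
import pandas as pd
df = pd.read_json("hf://datasets/Jekaterina/tilt-hs/high_school_economics_lt.json")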
Dataset Structure
Data Instances
A typical data point starts with metadata: the language of the test, the country of origin, the source and file name, the number of the question within its test, and the license of the test. The main content consists of the question, the suggested answer options, and the correct answer. Each record also describes the test itself: its level and its category, the latter in both the original language and English.
An example from the TiLt-HS dataset looks as follows:
{
  "language": "lt",
  "country": "Lithuania",
  "file_name": "Elektroninė-versija-Ekonomikos-ir-finansu-uzduotys-su-atsakymais.pdf",
  "source": "https://www.nmakademija.lt/wp-content/uploads/2020/08/",
  "license": "no license",
  "level": "high school",
  "category_en": "microeconomics",
  "category_original_lang": "mikroekonomika",
  "original_question_num": 1,
  "question": "Nuperkamų prekių kiekio priklausomybė nuo kainos vadinama",
  "options": [
    "pasiūlos funkcija",
    "paklausos funkcija",
    "rinkos pusiausvyra",
    "sąnaudų padengimo tašku"
  ],
  "answer": "2"
}
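In the example above, the answer "2" points at the second entry of options. A minimal sketch of resolving the answer to its option text follows. The helper name resolve_answer is hypothetical and not part of any dataset tooling, and the sketch assumes that answer holds a single 1-based index.

```python
def resolve_answer(record: dict) -> str:
    """Return the text of the correct option for a multiple-choice record.

    Assumption: `answer` holds a single 1-based index as a string, e.g. "2".
    """
    index = int(record["answer"]) - 1  # 1-based string -> 0-based integer
    options = record["options"]
    if not 0 <= index < len(options):
        raise ValueError(f"answer {record['answer']!r} is out of range")
    return options[index]


# Abbreviated copy of the example record shown above.
example = {
    "question": "Nuperkamų prekių kiekio priklausomybė nuo kainos vadinama",
    "options": [
        "pasiūlos funkcija",
        "paklausos funkcija",
        "rinkos pusiausvyra",
        "sąnaudų padengimo tašku",
    ],
    "answer": "2",
}

print(resolve_answer(example))  # paklausos funkcija
```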
Data Fields
language: The language of the sample.
country: The country of the sample.
source: The ID or URL of the source.
file_name: The name of the source file.
license: License of the source.
level: The academic level of the tested knowledge, e.g. Middle School, High School, University Entrance, University, or Professional.
category_en: The low-level category according to the source, in English, i.e. the exam name.
category_original_lang: The same low-level category as category_en, in the original language.
original_question_num: The number of the question within its source test.
question: The text of the multiple-choice question, or the statement of the True-False question.
options: A list of the answer option texts. For a True-False task, it is left blank.
answer: A string giving the number of the correct choice(s) (1, 2, 3, ...), counted from 1 in the order of the options list. For a True-False task, it is "1" for true and "0" for false.
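The field rules above can be turned into a simple consistency check. The sketch below is illustrative only: the names REQUIRED_FIELDS and check_record are hypothetical, and it assumes a multiple-choice answer is a single 1-based index, because the card does not specify how several correct choices would be encoded.

```python
REQUIRED_FIELDS = (
    "language", "country", "source", "file_name", "license", "level",
    "category_en", "category_original_lang", "original_question_num",
    "question", "options", "answer",
)


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    if problems:
        return problems

    options = record["options"]
    answer = str(record["answer"])
    if options:
        # Multiple choice: assume one 1-based index into `options`.
        if not answer.isdecimal() or not 1 <= int(answer) <= len(options):
            problems.append(
                f"answer {answer!r} does not match {len(options)} options"
            )
    elif answer not in ("0", "1"):
        # True-False: options are blank, answer is "1" (true) or "0" (false).
        problems.append(f"true-false answer must be '0' or '1', got {answer!r}")
    return problems
```

An empty list means the record passed every check. Records with several correct choices would need an extra rule once their encoding is known.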
Data Splits
The dataset is not split and has only a train subset.
The dataset contains tests for three academic topics, with the following number of test questions in each:
Topic                                  Train
microeconomics                         71
macroeconomics                         25
finances and bookkeeping, year 2018    46
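The per-topic counts listed above can be totalled and compared against a loaded copy of the data. This is a rough sketch: count_by_category is a hypothetical helper, and it assumes the topic names above are exactly the values stored in category_en, which the card confirms only for "microeconomics".

```python
from collections import Counter

# Question counts per topic, copied from the figures listed above.
DOCUMENTED_COUNTS = {
    "microeconomics": 71,
    "macroeconomics": 25,
    "finances and bookkeeping, year 2018": 46,
}


def count_by_category(records):
    """Count records per `category_en` value."""
    return Counter(record["category_en"] for record in records)


total = sum(DOCUMENTED_COUNTS.values())
print(f"total: {total}")  # total: 142
for topic, n in DOCUMENTED_COUNTS.items():
    # Share of each topic in the whole dataset.
    print(f"{topic}: {n} ({n / total:.1%})")
```

With the records loaded as a list of dictionaries, `dict(count_by_category(records)) == DOCUMENTED_COUNTS` would confirm that a local copy matches the documented sizes, provided the naming assumption holds.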
Dataset Creation
Curation Rationale
This dataset was collected as a part of the AYA Expedition initiative, for the Global Exams project.
Source Data
The original tests for all topics were downloaded from the website of the National Academy of Students. The source and file_name fields of each record identify the file that the question was taken from.
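One plausible way to turn the two fields into a single link is to append file_name to source. This is an assumption, not something the card states: source may also hold an ID rather than a URL, and the resulting address is not verified to resolve. The helper name source_url is hypothetical.

```python
from urllib.parse import quote, urljoin


def source_url(record: dict) -> str:
    """Join `source` and `file_name` into one URL.

    Assumption: `source` is a directory URL and `file_name` is a file in it.
    """
    base = record["source"]
    if not base.endswith("/"):
        base += "/"  # without the slash, urljoin would replace the last path segment
    # Percent-encode non-ASCII characters and spaces in the file name.
    return urljoin(base, quote(record["file_name"]))


# Field values copied from the example record in this card.
example = {
    "source": "https://www.nmakademija.lt/wp-content/uploads/2020/08/",
    "file_name": "Elektroninė-versija-Ekonomikos-ir-finansu-uzduotys-su-atsakymais.pdf",
}

print(source_url(example))
```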
Data Collection and Processing
Tests were downloaded during the period of August 25-27, 2024.
Only questions with multiple-choice answers were selected from the tests. Questions that depend on images or figures were filtered out, as were questions that require reading a passage before answering. Only a subset of the available professional tests is included in the current version of the dataset.
Annotations
The dataset does not contain any additional annotations.
Personal and Sensitive Information
No personal, sensitive, or private information is included in the dataset.
Bias, Risks, and Limitations
This dataset covers only a limited number of academic topics. Some of the knowledge presented in the tests may change over time, especially in the area of finance and bookkeeping, so it is important to pay attention to the date of data collection. The dataset consists only of Lithuanian academic tests and does not necessarily generalize to other languages.
Recommendations
Users should be aware of the risks and limitations of the dataset.
Dataset Card Author
Jekaterina Novikova
Dataset Card Contact
Jekaterina Novikova
Created:
2024-09-01



