TiLt-HS

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/13625625
Resource description:
TiLt-HS (Tests in Lithuanian, High School) is a dataset of multiple-choice tests used to assess the knowledge of high school students in several academic areas.

Dataset Details

The dataset was collected in August 2024.

Dataset Description

Point of contact: Jekaterina Novikova
Language: Lithuanian (lt)
License: CC-BY-NC-SA-4.0

Uses

Direct Use

The dataset is intended to be used as a subset of training data for the development of multilingual language models. It can be used together with the TiLt-Pro dataset, as both follow exactly the same structure.

How to Use the Dataset

Hugging Face Datasets:

    from datasets import load_dataset

    # Load the dataset from the Hugging Face Hub
    ds = load_dataset("Jekaterina/tilt-hs")

Pandas:

    import pandas as pd

    # Read a single test file (hf:// paths need the huggingface_hub package installed)
    df = pd.read_json("hf://datasets/Jekaterina/tilt-hs/high_school_economics_lt.json")

Dataset Structure

Data Instances

A typical data point comprises some metadata: the language of the test, the country of origin, the source and name of the file, the number of the question within the test, and the license of the test. The main content consists of the question, the suggested answer options, and the correct answer. In addition, some information about the nature of the test is provided, including the test level and the category in both the original language and English.

An example from the TiLt-HS dataset looks as follows:

    {
      "language": "lt",
      "country": "Lithuania",
      "file_name": "Elektroninė-versija-Ekonomikos-ir-finansu-uzduotys-su-atsakymais.pdf",
      "source": "https://www.nmakademija.lt/wp-content/uploads/2020/08/",
      "license": "no license",
      "level": "high school",
      "category_en": "microeconomics",
      "category_original_lang": "mikroekonomika",
      "original_question_num": 1,
      "question": "Nuperkamų prekių kiekio priklausomybė nuo kainos vadinama",
      "options": [
        "pasiūlos funkcija",
        "paklausos funkcija",
        "rinkos pusiausvyra",
        "sąnaudų padengimo tašku"
      ],
      "answer": "2"
    }

Data Fields

language: The language of the sample.
country: The country of the sample.
source: The ID or URL of the source.
file_name: The name of the source file.
license: The license of the source.
level: The academic level of the tested knowledge, for example Middle School, High School, University Entrance, University, or Professional.
category_en: The low-level category according to the source, in English, i.e. the exam name.
category_original_lang: The same low-level category as category_en, in the original language.
original_question_num: The ID of the sample, i.e. the number of the question within its source test.
question: The text of the multiple-choice question, or the statement of the True-False question.
options: A list of the texts of the multiple-choice options. For a True-False task it is left blank.
answer: A string giving the correct choice(s) (1, 2, 3, ...), numbered by position in options starting from 1. For a True-False task it is "1" for true and "0" for false.

Data Splits

The dataset is not split and has only a train subset. It contains tests for three academic topics, with the following number of data points (test questions) in each:

    Topic                                  Train
    microeconomics                         71
    macroeconomics                         25
    finances and bookkeeping, year 2018    46

Dataset Creation

Curation Rationale

This dataset was collected as part of the AYA Expedition initiative, for the Global Exams project.

Source Data

The original tests for all the topics were downloaded from the website of the National Academy of Students. The exact link to the source of each test question is provided in the source and file_name fields of the dataset.

Data Collection and Processing

Tests were downloaded between August 25 and 27, 2024. Only questions with multiple-choice answers were selected from the tests. Questions that cannot be answered without an image or figure were filtered out, as were questions that require reading a passage before answering. Only a subset of available professional tests was included in the current version of the dataset.
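The answer encoding described under Data Fields can be illustrated with a minimal sketch. The helper name `correct_option` is hypothetical and not part of the dataset or any library. The sketch assumes a single correct choice per question; the card does not say how several correct choices would be encoded, so such a value would raise an error here.

```python
def correct_option(record):
    """Return the text of the correct answer for one TiLt-HS record.

    Hypothetical helper: assumes exactly one correct choice per record.
    """
    options = record.get("options") or []
    answer = str(record["answer"]).strip()
    if not options:
        # True-False item: options is blank, "1" means true and "0" means false.
        return {"1": "true", "0": "false"}[answer]
    # Multiple-choice item: answer is a 1-based position in options.
    return options[int(answer) - 1]


# Example record from the dataset card, reduced to the fields used here.
example = {
    "question": "Nuperkamų prekių kiekio priklausomybė nuo kainos vadinama",
    "options": [
        "pasiūlos funkcija",
        "paklausos funkcija",
        "rinkos pusiausvyra",
        "sąnaudų padengimo tašku",
    ],
    "answer": "2",
}

print(correct_option(example))  # paklausos funkcija
```

Converting the string to an integer and subtracting one is needed because Python lists are 0-indexed while the answer field counts from 1.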
Annotations

The dataset does not contain any additional annotations.

Personal and Sensitive Information

The dataset contains no personal, sensitive, or private information.

Bias, Risks, and Limitations

This dataset includes tests for only a limited number of academic topics. Some of the knowledge presented in the tests may change over time, especially in the area of finance and bookkeeping, so it is important to pay attention to the date of data collection. The dataset contains only Lithuanian academic tests and does not necessarily generalize to other languages.

Recommendations

Users should be aware of the risks and limitations of the dataset.

Dataset Card Author

Jekaterina Novikova

Dataset Card Contact

Jekaterina Novikova
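As a rough sanity check on the per-topic counts listed under Data Splits, records can be tallied by their category_en field. The sketch below uses a small hand-made sample rather than the real data, and the names `DOCUMENTED_COUNTS` and `tally_by_category` are hypothetical. It also assumes that category_en values match the topic names in the card exactly, which the card confirms only for "microeconomics".

```python
from collections import Counter

# Per-topic counts as listed under Data Splits in the dataset card.
DOCUMENTED_COUNTS = {
    "microeconomics": 71,
    "macroeconomics": 25,
    "finances and bookkeeping, year 2018": 46,
}


def tally_by_category(records):
    """Count records per category_en value."""
    return Counter(record["category_en"] for record in records)


# Hand-made sample standing in for real records (not actual dataset rows).
sample = [
    {"category_en": "microeconomics"},
    {"category_en": "microeconomics"},
    {"category_en": "macroeconomics"},
]

print(tally_by_category(sample))
print(sum(DOCUMENTED_COUNTS.values()))  # 142
```

With the full dataset loaded, comparing the tally against `DOCUMENTED_COUNTS` would show whether a download matches the version described here.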
Created:
2024-09-01