
smartcat/serbian_qa

Source: Hugging Face | Updated: 2024-10-07 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/smartcat/serbian_qa
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- sr
pretty_name: Serbian QA dataset
size_categories:
- 1K<n<10K
---

# Dataset Card for "serbian_qa"

## Dataset Description

- **Repository:** [https://huggingface.co/datasets/smartcat/serbian_qa]
- **Point of Contact:** [SmartCat.io]

### Dataset Summary

The "serbian_qa" dataset is a collection of context-query pairs in Serbian. It is designed for question-answering tasks and contains contexts from various Serbian language sources, paired with automatically generated queries of different lengths.

### Supported Tasks and Leaderboards

- **Tasks:** Question Answering, Information Retrieval

### Languages

The dataset is in Serbian (sr).

## Dataset Structure

### Data Instances

Each instance in the dataset consists of:

- A context (text passage)
- Three queries related to the context:
  - A long query
  - A medium query
  - A short query
- Keywords for each query
- Scores for each query

### Data Fields

- `context`: string
- `long_query`: string
- `medium_query`: string
- `short_query`: string
- `long_query_keywords`: list of strings
- `medium_query_keywords`: list of strings
- `short_query_keywords`: list of strings
- `long_query_score`: float
- `medium_query_score`: float
- `short_query_score`: float

## Dataset Creation

### Curation Rationale

This dataset was created to provide a resource for Serbian language question-answering tasks, utilizing diverse Serbian language sources.

### Source Data

#### Initial Data Collection and Normalization

Contexts were obtained by applying semantic chunking to subsets of the following datasets:

1. SrpWiki: A Serbian Wikipedia dataset
   - Available at: https://huggingface.co/datasets/jerteh/SrpWiki
2. SrpKorNews: A Serbian news dataset
   - Available at: https://huggingface.co/datasets/jerteh/SrpKorNews
3. SrpELTeC: A novel from this dataset was used
   - Available at: https://huggingface.co/datasets/jerteh/SrpELTeC

#### Who are the source language producers?

The source corpora were produced by the Language Technology Society, JeRTeh.

### Annotations

#### Annotation process

Queries were automatically generated using the GPT-4o model. For each context, three types of queries were generated:

1. A long query
2. A medium query
3. A short query

Additionally, keywords and scores were generated for each query.

#### Who are the annotators?

The annotations (queries, keywords, and scores) were generated automatically by the GPT-4o model.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset contributes to the development of NLP tools and research for the Serbian language, potentially improving Serbian language technology and applications.

### Discussion of Biases

As the queries were generated automatically, there may be biases inherited from the GPT-4o model. Users should be aware of potential biases in the generated questions and evaluate the dataset's suitability for their specific use cases.

### Other Known Limitations

- The dataset was initially evaluated on a sample, but not all examples in the final dataset have been manually verified.
- There might be errors in the generated queries, such as incorrect grammar or queries relating to information that does not exist in the contexts.
- The quality and relevance of the generated queries may vary.

## Using the Dataset

### Loading the Dataset

To load the dataset using the Hugging Face `datasets` library, you can use the following code:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("smartcat/serbian_qa")

# Access the data
for example in dataset['train']:  # or 'validation' or 'test', depending on your splits
    context = example['context']
    long_query = example['long_query']
    medium_query = example['medium_query']
    short_query = example['short_query']
    # ... access other fields as needed

# Print an example
print(dataset['train'][0])
```

### Data Processing

Here's a simple example of how you might process the data for a question-answering task:

```python
def preprocess_function(examples):
    # You can choose long, medium, or short queries
    questions = examples["long_query"]
    inputs = [f"question: {q} context: {c}" for q, c in zip(questions, examples["context"])]
    # With batched=True, `map` expects a dict of columns, not a bare list.
    # "input_text" is an illustrative column name; rename it as needed.
    return {"input_text": inputs}

# Apply the preprocessing to the dataset
preprocessed_dataset = dataset.map(preprocess_function, batched=True)
```

This example shows how to combine the questions and contexts, which is a common preprocessing step for question-answering models. You may need to adjust this based on your specific use case and the model you're using.

Remember to handle the data appropriately and consider any limitations mentioned in the dataset card when using the dataset.

### Dataset Curators

[SmartCat.io]
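
### Flattening Queries for Retrieval

Since the card lists Information Retrieval as a supported task and stores three queries per context, it can be convenient to flatten each row into one record per query. The following is a minimal sketch of that idea, not part of the original card. The helper `iter_query_context_pairs`, the `min_score` parameter, and the sample row are all hypothetical; the card does not document the score scale, so treating a higher score as better is an assumption to verify before filtering.

```python
from typing import Iterable, Iterator, Optional

# Query kinds follow the field names documented in the card.
QUERY_KINDS = ("long", "medium", "short")


def iter_query_context_pairs(
    rows: Iterable[dict], min_score: Optional[float] = None
) -> Iterator[dict]:
    """Yield one record per (query, context) pair.

    `min_score` is a hypothetical filter: the card does not say how scores
    are scaled, so "higher is better" is an assumption.
    """
    for row in rows:
        for kind in QUERY_KINDS:
            score = row[f"{kind}_query_score"]
            if min_score is not None and score < min_score:
                continue
            yield {
                "query": row[f"{kind}_query"],
                "query_kind": kind,
                "keywords": row[f"{kind}_query_keywords"],
                "score": score,
                "context": row["context"],
            }


# Invented row for illustration only; it is NOT taken from the dataset.
sample_row = {
    "context": "Beograd je glavni grad Srbije.",
    "long_query": "Koji grad je glavni grad Srbije?",
    "medium_query": "Glavni grad Srbije?",
    "short_query": "glavni grad Srbije",
    "long_query_keywords": ["glavni grad", "Srbija"],
    "medium_query_keywords": ["glavni grad"],
    "short_query_keywords": ["Srbija"],
    "long_query_score": 0.9,
    "medium_query_score": 0.8,
    "short_query_score": 0.4,
}

all_pairs = list(iter_query_context_pairs([sample_row]))
filtered_pairs = list(iter_query_context_pairs([sample_row], min_score=0.5))
print(len(all_pairs), len(filtered_pairs))
```

Because iterating over a `datasets` split yields plain dictionaries, the same helper should also accept a loaded split, for example `iter_query_context_pairs(dataset["train"])`, assuming a `train` split exists.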
Provider:
smartcat