
alban-labs/Kapibara

Source: Hugging Face. Updated 2024-08-12; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/alban-labs/Kapibara
Description:
---
license: apache-2.0
datasets:
- LDJnr/Capybara
language:
- en
- sq
tags:
- Physics
- Biology
- Math
- Chemistry
- Culture
- Logic
- Roleplay
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- question-answering
---

# Kapibara: Albanian Multi-turn Conversation Dataset

## Dataset Summary

Kapibara is a comprehensive Albanian language dataset designed for multi-turn conversations. It contains over 5,300 entries covering a wide range of topics including physics, biology, mathematics, chemistry, culture, and logic. The dataset is aimed at improving text generation and question-answering capabilities in the Albanian language.

## Supported Tasks

The dataset supports the following NLP tasks:

- Text Generation
- Question Answering

## Languages

The dataset primarily contains conversations in Albanian (sq), with some entries also including English (en) translations or references.

## Dataset Structure

### Data Instances

Each instance in the dataset represents a multi-turn conversation. Here's an example structure:

```json
{
  "source": "General-Instruct",
  "conversation": [
    {
      "input": "Kryeni një detyrë shkrimi krijues: Një person që zbulon se mund të flasë me kafshët.",
      "output": "Ema gjithmonë kishte ndjerë një lidhje të thellë me kafshët, por ishte një pasdite me shi kur zbuloi dhuntinë e saj të vërtetë. ..."
    },
    {
      "input": "Diskutoni implikimet psikologjike të aftësisë së Emës për të komunikuar me kafshët në jetën e saj personale dhe sociale.",
      "output": "Aftësia e re e Emës për të komunikuar me kafshët mund të ketë implikime të thella psikologjike si në jetën e saj personale ashtu edhe në atë sociale. ..."
    }
  ]
}
```

### Data Fields

- `source`: The source or category of the conversation.
- `conversation`: A list of conversation turns.
  - `input`: The input or question in the conversation.
  - `output`: The corresponding output or answer.

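The fields described above can be flattened into one row per turn, which is often convenient for inspection or fine-tuning pipelines. The following is a minimal sketch, assuming the data is stored as JSON Lines with one conversation object per line; the helper name `iter_turns` and the placeholder sample record are hypothetical and not part of the dataset.

```python
import json
from typing import Iterator


def iter_turns(jsonl_text: str) -> Iterator[dict]:
    """Yield one flat dict per conversation turn from JSON Lines text.

    Assumes each non-empty line is an object with a "source" string and a
    "conversation" list of {"input", "output"} turns, as in the card's example.
    """
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for turn_index, turn in enumerate(record["conversation"]):
            yield {
                "source": record["source"],
                "turn": turn_index,
                "input": turn["input"],
                "output": turn["output"],
            }


# Hypothetical two-turn record with placeholder text (not real dataset content).
sample = json.dumps(
    {
        "source": "General-Instruct",
        "conversation": [
            {"input": "pyetja 1", "output": "përgjigja 1"},
            {"input": "pyetja 2", "output": "përgjigja 2"},
        ],
    },
    ensure_ascii=False,
)

turns = list(iter_turns(sample))
for row in turns:
    print(row["turn"], row["input"], "->", row["output"])
```

To apply the same helper to a local copy of the dataset file, read the file with `encoding="utf-8"` and pass its text to `iter_turns`; keeping the turn index makes it possible to rebuild the original conversation order later.
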
### Data Splits

The dataset is currently provided as a single file, `rough5300entries.jsonl`, containing approximately 5,300 conversation entries.

## Dataset Creation

### Curation Rationale

The Kapibara dataset was created to address the lack of comprehensive, multi-turn conversation datasets in the Albanian language. It aims to provide a rich resource for developing and testing language models capable of understanding and generating Albanian text across various domains.

### Source Data

The conversations in this dataset were carefully curated and generated to cover a wide range of topics relevant to Albanian culture and general knowledge.

### Annotations

The dataset does not contain additional annotations beyond the conversation structure.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset aims to improve NLP capabilities in the Albanian language, potentially leading to better language technologies and applications for Albanian speakers.

### Discussion of Biases

While efforts have been made to cover a diverse range of topics, users should be aware of potential biases in the dataset, including but not limited to topic selection and language style.

### Other Known Limitations

The dataset is limited to text-based conversations and does not include other modalities such as images or audio.

## Citation

If you use this data in your work, please cite:

```bibtex
@article{daniel2024llm,
  title={MultiLLM Mix for Data Mutation and Synthesis},
  author={Nisten Tahiraj and Daniel Merja and Benjamin Shehu and Jeton Kukalaj and Amittai Groot},
  journal={arXiv preprint arXiv:(coming soon)},
  year={2024}
}
```

Provider:
alban-labs