
fhgr/sursilvan-sprachspende-2025

Hugging Face | Updated 2026-03-27 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/fhgr/sursilvan-sprachspende-2025
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- rm
tags:
- romansh
- sursilvan
- low-resource
pretty_name: Sursilvan Dataset - Sprachspende 2025
size_categories:
- n<1K
---

# Sursilvan Dataset - Sprachspende 2025

## Dataset Description

This dataset contains authentic conversational exchanges in Sursilvan, a Romansh idiom spoken in the Surselva region of Switzerland. The data was collected through a language donation survey ("Sprachspende") conducted in 2025 by FH Graubünden (FHGR), capturing natural language use across various everyday topics. Participants were native Sursilvan speakers who voluntarily contributed their language samples.

### Dataset Summary

- **Language:** Romansh Sursilvan
- **Format:** Conversational dialogues in ShareGPT format
- **Size:** 775 dialogues from 79 native speakers
- **Topics:** Daily activities, traditions, holidays, hobbies, work, social interactions
- **Purpose:** Language preservation, low-resource NLP, conversational AI

### Supported Tasks

- **Conversational AI:** Training chatbots and dialogue systems
- **Text Generation:** Fine-tuning language models for Sursilvan
- **Question Answering:** Question-response pairs in authentic Sursilvan
- **Cultural Documentation:** Preserving expressions of Sursilvan culture and daily life

## Dataset Structure

### Data Format

Each dialogue follows the standard chat format with three messages:

```json
{
  "conversations": [
    {
      "from": "system",
      "value": "Persuna:\n- vegliadetgna: 30-49 onns\n- schlattaina: masculin\n- liug: Bonaduz, dialect: Lumnezia"
    },
    {
      "from": "human",
      "value": "Co festiveschas ti Nadal? Tgei fiastas enconuschas ti?"
    },
    {
      "from": "gpt",
      "value": "Cun mia famiglia cun pigniel da Nadal e schenghetgs. Jeu enconuschel pliras fiastas."
    }
  ]
}
```

### Data Fields

#### Conversations Array

- **from**: One of `"system"`, `"human"`, or `"gpt"`
- **value**: The message text

#### System Message

Contains speaker demographics:

- **vegliadetgna** (age group): `< 18 onns`, `18-29 onns`, `30-49 onns`, `50-69 onns`, `> 70 onns`
- **schlattaina** (gender): `masculin`, `feminin`
- **liug** (location): Place of residence and region of origin

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total dialogues | 775 |
| Unique speakers | 79 |
| Unique questions | 12 |
| Avg. dialogues per speaker | 9.8 |
| Avg. question length | 75 characters |
| Avg. answer length | 154 characters |
| Answer length range | 1-1,008 characters |

### Demographic Distribution

**Age Groups:**

- 30-49 onns: 280 dialogues (36%)
- 50-69 onns: 272 dialogues (35%)
- \> 70 onns: 132 dialogues (17%)
- 18-29 onns: 79 dialogues (10%)
- < 18 onns: 12 dialogues (2%)

**Gender:**

- feminin: 511 dialogues (66%)
- masculin: 264 dialogues (34%)

## Topics Covered

The dataset includes natural responses about:

1. **Personal celebrations:** Birthdays, wishes, and gift-giving
2. **Holiday traditions:** Christmas and other festivals
3. **Daily routines:** Morning, afternoon, evening activities, weekend plans
4. **Seasonal activities:** Spring, summer, autumn, winter
5. **Hobbies and leisure:** Sports, music, free time activities
6. **Work and professions:** Career interests and preferences
7. **Social interactions:** Casual conversations and conflicts
8. **Local life:** Descriptions of villages, towns, and favorite places

## Licensing

This dataset is released under **CC-BY-4.0** (Creative Commons Attribution 4.0 International).

## Additional Information

### Contact

For more information on this project and the team, please visit the project website: fhgr.ch/idiomvoice

### Acknowledgments

Special thanks to all participants who generously donated their language samples to support Sursilvan language preservation and research.

This project was financially supported by the **Amt für Kultur Graubünden, Sprachenförderung**. We are grateful for their commitment to preserving and promoting the Romansh language.
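
### Example: Reading a Record

The record layout described under Data Format can be unpacked with a few lines of Python. The following is a minimal sketch, not part of the original card: it uses the sample record shown there, and the helper names `parse_persona` and `to_messages`, as well as the `human` to `user` and `gpt` to `assistant` role mapping, are assumptions chosen for illustration. It also assumes every demographic line in the system message has the form `- key: value`, as in the sample.

```python
# Sample record copied from the "Data Format" section of the card.
RECORD = {
    "conversations": [
        {
            "from": "system",
            "value": "Persuna:\n- vegliadetgna: 30-49 onns\n"
                     "- schlattaina: masculin\n"
                     "- liug: Bonaduz, dialect: Lumnezia",
        },
        {
            "from": "human",
            "value": "Co festiveschas ti Nadal? Tgei fiastas enconuschas ti?",
        },
        {
            "from": "gpt",
            "value": "Cun mia famiglia cun pigniel da Nadal e schenghetgs. "
                     "Jeu enconuschel pliras fiastas.",
        },
    ]
}

# Assumed mapping from ShareGPT speaker tags to common chat-role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def parse_persona(system_text):
    """Collect the '- key: value' lines of a system message into a dict.

    Lines that do not start with '- ' (such as the 'Persuna:' header)
    are skipped. Only the first ': ' splits key from value, so the
    'liug' entry keeps its full text, including the dialect part.
    """
    persona = {}
    for line in system_text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue
        key, _, value = line[2:].partition(": ")
        persona[key] = value
    return persona


def to_messages(record):
    """Convert one ShareGPT-style record into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]


if __name__ == "__main__":
    persona = parse_persona(RECORD["conversations"][0]["value"])
    print(persona)
    for message in to_messages(RECORD):
        print(message["role"], "->", message["content"][:40])
```

Keeping the `liug` value as a single string is a deliberate choice here: the card does not specify how place of residence and region of origin are delimited beyond the one sample, so any further splitting would be guesswork.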

Provider:
fhgr