
strombergnlp/bornholmsk

Hugging Face | Updated 2022-10-25 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/strombergnlp/bornholmsk
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- da
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
language_bcp47:
- da
- da-bornholm
---

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://github.com/StrombergNLP/bornholmsk
- **Repository:** https://github.com/StrombergNLP/bornholmsk
- **Paper:** https://aclanthology.org/W19-6138/
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Leon Derczynski](https://github.com/leondz)

### Dataset Summary

This corpus introduces language processing resources and tools for Bornholmsk, a language spoken on the island of Bornholm, with roots in Danish and closely related to Scanian.

Sammenfattnijng på borrijnholmst: Dæjnna artikkelijn introduserer natursprågsresurser å varktoi for borrijnholmst, ed språg a dær snakkes på ön Borrijnholm me rødder i danst å i nær familia me skånst.

For more details, see the paper [Bornholmsk Natural Language Processing: Resources and Tools](https://aclanthology.org/W19-6138/).

### Supported Tasks and Leaderboards

* Language modeling (`text-generation` / `language-modeling`)

### Languages

Bornholmsk, a language variant of Danish spoken on the island of Bornholm. bcp47: `da-bornholm`

## Dataset Structure

### Data Instances

13,169 lines, 175,167 words, 801 KB

### Data Fields

- `id`: the sentence ID, `int`
- `text`: the Bornholmsk text, `string`

### Data Splits

Monolithic

## Dataset Creation

### Curation Rationale

To gather as much digital Bornholmsk together as possible.

### Source Data

#### Initial Data Collection and Normalization

From many places; see the paper for details. Sources include poems, songs, translations from Danish, folk stories, and dictionary entries.

#### Who are the source language producers?

Native speakers of Bornholmsk who have produced works in their native language, or translated them to Danish. Much of the data is the result of a community of Bornholmsk speakers volunteering their time across the island in an effort to capture this endangered language.

### Annotations

#### Annotation process

No annotations

#### Who are the annotators?

No annotations

### Personal and Sensitive Information

Unknown, but low risk of presence, given the source material.

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to capture Bornholmsk digitally and provide a way for NLP systems to interact with it, and perhaps even spark interest in dealing with the language.

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

This collection of Bornholmsk is curated by Leon Derczynski and Alex Speed Kjeldsen.

### Licensing Information

Creative Commons Attribution 4.0

### Citation Information

```
@inproceedings{derczynski-kjeldsen-2019-bornholmsk,
    title = "Bornholmsk Natural Language Processing: Resources and Tools",
    author = "Derczynski, Leon and Kjeldsen, Alex Speed",
    booktitle = "Proceedings of the 22nd Nordic Conference on Computational Linguistics",
    month = sep # "{--}" # oct,
    year = "2019",
    address = "Turku, Finland",
    publisher = {Link{\"o}ping University Electronic Press},
    url = "https://aclanthology.org/W19-6138",
    pages = "338--344",
}
```
Provider:
strombergnlp
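Summary of original information

Dataset overview

Dataset summary

  • Language: Bornholmsk, a variant of Danish spoken on the island of Bornholm; its bcp47 code is da-bornholm.
  • Purpose: to introduce natural language processing resources and tools for Bornholmsk.
  • Details: the dataset contains 13,169 lines and 175,167 words, with a total size of 801 KB.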

Supported tasks

  • Task category: text generation
  • Task ID: language modeling
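
To make the language-modeling task concrete, here is a minimal sketch of a character-bigram model of the kind that could be trained on the corpus text. It is an illustration only, not the method used in the paper; the function names `train_char_bigrams` and `next_char_probability` are hypothetical, and the training strings are placeholders rather than real Bornholmsk.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def train_char_bigrams(texts: Iterable[str]) -> Dict[str, Counter]:
    """Count how often each character follows another.

    '^' and '$' are added as line-start and line-end markers.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text in texts:
        padded = "^" + text + "$"
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def next_char_probability(counts: Dict[str, Counter], prev: str, nxt: str) -> float:
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 for unseen contexts."""
    total = sum(counts[prev].values()) if prev in counts else 0
    return counts[prev][nxt] / total if total else 0.0


# Placeholder strings, not taken from the corpus.
model = train_char_bigrams(["abab", "abba"])
print(next_char_probability(model, "a", "b"))  # → 0.75
```

In practice the `text` field of each record would be passed to `train_char_bigrams`; a real model would also need smoothing for unseen character pairs.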

Dataset structure

  • Data instances: 13,169 lines of data.
  • Data fields:
    • id: the sentence ID, an integer.
    • text: the Bornholmsk text, a string.
  • Data splits: monolithic (a single split).
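
The two-field schema above can be checked with a short sketch like the following. It assumes records arrive as dictionaries with `id` and `text` keys; the helper names `validate_record`, `corpus_stats`, and `load_bornholmsk` are hypothetical. Whether the published word count was computed by whitespace splitting is also an assumption, so the totals may not match the card exactly. The Hub loader relies on the `datasets` library and is imported lazily so that the helpers run offline.

```python
from typing import Iterable, Mapping, Tuple


def validate_record(record: Mapping[str, object]) -> bool:
    """Return True if the record has an int `id` and a str `text`."""
    record_id = record.get("id")
    return (
        isinstance(record_id, int)
        and not isinstance(record_id, bool)  # bool is a subclass of int
        and isinstance(record.get("text"), str)
    )


def corpus_stats(records: Iterable[Mapping[str, object]]) -> Tuple[int, int]:
    """Count lines (records) and whitespace-separated words."""
    lines = 0
    words = 0
    for record in records:
        if not validate_record(record):
            raise ValueError(f"unexpected record shape: {record!r}")
        lines += 1
        words += len(str(record["text"]).split())
    return lines, words


def load_bornholmsk():
    """Fetch the corpus from the Hugging Face Hub (requires `pip install datasets`)."""
    from datasets import load_dataset  # lazy import: helpers above need no network

    return load_dataset("strombergnlp/bornholmsk")


# Placeholder records, not taken from the corpus.
sample = [
    {"id": 0, "text": "first placeholder sentence"},
    {"id": 1, "text": "second one"},
]
print(corpus_stats(sample))  # → (2, 5)
```

Calling `load_bornholmsk()` returns whatever splits the Hub exposes; since the card describes the data as monolithic, iterating over the single available split and passing it to `corpus_stats` should give figures comparable to those listed above.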

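Dataset creation

  • Source data: collected from many places, including poems, songs, translations from Danish, folk stories, and dictionary entries.
  • Source language producers: native speakers of Bornholmsk who have produced works in their native language or translated them to Danish.
  • Annotations: none.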

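Considerations for using the data

  • Social impact: the dataset aims to capture Bornholmsk digitally, give NLP systems a way to interact with it, and perhaps spark interest in the language.
  • Discussion of biases: needs more information.
  • Other known limitations: needs more information.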

Additional information

  • Dataset curators: Leon Derczynski and Alex Speed Kjeldsen.

  • Licensing information: Creative Commons Attribution 4.0 (cc-by-4.0).

  • Citation information:

    @inproceedings{derczynski-kjeldsen-2019-bornholmsk,
        title = "Bornholmsk Natural Language Processing: Resources and Tools",
        author = "Derczynski, Leon and Kjeldsen, Alex Speed",
        booktitle = "Proceedings of the 22nd Nordic Conference on Computational Linguistics",
        month = sep # "{--}" # oct,
        year = "2019",
        address = "Turku, Finland",
        publisher = {Link{\"o}ping University Electronic Press},
        url = "https://aclanthology.org/W19-6138",
        pages = "338--344",
    }
