barbaroo/Faroese_BLARK_small
Source: Hugging Face. Updated 2024-05-29; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/barbaroo/Faroese_BLARK_small
Resource description:
---
task_categories:
- text-generation
language:
- fo
license: cc-by-4.0
---
# Dataset Card for Faroese_BLARK_small
## Dataset Description
All sentences are retrieved from:
- **Paper:** Annika Simonsen, Sandra Saxov Lamhauge, Iben Nyholm Debess, and Peter Juel Henrichsen. 2022. Creating a Basic Language Resource Kit for Faroese. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4637–4643, Marseille, France. European Language Resources Association.
### Dataset Summary
This dataset is a filtered version of the corpus (35.6 M tokens) first published as BLARK - Basic Language Resource Kit for Faroese.
The pre-processing and filtering steps include:
- Normalize encoding to UTF-8
- Remove short sentences (fewer than 10 units, where units are separated by spaces)
- Remove archaic Faroese
- Remove separators ('\r', '\t', '\n')
- Remove non-standard formatting. Examples: '§§', ' | ', '**', ' • ', ' • ', '.- ', ': ?', '.?', '\xa0', '\xad', '_ _', '. .', etc.
- Remove (most) numbered lists, in formats such as 1), 1:, Stk. 1, etc.
- Replace any run of question marks, exclamation marks, or full stops with a single one. Example: !!!!!! -> !
- Remove URLs that start with http
- Remove sentences with little or no linguistic content. In practice: all sentences where more than half of the characters (excluding spaces) are digits, punctuation, or uppercase letters (acronyms and initials)
- Remove duplicates
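
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the function names, the `NOISE` subset, the choice to replace removed strings with a space, and the rule of collapsing only runs of the same punctuation mark are all assumptions. Counting every uppercase letter toward the "little linguistic content" ratio is also a simplification. Removal of archaic Faroese and of numbered lists is not covered here.

```python
import re

MIN_UNITS = 10  # sentences with fewer space-separated units are dropped

# Assumed subset of the non-standard formatting strings listed in the card.
NOISE = ["§§", " | ", "**", " • ", ".- ", "_ _"]


def clean(sentence: str) -> str:
    """Apply the character-level clean-up steps to one sentence."""
    # Drop URLs that start with http (up to the next whitespace).
    sentence = re.sub(r"http\S+", " ", sentence)
    # Replace separators and non-breaking spaces with a plain space.
    sentence = re.sub(r"[\r\t\n\xa0]", " ", sentence)
    # Soft hyphens sit inside words, so delete them without adding a space.
    sentence = sentence.replace("\xad", "")
    for noise in NOISE:
        sentence = sentence.replace(noise, " ")
    # Collapse runs of the same mark: "!!!!!!" -> "!", "..." -> "."
    sentence = re.sub(r"([!?.])\1+", r"\1", sentence)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r" {2,}", " ", sentence).strip()


def is_low_content(sentence: str) -> bool:
    """True if more than half of the non-space characters are digits,
    punctuation/symbols, or uppercase letters."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return True
    noisy = sum(1 for c in chars if c.isdigit() or c.isupper() or not c.isalnum())
    return noisy > len(chars) / 2


def filter_corpus(sentences):
    """Clean each sentence, then drop short, low-content, and duplicate ones."""
    seen = set()
    kept = []
    for raw in sentences:
        s = clean(raw)
        if len(s.split()) < MIN_UNITS:
            continue
        if is_low_content(s):
            continue
        if s in seen:
            continue
        seen.add(s)
        kept.append(s)
    return kept
```

Cleaning runs before the length check and de-duplication, so sentences that differ only in formatting noise are treated as the same sentence. Whether the original pipeline used this order is not stated in the card.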
### Supported Tasks and Leaderboards
Suitable for masked language modeling (MLM) and causal language modeling (CLM).
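
To make the two objectives concrete, the sketch below turns one tokenized sentence into a CLM example and an MLM example. It is only an illustration under simplifying assumptions: whitespace tokenization stands in for a real subword tokenizer, `MASK` is a hypothetical placeholder token, the function names are invented here, and the random-replacement variants used by BERT-style masking are left out.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; real tokenizers define their own


def clm_example(tokens):
    """Causal LM: the target at each position is the next token."""
    return tokens[:-1], tokens[1:]


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens and keep labels only for those."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels
```

For the tokens `["Hetta", "er", "ein", "setningur"]`, `clm_example` pairs the inputs `["Hetta", "er", "ein"]` with the targets `["er", "ein", "setningur"]`.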
Provider:
barbaroo