
barbaroo/Faroese_BLARK_small

Source: Hugging Face. Updated 2024-05-29. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/barbaroo/Faroese_BLARK_small
Resource description:
---
task_categories:
- text-generation
language:
- fo
license: cc-by-4.0
---

# Dataset Card for Faroese_BLARK_small

## Dataset Description

All sentences are retrieved from:

- **Paper:** Annika Simonsen, Sandra Saxov Lamhauge, Iben Nyholm Debess, and Peter Juel Henrichsen. 2022. Creating a Basic Language Resource Kit for Faroese. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4637–4643, Marseille, France. European Language Resources Association.

### Dataset Summary

This dataset is a filtered version of the corpus (35.6 M tokens) first published as BLARK - Basic Language Resource Kit for Faroese. The pre-processing and filtering steps include:

- Normalize format to UTF-8
- Remove shorter sentences (fewer than 10 units, where units are separated by spaces)
- Remove archaic Faroese
- Remove separators ('\r', '\t', '\n')
- Remove non-standard formatting. Examples: '§§', ' | ', '**', ' • ', ' • ', '.- ', ': ?', '.?', '\xa0', '\xad', '_ _', '. .', etc.
- Remove (most) numbered lists, of formats: 1), 1:, Stk. 1 etc.
- Replace any run of question marks, exclamation marks, or full stops with a single one. Example: !!!!!! -> !
- Remove websites that start with http
- Remove sentences with little or no linguistic content. In practice: all sentences where more than half of the characters (excluding spaces) are digits, punctuation, or capital letters (acronyms and initials)
- Remove duplicates

### Supported Tasks and Leaderboards

Suitable for masked language modeling (MLM) and causal language modeling (CLM).
Provider:
barbaroo
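Original information summary

Dataset card: Faroese_BLARK_small

Dataset description

Dataset overview

This dataset is a filtered version of the corpus (35.6 M tokens) first published as BLARK (Basic Language Resource Kit for Faroese). The pre-processing and filtering steps include:

  • Normalize format to UTF-8
  • Remove short sentences (fewer than 10 units, where units are separated by spaces)
  • Remove archaic Faroese
  • Remove separators ('\r', '\t', '\n')
  • Remove non-standard formatting. Examples: '§§', ' | ', '**', ' • ', ' • ', '.- ', ': ?', '.?', '\xa0', '\xad', '_ _', '. .', etc.
  • Remove (most) numbered lists, of formats: 1), 1:, Stk. 1 etc.
  • Replace any run of question marks, exclamation marks, or full stops with a single one. Example: !!!!!! -> !
  • Remove websites that start with http
  • Remove sentences with little or no linguistic content. In practice: all sentences where more than half of the characters (excluding spaces) are digits, punctuation, or capital letters (acronyms and initials)
  • Remove duplicate sentences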
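
A few of the steps above can be sketched in Python. This is a minimal illustration and not the authors' actual pipeline: the function names (`normalize`, `has_little_linguistic_content`, `filter_sentences`) are hypothetical, and several choices are assumptions, namely that separators are replaced by a space, that only the URL token (not the whole sentence) is dropped, that only runs of the same punctuation mark are collapsed, and that every uppercase letter counts as "caps-lock". Removal of archaic Faroese, numbered lists, and the non-standard formatting strings is not covered.

```python
import re

MIN_UNITS = 10  # threshold stated in the dataset card

SEPARATORS = re.compile(r"[\r\t\n]")
URL = re.compile(r"http\S+")                # assumption: drop only the URL token
REPEATED_PUNCT = re.compile(r"([!?.])\1+")  # assumption: runs of the same mark


def normalize(sentence: str) -> str:
    """Strip separators and URLs, collapse repeated end punctuation."""
    sentence = SEPARATORS.sub(" ", sentence)
    sentence = URL.sub("", sentence)
    sentence = REPEATED_PUNCT.sub(r"\1", sentence)
    return " ".join(sentence.split())  # squeeze leftover whitespace


def has_little_linguistic_content(sentence: str) -> bool:
    """True if more than half of the non-space characters are digits,
    punctuation/symbols, or uppercase letters."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return True
    non_linguistic = sum(
        1 for c in chars if c.isdigit() or c.isupper() or not c.isalnum()
    )
    return non_linguistic > len(chars) / 2


def filter_sentences(sentences):
    """Apply normalization, the length and content filters, and deduplication."""
    seen = set()
    kept = []
    for raw in sentences:
        s = normalize(raw)
        if len(s.split()) < MIN_UNITS:  # units are space-separated
            continue
        if has_little_linguistic_content(s):
            continue
        if s in seen:  # exact-match deduplication
            continue
        seen.add(s)
        kept.append(s)
    return kept
```

Deduplicating after normalization means that two sentences differing only in repeated punctuation or stray separators are treated as the same sentence; whether the original filtering did this in the same order is not stated in the card.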

Supported tasks and leaderboards

Suitable for masked language modeling (MLM) and causal language modeling (CLM).
