
NbAiLab/nbnn_language_detection

Source: Hugging Face. Updated 2023-10-12. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/NbAiLab/nbnn_language_detection
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- nb
- 'no'
- nn
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.jsonl"
  - split: train_a
    path: "trainA.jsonl"
  - split: train_b
    path: "trainB.jsonl"
  - split: train_nordic
    path: "train_nordic.jsonl"
  - split: train_cleaned
    path: "train_cleaned.jsonl"
  - split: dev
    path: "dev.jsonl"
  - split: dev_nordic
    path: "dev_nordic.jsonl"
  - split: test
    path: "test.jsonl"
  - split: test_nordiv
    path: "test_nordic.jsonl"
---

# Dataset Card for Bokmål-Nynorsk Language Detection (main_train_split)

## Dataset Summary

This dataset is intended for language detection between Bokmål and Nynorsk. It contains 800,000 sentence pairs, sourced from Språkbanken and pruned to avoid overlap with the NorBench dataset. The data comes from translations of news text from Norsk telegrambyrå (NTB), performed by Nynorsk pressekontor (NPK). In addition, the dev and test sets have 1,000 entries.

## Data Collection

- **Period**: February 2011 to December 2022
- **Source**: [Omsetjingsminne Nynorsk Pressekontor - Språkbanken](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-80/)
- **Size**: 800,000 sentence pairs
- **Format**: JSON-lines (with `text` and `language` fields)

### Processing Steps

1. Pruned to avoid overlap with NorBench
2. Deduplicated
3. Shuffled with a fixed seed (42)

## Usage

Intended for training Bokmål-Nynorsk detection models. For more details, refer to the repository where the dataset preparation script and the actual dataset reside.
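The card describes the files as JSON-lines with `text` and `language` fields. A minimal sketch of reading such a file with the Python standard library might look like the following. The sample sentences and the label strings (`"nob"`, `"nno"`) are assumptions made for illustration; the card does not state which label values the dataset actually uses.

```python
import io
import json
from collections import Counter

# Hypothetical sample records. The sentences and the label strings
# ("nob", "nno") are assumptions, not taken from the dataset.
SAMPLE_JSONL = """\
{"text": "Jeg vet ikke hva som skjedde.", "language": "nob"}
{"text": "Eg veit ikkje kva som skjedde.", "language": "nno"}
{"text": "Hun kommer fra Bergen.", "language": "nob"}
"""


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Each record is expected to carry both fields named in the card.
        if "text" not in record or "language" not in record:
            raise ValueError(f"missing field in record: {record!r}")
        yield record


records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
label_counts = Counter(r["language"] for r in records)
print(label_counts)
```

To read a downloaded file such as `dev.jsonl` instead of the in-memory sample, the same function can be given `open("dev.jsonl", encoding="utf-8")`.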
Provider:
NbAiLab
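Original information summary

Dataset Card for Bokmål-Nynorsk Language Detection (main_train_split)

Dataset Summary

This dataset is intended for language detection between Bokmål and Nynorsk. It contains 800,000 sentence pairs, sourced from Språkbanken and pruned to avoid overlap with the NorBench dataset. The data comes from Norsk telegrambyrå (NTB) news text translated by Nynorsk pressekontor (NPK). In addition, the dev and test sets each have 1,000 entries.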

Data Collection

Processing Steps

  1. Pruned to avoid overlap with NorBench
  2. Deduplicated
  3. Shuffled with a fixed seed (42)
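
The three processing steps above can be sketched roughly as follows. This is not the authors' preparation script: the function name `prepare`, exact string matching against the benchmark, and deduplication keyed on the `text` field are all assumptions made for illustration. Only the fixed seed of 42 comes from the card.

```python
import random


def prepare(records, benchmark_texts, seed=42):
    """Prune overlap with a benchmark, deduplicate, then shuffle.

    Hypothetical helper: the matching and deduplication rules here are
    assumptions, not the dataset authors' actual procedure.
    """
    benchmark = set(benchmark_texts)
    seen = set()
    kept = []
    for rec in records:
        text = rec["text"]
        # Step 1: drop records whose text also appears in the benchmark.
        if text in benchmark:
            continue
        # Step 2: keep only the first occurrence of each text.
        if text in seen:
            continue
        seen.add(text)
        kept.append(rec)
    # Step 3: shuffle reproducibly with a fixed seed (42 in the card).
    random.Random(seed).shuffle(kept)
    return kept


# Toy input with one duplicate ("a") and one benchmark overlap ("c").
records = [
    {"text": "a", "language": "nob"},
    {"text": "b", "language": "nno"},
    {"text": "a", "language": "nob"},
    {"text": "c", "language": "nob"},
    {"text": "d", "language": "nno"},
]
result = prepare(records, benchmark_texts=["c"])
print([r["text"] for r in result])
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the shuffle reproducible regardless of other random calls in the program.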

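Usage

Intended for training Bokmål-Nynorsk detection models. For more details, refer to the repository where the dataset preparation script and the actual dataset reside.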
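
As a starting point for training, one split can be fetched with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the Hub is reachable. The helper `load_split` is a hypothetical wrapper, and the split names are copied from the card's `configs` block (including `test_nordiv`, which is spelled that way there).

```python
# Split names as listed in the card's `configs` block
# ("test_nordiv" is spelled that way in the card).
SPLITS = (
    "train", "train_a", "train_b", "train_nordic", "train_cleaned",
    "dev", "dev_nordic", "test", "test_nordiv",
)


def load_split(name="dev"):
    """Load one split from the Hugging Face Hub (needs network access).

    Hypothetical convenience wrapper, not part of the dataset repository.
    """
    if name not in SPLITS:
        raise ValueError(f"unknown split {name!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset
    return load_dataset("NbAiLab/nbnn_language_detection", split=name)
```

Calling `load_split("dev")` should return a dataset whose rows carry the `text` and `language` fields described in the card. An unknown split name raises `ValueError` before any download is attempted.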
