NbAiLab/nbnn_language_detection

Name: NbAiLab/nbnn_language_detection
Creator: NbAiLab
Published: 2023-10-12 13:21:41
License: apache-2.0 (as declared in the dataset card)

Source: Hugging Face. Updated 2023-10-12; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/NbAiLab/nbnn_language_detection

Resource description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- nb
- 'no'
- nn
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.jsonl"
  - split: train_a
    path: "trainA.jsonl"
  - split: train_b
    path: "trainB.jsonl"
  - split: train_nordic
    path: "train_nordic.jsonl"
  - split: train_cleaned
    path: "train_cleaned.jsonl"
  - split: dev
    path: "dev.jsonl"
  - split: dev_nordic
    path: "dev_nordic.jsonl"
  - split: test
    path: "test.jsonl"
  - split: test_nordiv
    path: "test_nordic.jsonl"
---

# Dataset Card for Bokmål-Nynorsk Language Detection (main_train_split)

## Dataset Summary

This dataset is intended for language detection between Bokmål and Nynorsk. It contains 800,000 sentence pairs, sourced from Språkbanken and pruned to avoid overlap with the NorBench dataset. The data comes from translations of news text from Norsk telegrambyrå (NTB), performed by Nynorsk pressekontor (NPK). In addition, the dev and test sets each have 1,000 entries.

## Data Collection

- **Period**: February 2011 to December 2022
- **Source**: [Omsetjingsminne Nynorsk Pressekontor - Språkbanken](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-80/)
- **Size**: 800,000 sentence pairs
- **Format**: JSON-lines (with `text` and `language` fields)

### Processing Steps

1. Pruned to avoid overlap with NorBench
2. Deduplicated
3. Shuffled with a fixed seed (42)

## Usage

Intended for training Bokmål-Nynorsk detection models. For more details, refer to the repository where the dataset preparation script and the actual dataset reside.
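
The splits declared in the front matter could be loaded roughly as follows. This is a minimal sketch, not the maintainers' own loading code: the split-to-file mapping is copied from the `configs` block, while `load_split` is a hypothetical helper name that assumes the Hugging Face `datasets` library is installed and the Hub is reachable.

```python
# Split name -> file name, copied from the `configs` block of the dataset card.
DATA_FILES = {
    "train": "train.jsonl",
    "train_a": "trainA.jsonl",
    "train_b": "trainB.jsonl",
    "train_nordic": "train_nordic.jsonl",
    "train_cleaned": "train_cleaned.jsonl",
    "dev": "dev.jsonl",
    "dev_nordic": "dev_nordic.jsonl",
    "test": "test.jsonl",
    "test_nordiv": "test_nordic.jsonl",  # split name is spelled this way in the card
}

REPO_ID = "NbAiLab/nbnn_language_detection"


def load_split(split: str):
    """Load one split from the Hub (hypothetical helper; needs network access)."""
    if split not in DATA_FILES:
        raise KeyError(f"unknown split {split!r}; expected one of {sorted(DATA_FILES)}")
    # Imported lazily so the mapping above is usable without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_split("dev")` would then return the development split. Note that the Nordic test data is registered under the split name `test_nordiv`, even though its file is `test_nordic.jsonl`.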

Provider:

NbAiLab

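Summary of original information

Dataset Card for Bokmål-Nynorsk Language Detection (main_train_split)

Dataset Summary

This dataset is intended for language detection between Bokmål and Nynorsk. It contains 800,000 sentence pairs, sourced from Språkbanken and pruned to avoid overlap with the NorBench dataset. The data comes from Norsk telegrambyrå (NTB) news text translated by Nynorsk pressekontor (NPK). In addition, the dev and test sets each have 1,000 entries.

Data Collection

Period: February 2011 to December 2022
Source: Omsetjingsminne Nynorsk Pressekontor - Språkbanken
Size: 800,000 sentence pairs
Format: JSON-lines (with text and language fields)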
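
To illustrate the JSON-lines format, here is a small sketch that reads records and checks for the two documented fields, `text` and `language`. The sample sentences and the label strings `"nb"` and `"nn"` are assumptions made for illustration; the card does not state which values the `language` field actually holds.

```python
import io
import json
from collections import Counter

REQUIRED_FIELDS = ("text", "language")


def read_jsonl(stream):
    """Yield one dict per non-empty line, checking the two documented fields."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        yield record


# Made-up sample; the real label strings may differ from "nb"/"nn".
SAMPLE = io.StringIO(
    '{"text": "Jeg vet ikke hva hun heter.", "language": "nb"}\n'
    '{"text": "Eg veit ikkje kva ho heiter.", "language": "nn"}\n'
)

records = list(read_jsonl(SAMPLE))
label_counts = Counter(record["language"] for record in records)
```

The same `read_jsonl` function works on a file opened with `open("dev.jsonl", encoding="utf-8")` once the split has been downloaded.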

Processing Steps

Pruned to avoid overlap with NorBench
Deduplicated
Shuffled with a fixed seed (42)
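
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual preparation script (which lives in the dataset repository): overlap and duplicates are detected here by exact string match, and the function name `prepare` is hypothetical.

```python
import random


def prepare(sentences, benchmark_sentences, seed=42):
    """Prune overlap with a benchmark, deduplicate, then shuffle reproducibly."""
    blocked = set(benchmark_sentences)  # assumption: exact-match overlap check
    seen = set()
    kept = []
    for sentence in sentences:
        if sentence in blocked or sentence in seen:
            continue
        seen.add(sentence)
        kept.append(sentence)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(kept)
    return kept
```

Because the seed is fixed, two runs over the same input return the same order, which is what makes the published splits reproducible.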

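Usage

Intended for training Bokmål-Nynorsk detection models. For more details, refer to the repository where the dataset preparation script and the actual dataset reside.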
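
As a rough idea of what a detection model trained on such data might look like, the sketch below fits a character-trigram naive Bayes classifier on four made-up sentences. The classifier, the class name `TrigramNaiveBayes`, the toy sentences, and the labels `"nb"` and `"nn"` are all assumptions chosen for illustration; the card does not prescribe any particular model.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Return character trigrams of a lowercased, space-padded string."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    """Multinomial naive Bayes over character trigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(trigrams(text))
        self.vocab = {gram for grams in self.counts.values() for gram in grams}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            # The extra +1 reserves probability mass for unseen trigrams.
            denom = sum(grams.values()) + len(self.vocab) + 1
            score = math.log(self.label_counts[label] / total_docs)
            for gram in trigrams(text):
                score += math.log((grams[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up); real training would use the `train` split.
TEXTS = [
    "Jeg vet ikke hva hun heter.",
    "Vi kommer fra Norge.",
    "Eg veit ikkje kva ho heiter.",
    "Vi kjem frå Noreg.",
]
LABELS = ["nb", "nb", "nn", "nn"]

model = TrigramNaiveBayes().fit(TEXTS, LABELS)
```

Character n-grams are a common choice for separating closely related written standards, since the differences often show up in short spelling patterns such as "ikke" versus "ikkje".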
