NbAiLab/nbnn_language_detection
Source: Hugging Face. Updated 2023-10-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/NbAiLab/nbnn_language_detection
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- nb
- 'no'
- nn
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.jsonl"
  - split: train_a
    path: "trainA.jsonl"
  - split: train_b
    path: "trainB.jsonl"
  - split: train_nordic
    path: "train_nordic.jsonl"
  - split: train_cleaned
    path: "train_cleaned.jsonl"
  - split: dev
    path: "dev.jsonl"
  - split: dev_nordic
    path: "dev_nordic.jsonl"
  - split: test
    path: "test.jsonl"
  - split: test_nordiv
    path: "test_nordic.jsonl"
---
# Dataset Card for Bokmål-Nynorsk Language Detection (main_train_split)
## Dataset Summary
This dataset is intended for training language-detection models that distinguish Norwegian Bokmål from Nynorsk. It contains 800,000 sentence pairs, sourced from Språkbanken and pruned to avoid overlap with the NorBench dataset. The data comes from translations of news text from Norsk telegrambyrå (NTB), produced by Nynorsk pressekontor (NPK). In addition, the dev and test sets each contain 1,000 entries.
## Data Collection
- **Period**: February 2011 to December 2022
- **Source**: [Omsetjingsminne Nynorsk Pressekontor - Språkbanken](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-80/)
- **Size**: 800,000 sentence pairs
- **Format**: JSON-lines (with `text` and `language` fields)
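As a rough illustration of the format, the sketch below parses JSON-lines records with `text` and `language` fields and counts the labels. The two sample sentences are made up for this example, and the label values `"nb"` and `"nn"` are an assumption; check the actual files for the values they use.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up records in the documented shape. The label values "nb" and "nn"
# are assumed for illustration and may differ from the real data.
sample = [
    '{"text": "Jeg vet ikke hva det heter.", "language": "nb"}',
    '{"text": "Eg veit ikkje kva det heiter.", "language": "nn"}',
]

records = list(read_jsonl(sample))
label_counts = Counter(rec["language"] for rec in records)
print(label_counts)
```

Because `read_jsonl` accepts any iterable of strings, the same function should also work on an open file handle, for example one obtained from `open("train.jsonl", encoding="utf-8")`.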
### Processing Steps
1. Pruned to avoid overlap with NorBench
2. Deduplicated
3. Shuffled with a fixed seed (42)
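The three steps above could be sketched roughly as follows. This is not the dataset's actual preparation script: the function name `prepare`, the exact-string matching used for both pruning and deduplication, and the `held_out_texts` argument (standing in for NorBench sentences) are all assumptions made for illustration.

```python
import random
from typing import Iterable


def prepare(records: Iterable[dict], held_out_texts: set[str], seed: int = 42) -> list[dict]:
    """Prune overlap with a held-out set, deduplicate, then shuffle."""
    seen: set[str] = set()
    kept: list[dict] = []
    for rec in records:
        text = rec["text"]
        if text in held_out_texts:  # step 1: drop overlap (exact match, assumed)
            continue
        if text in seen:  # step 2: drop exact duplicates
            continue
        seen.add(text)
        kept.append(rec)
    # step 3: a dedicated Random instance keeps the shuffle reproducible
    random.Random(seed).shuffle(kept)
    return kept


raw = [
    {"text": "a", "language": "nb"},
    {"text": "a", "language": "nb"},  # duplicate
    {"text": "b", "language": "nn"},
    {"text": "c", "language": "nb"},  # pretend this one is in the benchmark
]
cleaned = prepare(raw, held_out_texts={"c"})
print(len(cleaned))
```

Using `random.Random(seed)` rather than the module-level `random.shuffle` avoids depending on global random state, so repeated runs over the same input give the same order.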
## Usage
Intended for training Bokmål-Nynorsk detection models. For more details, refer to the repository where the dataset preparation script and the actual dataset reside.
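To make the intended use concrete, here is a minimal baseline sketch: a character-trigram Naive Bayes classifier with add-one smoothing, written with the standard library only. It is an illustrative baseline and not the model or method used by the dataset authors. The class name `NgramNaiveBayes`, the toy sentences, and the label values `"nb"` and `"nn"` are assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Character n-gram Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.ngram_counts: dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def fit(self, records):
        for rec in records:
            grams = char_ngrams(rec["text"], self.n)
            self.ngram_counts[rec["language"]].update(grams)
            self.label_counts[rec["language"]] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text: str):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus smoothed log likelihood of each n-gram
            score = math.log(self.label_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data; sentences and label values are made up for illustration.
toy_train = [
    {"text": "Jeg vet ikke hva det heter.", "language": "nb"},
    {"text": "Hun kommer ikke fra Norge.", "language": "nb"},
    {"text": "Eg veit ikkje kva det heiter.", "language": "nn"},
    {"text": "Ho kjem ikkje frå Noreg.", "language": "nn"},
]
model = NgramNaiveBayes().fit(toy_train)
print(model.predict("Jeg vet ikke"), model.predict("Eg veit ikkje"))
```

In practice the records would come from the `train` split rather than a toy list, with accuracy measured on `dev` and `test`.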
Provider:
NbAiLab



