vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2
Source: Hugging Face. Updated 2026-03-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2
Resource description:
---
dataset_info:
  features:
  - name: anchor
    dtype: large_string
  - name: positive
    dtype: large_string
  - name: negative
    dtype: large_string
  - name: language
    dtype: large_string
  - name: task
    dtype: large_string
  - name: id
    dtype: int64
  - name: source
    dtype: large_string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 320845965
    num_examples: 334801
  - name: validation
    num_bytes: 8452703
    num_examples: 8811
  - name: test
    num_bytes: 8499865
    num_examples: 8811
  download_size: 198686997
  dataset_size: 337798533
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
language:
- da
- 'no'
- sv
task_categories:
- text-classification
---
# Dataset Card for nordic-sentence-embedding-hard-negatives-cleaned-v2
## Dataset Summary
This dataset contains cleaned and filtered (`anchor`, `positive`, `negative`) triplets for training sentence embedding models with hard negatives.
- **Repository**: `vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2`
- **Languages**: Danish, Norwegian, Swedish
- **Final schema**: `anchor`, `positive`, `negative`, `language`, `task`, `id`, `source`
- **Splits**:
  - `train`: 334,801
  - `validation`: 8,811
  - `test`: 8,811

Total rows: 352,423
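
As a minimal sketch of how the splits might be consumed, the snippet below loads one split with the Hugging Face `datasets` library and reduces each row to a plain triplet. The helper names `load_triplets` and `as_triplet` are hypothetical and not part of the dataset; the loader needs network access and `pip install datasets`.

```python
REPO_ID = "vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2"
TRIPLET_COLUMNS = ("anchor", "positive", "negative")


def load_triplets(split="train"):
    """Load one split and keep only the three triplet text columns.

    Hypothetical helper: downloads the data on first call.
    """
    # Imported lazily so the module can be imported without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, split=split)
    return ds.select_columns(list(TRIPLET_COLUMNS))


def as_triplet(row):
    """Turn one row (a dict) into an (anchor, positive, negative) tuple."""
    return tuple(row[column] for column in TRIPLET_COLUMNS)
```

Calling `load_triplets("validation")` and mapping `as_triplet` over the rows would yield tuples in the order most triplet losses expect; the metadata columns (`language`, `task`, `id`, `source`) are dropped here only for brevity.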
## Source Data
The dataset is derived from:
- **Source dataset**: [DDSC/nordic-embedding-training-data](https://huggingface.co/datasets/DDSC/nordic-embedding-training-data)
- **Original train rows loaded**: 968,249
Each example in the final dataset carries a `source` field set to:
`DDSC/nordic-embedding-training-data`
## Preprocessing
### 1) Task Filtering (hard-negative triplets)
Only rows with non-empty text in all three triplet fields were kept:
- `query` (renamed to `anchor` in final dataset)
- `positive`
- `negative`
In practice, this removed rows with missing negatives and produced a hard-negative-only triplet set.
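
The filter above can be sketched in plain Python as follows. This is an illustration, not the notebook's actual code: treating `None` and whitespace-only strings as empty is an assumption, and the toy rows are invented for the example.

```python
TRIPLET_FIELDS = ("query", "positive", "negative")


def is_complete_triplet(row):
    """True when query, positive and negative all contain some text.

    Assumption: None and whitespace-only values count as empty.
    """
    return all((row.get(field) or "").strip() for field in TRIPLET_FIELDS)


def to_final_row(row):
    """Copy the row and rename `query` to `anchor`, as in the final schema."""
    out = dict(row)
    out["anchor"] = out.pop("query")
    return out


# Invented toy rows: only the first has all three fields filled in.
rows = [
    {"query": "q1", "positive": "p1", "negative": "n1"},
    {"query": "q2", "positive": "p2", "negative": ""},
    {"query": "q3", "positive": "p3", "negative": None},
]

kept = [to_final_row(row) for row in rows if is_complete_triplet(row)]
```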
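From the notebook run:
- Rows removed due to empty fields: 584,280
  - `query_length == 0`: 14
  - `positive_length == 0`: 202
  - `negative_length == 0`: 584,269

The per-field counts overlap (a row can have more than one empty field), so they sum to more than the total number of rows removed.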
After this filtering and the length clipping described below, the remaining task distribution is:
- `unit-triple`: 185,888
- `retrieval`: 166,535
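
The row counts reported in this card are mutually consistent, which the following arithmetic check makes explicit. All numbers are copied from the card; only the variable names are introduced here.

```python
original_rows = 968_249      # original train rows loaded
removed_empty = 584_280      # rows removed due to empty fields
removed_clipping = 31_546    # rows removed by length clipping

after_empty_filter = original_rows - removed_empty
final_rows = after_empty_filter - removed_clipping

task_counts = {"unit-triple": 185_888, "retrieval": 166_535}
split_counts = {"train": 334_801, "validation": 8_811, "test": 8_811}

# Both breakdowns add up to the same final row count.
assert final_rows == sum(task_counts.values())
assert final_rows == sum(split_counts.values())

# Per-field empty counts overlap, so their sum is an upper bound on the
# rows removed and the largest single count is a lower bound.
per_field_empty = {"query": 14, "positive": 202, "negative": 584_269}
assert max(per_field_empty.values()) <= removed_empty <= sum(per_field_empty.values())
```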
### 2) Length Clipping
Length features, measured in words, were computed for each row:
- `query_length`
- `positive_length`
- `negative_length`
On the non-empty subset, percentile thresholds were estimated per field:
- 2.5th percentile:
  - `query_length`: 2
  - `positive_length`: 6
  - `negative_length`: 6
- 97.5th percentile:
  - `query_length`: 37
  - `positive_length`: 233
  - `negative_length`: 133
Rows were removed if **any** length field was outside this percentile band.
- Rows removed by clipping: 31,546
This reduces extreme short/long outliers while preserving the bulk of the distribution.
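
A minimal sketch of the clipping rule, using the thresholds reported above. Two details are assumptions rather than facts from the card: words are counted by splitting on whitespace, and the bounds are treated as inclusive.

```python
# (lower, upper) word-count bounds per field, from the reported percentiles.
BOUNDS = {
    "query": (2, 37),
    "positive": (6, 233),
    "negative": (6, 133),
}


def word_count(text):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len(text.split())


def within_band(row):
    """True when every field's word count lies inside its band (inclusive)."""
    return all(
        lower <= word_count(row[field]) <= upper
        for field, (lower, upper) in BOUNDS.items()
    )


def make_text(n_words):
    """Build a dummy text with exactly n_words words, for illustration only."""
    return " ".join(["w"] * n_words)


ok_row = {"query": make_text(3), "positive": make_text(6), "negative": make_text(6)}
short_query = {"query": make_text(1), "positive": make_text(6), "negative": make_text(6)}
long_negative = {"query": make_text(3), "positive": make_text(6), "negative": make_text(134)}
```

Here `within_band(ok_row)` holds, while `short_query` and `long_negative` would each be dropped because a single field falls outside its band.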
### 3) Split Generation
Data was split with stratification over `(language, task)`:
- 95% train
- 2.5% validation
- 2.5% test
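
The card does not say which tool produced the split, so the following is only a standard-library sketch of a 95/2.5/2.5 split stratified over `(language, task)`. The function name, the `seed` parameter, and the per-stratum rounding are assumptions, and the demo rows are synthetic.

```python
import random
from collections import defaultdict


def stratified_split(rows, val_fraction=0.025, test_fraction=0.025, seed=0):
    """Split rows into train/validation/test within each (language, task) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["language"], row["task"])].append(row)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for key in sorted(groups):  # sorted for a reproducible order
        members = groups[key]
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        n_test = round(len(members) * test_fraction)
        validation.extend(members[:n_val])
        test.extend(members[n_val:n_val + n_test])
        train.extend(members[n_val + n_test:])
    return train, validation, test


# Synthetic demo: 200 rows for each of the 3 x 2 (language, task) strata.
demo_rows = [
    {"language": language, "task": task, "id": i}
    for language in ("danish", "norwegian", "swedish")
    for task in ("retrieval", "unit-triple")
    for i in range(200)
]
demo_train, demo_validation, demo_test = stratified_split(demo_rows)
```

Splitting inside each stratum keeps the language and task proportions of the validation and test sets close to those of the training set.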
## Data Fields
- `anchor` (`string`): Query/anchor text (renamed from source `query`)
- `positive` (`string`): Semantically relevant text
- `negative` (`string`): Hard negative text
- `language` (`string`): Language label (`danish`, `norwegian`, `swedish`)
- `task` (`string`): Task type (`retrieval` or `unit-triple`)
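- `id` (`int`): Row identifier, taken from the reset index of the filtered dataframe
- `source` (`string`): Constant source dataset identifier

The metadata above also lists an `__index_level_0__` column (`int64`), which is not part of the final schema; it is most likely a leftover pandas index.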
Provider:
vlhandfo



