vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2

Name: vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2
Creator: vlhandfo
Published: 2026-03-30 09:34:59
License: No description available

Hugging Face, updated 2026-03-30, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2

Resource description:

---
dataset_info:
  features:
  - name: anchor
    dtype: large_string
  - name: positive
    dtype: large_string
  - name: negative
    dtype: large_string
  - name: language
    dtype: large_string
  - name: task
    dtype: large_string
  - name: id
    dtype: int64
  - name: source
    dtype: large_string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 320845965
    num_examples: 334801
  - name: validation
    num_bytes: 8452703
    num_examples: 8811
  - name: test
    num_bytes: 8499865
    num_examples: 8811
  download_size: 198686997
  dataset_size: 337798533
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
language:
- da
- 'no'
- sv
task_categories:
- text-classification
---

# Dataset Card for nordic-sentence-embedding-hard-negatives-cleaned

## Dataset Summary

This dataset is a cleaned and filtered triplet dataset for training sentence embedding models with hard negatives.

- **Repository**: `vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned-v2`
- **Languages**: Danish, Norwegian, Swedish
- **Final schema**: `anchor`, `positive`, `negative`, `language`, `task`, `id`, `source`
- **Splits**:
  - `train`: 334,801
  - `validation`: 8,811
  - `test`: 8,811

Total rows: 352,423

## Source Data

The dataset is derived from:

- **Source dataset**: [DDSC/nordic-embedding-training-data](https://huggingface.co/datasets/DDSC/nordic-embedding-training-data)
- **Original train rows loaded**: 968,249

Each example in the final dataset carries a `source` field set to: `DDSC/nordic-embedding-training-data`

## Preprocessing

### 1) Task Filtering (hard-negative triplets)

Only rows with non-empty text in all three triplet fields were kept:

- `query` (renamed to `anchor` in final dataset)
- `positive`
- `negative`

In practice, this removed rows with missing negatives and produced a hard-negative-only triplet set.
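As a rough illustration of this filter, the sketch below keeps only rows whose `query`, `positive`, and `negative` fields are all non-empty and renames `query` to `anchor`. It is not the notebook's code: the helper name `filter_triplets`, the toy rows, and the choice to treat `None` and whitespace-only strings as empty are assumptions made for the example.

```python
TRIPLET_FIELDS = ("query", "positive", "negative")


def filter_triplets(rows):
    """Keep rows with non-empty text in all triplet fields; rename query -> anchor."""
    kept = []
    for row in rows:
        # Assumption: None and whitespace-only strings count as empty.
        if all((row.get(field) or "").strip() for field in TRIPLET_FIELDS):
            out = {key: value for key, value in row.items() if key != "query"}
            out["anchor"] = row["query"]
            kept.append(out)
    return kept


# Toy rows invented for illustration; they are not taken from the dataset.
rows = [
    {"query": "hvad er en kat", "positive": "En kat er et dyr.", "negative": "En hund er et dyr."},
    {"query": "vad är en bil", "positive": "En bil är ett fordon.", "negative": ""},
    {"query": "", "positive": "Oslo er en by.", "negative": "Bergen er en by."},
]

triplets = filter_triplets(rows)
print(len(triplets), "of", len(rows), "rows kept")
```

Rows with an empty `negative` are the common case in the source data, which is why this single check is enough to produce a hard-negative-only set.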
From the notebook run:

- Rows removed due to empty fields: 584,280
  - `query_length == 0`: 14
  - `positive_length == 0`: 202
  - `negative_length == 0`: 584,269

After this filtering and length clipping (below), remaining task distribution is:

- `unit-triple`: 185,888
- `retrieval`: 166,535

### 2) Length Clipping

Word-length features were computed for each row:

- `query_length`
- `positive_length`
- `negative_length`

On the non-empty subset, percentile thresholds were estimated per field:

- 2.5th percentile:
  - `query_length`: 2
  - `positive_length`: 6
  - `negative_length`: 6
- 97.5th percentile:
  - `query_length`: 37
  - `positive_length`: 233
  - `negative_length`: 133

Rows were removed if **any** length field was outside this percentile band.

- Rows removed by clipping: 31,546

This reduces extreme short/long outliers while preserving the bulk of the distribution.

### 3) Split Generation

Data was split with stratification over `(language, task)`:

- 95% train
- 2.5% validation
- 2.5% test

## Data Fields

- `anchor` (`string`): Query/anchor text (renamed from source `query`)
- `positive` (`string`): Semantically relevant text
- `negative` (`string`): Hard negative text
- `language` (`string`): Language label (`danish`, `norwegian`, `swedish`)
- `task` (`string`): Task type (`retrieval` or `unit-triple`)
- `id` (`int`): Row identifier from filtered dataframe reset index
- `source` (`string`): Constant source dataset identifier
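The length clipping and stratified split described in the preprocessing steps can be sketched in plain Python as follows. This is an illustrative approximation, not the notebook's implementation. It assumes that word length means a whitespace-separated token count, that percentiles are computed with `statistics.quantiles` (inclusive method), and that split sizes are rounded within each `(language, task)` group. The helper names and the synthetic rows are hypothetical, and the example works on the already-renamed `anchor` field.

```python
import random
import statistics
from collections import defaultdict

LENGTH_FIELDS = ("anchor", "positive", "negative")


def word_length(text):
    # Assumption: "word length" is the number of whitespace-separated tokens.
    return len(text.split())


def percentile_band(values):
    """Return the (2.5th, 97.5th) percentiles: first and last of 39 cut points."""
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def clip_by_length(rows):
    """Drop rows where any field's word count falls outside its percentile band."""
    bands = {
        field: percentile_band([word_length(row[field]) for row in rows])
        for field in LENGTH_FIELDS
    }
    return [
        row for row in rows
        if all(
            bands[field][0] <= word_length(row[field]) <= bands[field][1]
            for field in LENGTH_FIELDS
        )
    ]


def stratified_split(rows, val_frac=0.025, test_frac=0.025, seed=0):
    """Split rows into train/validation/test within each (language, task) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["language"], row["task"])].append(row)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(groups):
        group = groups[key]
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        val += group[:n_val]
        test += group[n_val:n_val + n_test]
        train += group[n_val + n_test:]
    return train, val, test


# Synthetic rows (invented): five of them carry an extremely long anchor.
languages = ["danish", "norwegian", "swedish"]
tasks = ["retrieval", "unit-triple"]
rows = [
    {
        "anchor": " ".join(["w"] * (200 if i < 5 else 1 + i % 20)),
        "positive": " ".join(["w"] * (6 + i % 20)),
        "negative": " ".join(["w"] * (6 + i % 20)),
        "language": languages[i % 3],
        "task": tasks[i % 2],
    }
    for i in range(400)
]

clipped = clip_by_length(rows)
train, val, test = stratified_split(clipped)
print(len(rows), len(clipped), len(train), len(val), len(test))
```

Splitting inside each group keeps the `(language, task)` proportions roughly equal across the three splits, which is the point of stratifying rather than sampling rows uniformly.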

Provider:

vlhandfo
