vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned

Name: vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned
Creator: vlhandfo
Published: 2026-03-10 08:41:29
License: No description available

Hugging Face, updated 2026-03-10, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned

Resource description:

---
dataset_info:
  features:
  - name: anchor
    dtype: large_string
  - name: positive
    dtype: large_string
  - name: negative
    dtype: large_string
  - name: language
    dtype: large_string
  - name: task
    dtype: large_string
  - name: id
    dtype: int64
  - name: source
    dtype: large_string
  splits:
  - name: train
    num_bytes: 267851341
    num_examples: 281938
  - name: validation
    num_bytes: 33501951
    num_examples: 35242
  - name: test
    num_bytes: 33625857
    num_examples: 35243
  download_size: 196284042
  dataset_size: 334979149
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Dataset Card for nordic-sentence-embedding-hard-negatives-cleaned

## Dataset Summary

This dataset is a cleaned and filtered triplet dataset for training sentence embedding models with hard negatives.

- **Repository**: `vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned`
- **Languages**: Danish, Norwegian, Swedish
- **Final schema**: `anchor`, `positive`, `negative`, `language`, `task`, `id`, `source`
- **Splits**:
  - `train`: 281,938
  - `validation`: 35,242
  - `test`: 35,243

Total rows: 352,423

## Source Data

The dataset is derived from:

- **Source dataset**: `DDSC/nordic-embedding-training-data`
- **Original train rows loaded**: 968,249

Each example in the final dataset carries a `source` field set to: `DDSC/nordic-embedding-training-data`

## Preprocessing

### 1) Task Filtering (hard-negative triplets)

Only rows with non-empty text in all three triplet fields were kept:

- `query` (renamed to `anchor` in final dataset)
- `positive`
- `negative`

In practice, this removed rows with missing negatives and produced a hard-negative-only triplet set.

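
The filtering rule above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the helper names `word_count` and `keep_complete_triplets` and the two sample rows are hypothetical, and treating a field as empty when its whitespace-separated word count is zero is an assumption based on the `*_length == 0` counts reported in the card.

```python
# Source field names as described in the card; "query" becomes "anchor".
TRIPLET_FIELDS = ("query", "positive", "negative")


def word_count(text):
    """Number of whitespace-separated words; None counts as zero (assumption)."""
    return len((text or "").split())


def keep_complete_triplets(rows):
    """Keep rows whose three triplet fields all contain at least one word."""
    kept = []
    for row in rows:
        if all(word_count(row.get(field)) > 0 for field in TRIPLET_FIELDS):
            cleaned = dict(row)  # copy so the input rows are left untouched
            cleaned["anchor"] = cleaned.pop("query")
            kept.append(cleaned)
    return kept


# Hypothetical sample rows, for illustration only.
sample = [
    {
        "query": "Hvad er hovedstaden i Danmark?",
        "positive": "København er Danmarks hovedstad.",
        "negative": "Aarhus er Danmarks næststørste by.",
    },
    {
        "query": "Vad är Sveriges huvudstad?",
        "positive": "Stockholm är Sveriges huvudstad.",
        "negative": "",  # missing negative: this row is dropped
    },
]

filtered = keep_complete_triplets(sample)
print(len(filtered), sorted(filtered[0]))
```

Rows without a negative dominate the removals reported for this step, which is why the result is effectively a hard-negative-only set.
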
From the notebook run:

- Rows removed due to empty fields: 584,280
- `query_length == 0`: 14
- `positive_length == 0`: 202
- `negative_length == 0`: 584,269

After this filtering and length clipping (below), remaining task distribution is:

- `unit-triple`: 185,888
- `retrieval`: 166,535

### 2) Length Clipping

Word-length features were computed for each row:

- `query_length`
- `positive_length`
- `negative_length`

On the non-empty subset, percentile thresholds were estimated per field:

- 2.5th percentile:
  - `query_length`: 2
  - `positive_length`: 6
  - `negative_length`: 6
- 97.5th percentile:
  - `query_length`: 37
  - `positive_length`: 233
  - `negative_length`: 133

Rows were removed if **any** length field was outside this percentile band.

- Rows removed by clipping: 31,546

This reduces extreme short/long outliers while preserving the bulk of the distribution.

### 3) Split Generation

Data was split with stratification over `(language, task)`:

- 80% train
- 10% validation
- 10% test

using `random_state=42`.

## Data Fields

- `anchor` (`string`): Query/anchor text (renamed from source `query`)
- `positive` (`string`): Semantically relevant text
- `negative` (`string`): Hard negative text
- `language` (`string`): Language label (`danish`, `norwegian`, `swedish`)
- `task` (`string`): Task type (`retrieval` or `unit-triple`)
- `id` (`int`): Row identifier from filtered dataframe reset index
- `source` (`string`): Constant source dataset identifier
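
Preprocessing steps 2 and 3 (length clipping and the stratified split) can be sketched as follows. This is a rough standard-library illustration under stated assumptions, not the notebook's code: the names `BANDS`, `within_bands`, and `stratified_split` and the toy rows are hypothetical, the percentile bounds are taken from the card and treated as inclusive, and the shuffle uses Python's `random` module, so it will not reproduce the published split membership even with seed 42.

```python
import random
from collections import defaultdict

# Word-count bands from the card (2.5th to 97.5th percentile), keyed by the
# final field names. Treating both bounds as inclusive is an assumption.
BANDS = {"anchor": (2, 37), "positive": (6, 233), "negative": (6, 133)}


def within_bands(row):
    """True when every text field's word count lies inside its band."""
    return all(lo <= len(row[f].split()) <= hi for f, (lo, hi) in BANDS.items())


def stratified_split(rows, seed=42, train_frac=0.8, val_frac=0.1):
    """Split each (language, task) group into train/validation/test."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row["language"], row["task"])].append(row)
    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(groups):  # fixed group order keeps the result repeatable
        members = list(groups[key])
        rng.shuffle(members)
        n_train = round(train_frac * len(members))
        n_val = round(val_frac * len(members))
        splits["train"] += members[:n_train]
        splits["validation"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]  # remainder goes to test
    return splits


# Toy data: 3 languages x 2 tasks x 10 rows, plus one row with a too-short anchor.
rows = [
    {"id": i, "language": lang, "task": task,
     "anchor": "ord ord ord", "positive": " ".join(["ord"] * 10),
     "negative": " ".join(["ord"] * 10)}
    for i, (lang, task) in enumerate(
        (lang, task)
        for lang in ("danish", "norwegian", "swedish")
        for task in ("retrieval", "unit-triple")
        for _ in range(10)
    )
]
rows.append({"id": 60, "language": "danish", "task": "retrieval",
             "anchor": "ord", "positive": " ".join(["ord"] * 10),
             "negative": " ".join(["ord"] * 10)})

clipped = [row for row in rows if within_bands(row)]
splits = stratified_split(clipped)
print({name: len(part) for name, part in splits.items()})
```

Splitting inside each `(language, task)` group keeps the language and task proportions roughly equal across the three splits, which is the point of stratifying on that pair.
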
