
vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned

Hugging Face | Updated: 2026-03-10 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned
Resource description:
---
dataset_info:
  features:
  - name: anchor
    dtype: large_string
  - name: positive
    dtype: large_string
  - name: negative
    dtype: large_string
  - name: language
    dtype: large_string
  - name: task
    dtype: large_string
  - name: id
    dtype: int64
  - name: source
    dtype: large_string
  splits:
  - name: train
    num_bytes: 267851341
    num_examples: 281938
  - name: validation
    num_bytes: 33501951
    num_examples: 35242
  - name: test
    num_bytes: 33625857
    num_examples: 35243
  download_size: 196284042
  dataset_size: 334979149
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Dataset Card for nordic-sentence-embedding-hard-negatives-cleaned

## Dataset Summary

This dataset is a cleaned and filtered triplet dataset for training sentence embedding models with hard negatives.

- **Repository**: `vlhandfo/nordic-sentence-embedding-hard-negatives-cleaned`
- **Languages**: Danish, Norwegian, Swedish
- **Final schema**: `anchor`, `positive`, `negative`, `language`, `task`, `id`, `source`
- **Splits**:
  - `train`: 281,938
  - `validation`: 35,242
  - `test`: 35,243

Total rows: 352,423

## Source Data

The dataset is derived from:

- **Source dataset**: `DDSC/nordic-embedding-training-data`
- **Original train rows loaded**: 968,249

Each example in the final dataset carries a `source` field set to: `DDSC/nordic-embedding-training-data`

## Preprocessing

### 1) Task Filtering (hard-negative triplets)

Only rows with non-empty text in all three triplet fields were kept:

- `query` (renamed to `anchor` in final dataset)
- `positive`
- `negative`

In practice, this removed rows with missing negatives and produced a hard-negative-only triplet set.

From the notebook run:

- Rows removed due to empty fields: 584,280
  - `query_length == 0`: 14
  - `positive_length == 0`: 202
  - `negative_length == 0`: 584,269

After this filtering and length clipping (below), remaining task distribution is:

- `unit-triple`: 185,888
- `retrieval`: 166,535

### 2) Length Clipping

Word-length features were computed for each row:

- `query_length`
- `positive_length`
- `negative_length`

On the non-empty subset, percentile thresholds were estimated per field:

- 2.5th percentile:
  - `query_length`: 2
  - `positive_length`: 6
  - `negative_length`: 6
- 97.5th percentile:
  - `query_length`: 37
  - `positive_length`: 233
  - `negative_length`: 133

Rows were removed if **any** length field was outside this percentile band.

- Rows removed by clipping: 31,546

This reduces extreme short/long outliers while preserving the bulk of the distribution.

### 3) Split Generation

Data was split with stratification over `(language, task)`:

- 80% train
- 10% validation
- 10% test

using `random_state=42`.

## Data Fields

- `anchor` (`string`): Query/anchor text (renamed from source `query`)
- `positive` (`string`): Semantically relevant text
- `negative` (`string`): Hard negative text
- `language` (`string`): Language label (`danish`, `norwegian`, `swedish`)
- `task` (`string`): Task type (`retrieval` or `unit-triple`)
- `id` (`int`): Row identifier from filtered dataframe reset index
- `source` (`string`): Constant source dataset identifier
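The task filtering and length clipping steps described above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names (`word_length`, `keep_row`, `to_final_schema`, `clean`) are hypothetical, word length is assumed to mean whitespace-separated tokens, and the percentile band is assumed to be inclusive at both ends. The thresholds themselves are the values reported in the card.

```python
# Thresholds are the 2.5th / 97.5th percentile values reported in the card.
# Inclusive bounds and whitespace tokenisation are assumptions of this sketch.
LENGTH_BANDS = {
    "query": (2, 37),
    "positive": (6, 233),
    "negative": (6, 133),
}


def word_length(text):
    """Count whitespace-separated tokens; None or blank text counts as 0."""
    return len(text.split()) if text else 0


def keep_row(row):
    """Return True if all three fields are non-empty and inside their band."""
    for field, (low, high) in LENGTH_BANDS.items():
        n = word_length(row.get(field))
        if n == 0 or not (low <= n <= high):
            return False
    return True


def to_final_schema(row, row_id, source="DDSC/nordic-embedding-training-data"):
    """Rename `query` to `anchor` and attach `id` and the constant `source`."""
    return {
        "anchor": row["query"],
        "positive": row["positive"],
        "negative": row["negative"],
        "language": row["language"],
        "task": row["task"],
        "id": row_id,
        "source": source,
    }


def clean(rows):
    """Filter and clip the rows, then number the survivors from 0."""
    kept = [r for r in rows if keep_row(r)]
    return [to_final_schema(r, i) for i, r in enumerate(kept)]
```

Numbering the surviving rows from 0 mirrors the card's description of `id` as coming from a reset index on the filtered dataframe.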
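The stratified 80/10/10 split over `(language, task)` could look roughly like the following. The card gives `random_state=42` but does not name the library used, so this standard-library version is only a stand-in: the function name `stratified_split` is hypothetical, and it will not reproduce the exact row membership of the published splits.

```python
import random
from collections import defaultdict


def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/validation/test within each (language, task) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row["language"], row["task"])].append(row)

    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(groups):  # sorted keys keep the result reproducible
        members = groups[key]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        splits["train"].extend(members[:n_train])
        splits["validation"].extend(members[n_train:n_train + n_val])
        # The test split takes whatever remains in the group.
        splits["test"].extend(members[n_train + n_val:])
    return splits
```

Splitting inside each group keeps the language and task proportions roughly equal across the three splits, which is the point of stratifying.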
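As a quick sanity check, the row counts reported in the card are mutually consistent. The short script below uses only numbers quoted above; the variable names are illustrative.

```python
ORIGINAL_ROWS = 968_249
REMOVED_EMPTY = 584_280
REMOVED_CLIPPING = 31_546
SPLITS = {"train": 281_938, "validation": 35_242, "test": 35_243}
TASKS = {"unit-triple": 185_888, "retrieval": 166_535}
EMPTY_BY_FIELD = {"query": 14, "positive": 202, "negative": 584_269}

# Rows left after both removal steps should match the split and task totals.
remaining = ORIGINAL_ROWS - REMOVED_EMPTY - REMOVED_CLIPPING
total = sum(SPLITS.values())

# Per-field empty counts add up to more than the rows removed, so some
# rows must have been empty in more than one field.
excess = sum(EMPTY_BY_FIELD.values()) - REMOVED_EMPTY

print(remaining, total, sum(TASKS.values()))  # 352423 352423 352423
print(excess)  # 205
```

The three totals agree at 352,423, and the per-field empty counts exceed the removed-row count by 205, which indicates that some rows were empty in more than one field.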
Provider:
vlhandfo