cometadata/affiliation-disambiguation-triplets

Source: Hugging Face. Updated 2026-01-02; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cometadata/affiliation-disambiguation-triplets
Resource description:
---
language:
- en
- multilingual
license: cc0-1.0
task_categories:
- sentence-similarity
- text-classification
tags:
- affiliations
- ror
- triplet-loss
- contrastive-learning
- curriculum-learning
pretty_name: Affiliation Triplets for Embedding Training
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: anchor
    dtype: string
  - name: anchor_ror_id
    dtype: string
  - name: positive
    dtype: string
  - name: positive_ror_id
    dtype: string
  - name: negative
    dtype: string
  - name: negative_ror_id
    dtype: string
  - name: positive_similarity
    dtype: float64
  - name: negative_similarity
    dtype: float64
  - name: difficulty
    dtype: float64
  - name: negative_type
    dtype: string
  - name: triplet_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 495130870
    num_examples: 1083631
  download_size: 221175576
  dataset_size: 495130870
---

# Affiliation Triplets for Contrastive Learning

This dataset contains 1,083,631 triplets (anchor, positive, negative) for training affiliation embedding and reranking models using triplet loss or contrastive learning.

## Dataset Description

Each triplet consists of:

- Anchor: An affiliation string from OpenAlex
- Positive: A different affiliation string for the same organization (same ROR ID)
- Negative: An affiliation string for a different organization

The dataset is sorted by difficulty in descending order, so the easiest triplets (largest similarity gap) come first, to support curriculum learning.
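As a rough illustration of how one (anchor, positive, negative) triplet feeds a triplet loss, the sketch below computes a cosine-similarity hinge loss on toy 2-dimensional vectors. The function names, the margin value, and the vectors are assumptions made for this example; they are not taken from the dataset or from any training code released with it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on the similarity gap.

    The loss is zero once the positive is more similar to the anchor
    than the negative by at least `margin` (a hypothetical value here).
    """
    gap = cosine(anchor, positive) - cosine(anchor, negative)
    return max(0.0, margin - gap)


# Toy embeddings standing in for model outputs (illustrative only).
anchor = [1.0, 0.0]
positive = [0.8, 0.6]
easy_negative = [0.0, 1.0]   # far from the anchor
hard_negative = [0.6, 0.8]   # close to the anchor

loss_easy = triplet_margin_loss(anchor, positive, easy_negative)
loss_hard = triplet_margin_loss(anchor, positive, hard_negative)
print(f"easy: {loss_easy:.2f}, hard: {loss_hard:.2f}")
```

With these toy vectors the easy negative contributes no loss while the hard negative does, which is the intuition behind ordering triplets from easy to hard in a curriculum.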

## Schema

| Field | Type | Description |
|-------|------|-------------|
| `triplet_id` | int | Sequential ID (1-indexed), sorted by difficulty |
| `anchor` | string | The anchor affiliation text |
| `anchor_ror_id` | string | ROR ID of the anchor affiliation |
| `positive` | string | Positive affiliation (same org as anchor) |
| `positive_ror_id` | string | ROR ID of positive (same as anchor) |
| `negative` | string | Negative affiliation (different org) |
| `negative_ror_id` | string | ROR ID of negative (different from anchor) |
| `positive_similarity` | float | Cosine similarity between anchor and positive embeddings |
| `negative_similarity` | float | Cosine similarity between anchor and negative embeddings |
| `difficulty` | float | `positive_similarity - negative_similarity` (higher = easier) |
| `negative_type` | string | `"hard"` (from API candidates) or `"easy"` (random) |

## Statistics

| Metric | Value |
|--------|-------|
| Total triplets | 1,083,631 |
| Hard negatives | 592,300 (54.7%) |
| Easy negatives | 491,331 (45.3%) |
| Difficulty range | 0.00 - 1.19 |
| Mean difficulty | 0.26 |
| Mean positive similarity | 0.50 |
| Mean negative similarity | 0.24 |

## Data Pipeline

This dataset was created through a multi-stage pipeline starting from OpenAlex affiliation data. It begins by loading affiliation strings from OpenAlex that have been assigned ROR IDs, along with sample weights indicating how frequently each affiliation appears.

Next, each affiliation undergoes validation against ROR data. We verify that any country mentioned in the affiliation text matches the country in the ROR record, and that the organization's name actually appears somewhere in the affiliation string. This filtering removes assignments that are potentially incorrect.

Validated affiliations then go through a matching stage where we query the ROR affiliation matching API.
We confirm that the ROR ID assigned by the API is the same as the one assigned in OpenAlex, and we collect the other candidate ROR IDs that the API returned. These candidates represent organizations that could plausibly be confused with the correct one, so they become our hard negatives.

With confirmed matches in hand, we generate embeddings for all unique affiliation texts using [SIRIS-Lab/affilgood-dense-retriever](https://huggingface.co/SIRIS-Lab/affilgood-dense-retriever), producing 1024-dimensional vectors that are pre-normalized for computing cosine similarity.

We then construct the training examples. For each anchor affiliation, we select hard negatives by finding the most similar affiliation from each candidate ROR ID, and easy negatives by randomly sampling from organizations not in the candidate set. For each negative, we find a positive (another affiliation for the same organization) that has higher similarity to the anchor than the negative does, to ensure the triplet provides a valid learning signal.

Finally, we sort all triplets by difficulty in descending order and assign sequential IDs. Difficulty is computed as the gap between positive and negative similarity scores. Higher difficulty means the positive is much more similar than the negative (an easy example); lower difficulty means they're close (a hard example).
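The last two pipeline steps (pairing each negative with a valid positive, then sorting by difficulty and numbering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the helper names `build_triplets` and `sort_and_number`, the affiliation strings, and the 2-dimensional unit vectors are all made up, and choosing the most similar qualifying positive is an assumption, since the description only requires that the positive be more similar to the anchor than the negative.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_triplets(anchor_text, anchor_vec, positives, negatives):
    """Build triplets for one anchor (hypothetical helper).

    positives: list of (text, unit_vector)
    negatives: list of (text, unit_vector, negative_type)
    """
    triplets = []
    for neg_text, neg_vec, neg_type in negatives:
        neg_sim = dot(anchor_vec, neg_vec)
        # Keep only positives that are closer to the anchor than this negative.
        scored = [(dot(anchor_vec, vec), text) for text, vec in positives]
        valid = [(sim, text) for sim, text in scored if sim > neg_sim]
        if not valid:
            continue  # no positive yields a valid learning signal
        pos_sim, pos_text = max(valid)  # assumption: take the most similar one
        triplets.append({
            "anchor": anchor_text,
            "positive": pos_text,
            "negative": neg_text,
            "positive_similarity": pos_sim,
            "negative_similarity": neg_sim,
            "difficulty": pos_sim - neg_sim,  # higher = easier
            "negative_type": neg_type,
        })
    return triplets


def sort_and_number(triplets):
    """Sort by difficulty (descending) and assign 1-indexed sequential IDs."""
    ordered = sorted(triplets, key=lambda t: t["difficulty"], reverse=True)
    for i, triplet in enumerate(ordered, start=1):
        triplet["triplet_id"] = i
    return ordered


# Made-up affiliation strings and unit vectors, for illustration only.
positives = [
    ("Dept. of Physics, Example Univ.", [0.8, 0.6]),
    ("Example University", [0.96, 0.28]),
]
negatives = [
    ("Example Institute of Technology", [0.6, 0.8], "hard"),
    ("Unrelated College", [0.0, 1.0], "easy"),
    ("Example University Hospital", [1.0, 0.0], "hard"),  # too close: skipped
]
triplets = sort_and_number(
    build_triplets("Example Univ", [1.0, 0.0], positives, negatives)
)
for t in triplets:
    print(t["triplet_id"], t["negative"], round(t["difficulty"], 2))
```

In this toy run the third negative is as similar to the anchor as any available positive, so no triplet is emitted for it; the remaining two are numbered from easiest to hardest.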

## Negative Types

- Hard negatives (54.7%): From ROR API candidate results. These are organizations that the API considered similar to the anchor, making them challenging negatives.
- Easy negatives (45.3%): Random organizations not in the API candidates. These provide baseline contrast.

## Related Datasets

- [cometadata/triplet-loss-affiliation-intermediates](https://huggingface.co/datasets/cometadata/triplet-loss-affiliation-intermediates): Validated affiliations and confirmed matches
- [cometadata/triplet-loss-affiliation-embeddings](https://huggingface.co/datasets/cometadata/triplet-loss-affiliation-embeddings): Pre-computed embeddings

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{affiliation_triplets_2025,
  title={Affiliation Triplets for Embedding Training},
  author={CoMetaData},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/cometadata/affiliation-disambiguation-triplets}
}
```

## License

CC0-1.0
Provider:
cometadata