cp500/multilingual-automotive-sparse

Name: cp500/multilingual-automotive-sparse
Creator: cp500
Published: 2026-04-27 05:21:55
License: apache-2.0

Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-05-03

Download link:

https://hf-mirror.com/datasets/cp500/multilingual-automotive-sparse

Resource overview:

---
license: apache-2.0
task_categories:
- sentence-similarity
- text-retrieval
language:
- en
- ja
- ko
size_categories:
- 10K<n<100K
tags:
- splade
- sparse-retrieval
- cross-lingual
- automotive
- supply-chain
pretty_name: Multilingual Automotive Sparse Retrieval Corpus
---

# Multilingual Automotive Sparse Retrieval Corpus

A synthetic training corpus for fine-tuning multilingual sparse retrieval models (SPLADE-family) on automotive, supply-chain, and geopolitics content. Designed to teach a model cross-lingual alignment: a Japanese query should retrieve the right English passage, and vice versa.

## Corpus contents

| File | Rows | Purpose |
|------|------|---------|
| `concepts.jsonl` | 4995 | Raw concept records |
| `train_triplets.jsonl` | 134880 | Flattened (query, positive, negative) triplets |
| `eval_pairs.jsonl` | 4491 | Held-out 10% for retrieval evaluation |

## Concept record schema (`concepts.jsonl`)

```json
{
  "query":          {"en": str, "ja": str, "ko": str},
  "positive":       {"en": str, "ja": str, "ko": str},
  "hard_negatives": [{"en": str, "ja": str, "ko": str}, × 5],
  "easy_negative":  {"en": str},
  "_meta": {
    "concept_id": str,
    "category": str,
    "tags": [str],
    "subject": str,
    "predicate": str,
    "object": str,
    "model": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
    "usage": {"input_tokens": int, "output_tokens": int, ...}
  }
}
```

## Training triplet schema (`train_triplets.jsonl`)

```json
{
  "query": str,
  "positive": str,
  "negative": str,
  "query_lang": "en"|"ja"|"ko",
  "positive_lang": "en"|"ja"|"ko",
  "negative_lang": "en"|"ja"|"ko",
  "concept_id": str,
  "category": str,
  "negative_kind": "hard"|"easy",
  "negative_rank": int
}
```

Each concept fans out into up to 30 training triplets covering all 9 query×positive language-pair combinations, with hard negatives preferentially matching the query language.
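The `query_lang` and `positive_lang` fields make it easy to separate cross-lingual triplets from same-language ones. The following is a minimal sketch, assuming rows follow the training triplet schema described above. The helper names `language_pair_counts` and `cross_lingual_only` and the sample rows are hypothetical: they are not part of the dataset or of any library, and the placeholder values (including `negative_rank`) are made up.

```python
from collections import Counter
from itertools import product

LANGS = ("en", "ja", "ko")

# All query x positive language combinations: 3 x 3 = 9.
ALL_PAIRS = list(product(LANGS, repeat=2))


def language_pair_counts(triplets):
    """Count triplets per (query_lang, positive_lang) pair."""
    return Counter((t["query_lang"], t["positive_lang"]) for t in triplets)


def cross_lingual_only(triplets):
    """Keep triplets whose query and positive are in different languages."""
    return [t for t in triplets if t["query_lang"] != t["positive_lang"]]


# Hypothetical placeholder rows (made-up text, not taken from the corpus).
sample = [
    {"query": "q-en", "positive": "p-ja", "negative": "n-en",
     "query_lang": "en", "positive_lang": "ja", "negative_lang": "en",
     "concept_id": "c0", "category": "automotive_mobility",
     "negative_kind": "hard", "negative_rank": 1},
    {"query": "q-ja", "positive": "p-ja", "negative": "n-ja",
     "query_lang": "ja", "positive_lang": "ja", "negative_lang": "ja",
     "concept_id": "c0", "category": "automotive_mobility",
     "negative_kind": "hard", "negative_rank": 2},
    {"query": "q-ko", "positive": "p-en", "negative": "n-en",
     "query_lang": "ko", "positive_lang": "en", "negative_lang": "en",
     "concept_id": "c1", "category": "energy_commodities",
     "negative_kind": "easy", "negative_rank": 1},
]

counts = language_pair_counts(sample)
cross = cross_lingual_only(sample)
print(len(ALL_PAIRS), dict(counts), len(cross))
```

The same two helpers can be applied to rows streamed from `train_triplets.jsonl`, for example to check how evenly the 9 language pairs are represented before training.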
## Category distribution

| category | count |
|----------|-------|
| energy_commodities | 850 |
| pharma_biotech | 843 |
| semiconductors_hardware | 842 |
| geopolitics_defense | 840 |
| automotive_mobility | 828 |
| finance_capital_markets | 792 |

## Generation method

Records were synthesized via **Anthropic Claude Haiku 4.5** on AWS Bedrock, using a deterministic stratified sample of 2500 concept seeds drawn from 8 automotive / supply-chain / geopolitics scenarios. The model was prompted to produce faithful translations across EN/JA/KO with native hard-negative phrasing (not translated-from-English).

### Hard negative taxonomy

Each concept has 5 hard negatives, deliberately varied:

1. Same entity / different action
2. Similar action / different entity
3. Same region / unrelated industry
4. Similar entity + topic / different polarity or timeframe
5. Adjacent domain

## Usage

```python
from datasets import load_dataset

# Full corpus
ds = load_dataset("cp500/multilingual-automotive-sparse")

# Training triplets only
# (with a single data_files path, rows are exposed under the default "train" split)
triplets = load_dataset("cp500/multilingual-automotive-sparse", data_files="train_triplets.jsonl")

# Held-out eval pairs
eval_ = load_dataset("cp500/multilingual-automotive-sparse", data_files="eval_pairs.jsonl")
```

## Intended use

- Fine-tuning multilingual SPLADE / sparse retrieval models (MarginMSE, contrastive loss, in-batch negatives).
- Cross-lingual retrieval research: measure whether models learn lexical-vocab-space alignment across scripts.
- FLOPS-regularization experiments in the automotive-intelligence domain.

## Limitations

- All passages are **synthetic**: companies, dollar amounts, and events are plausible but fictional. Not suitable for factual question answering.
- Machine-generated CJK may have occasional register issues; native-speaker review has not been performed.
- Domain is narrow (automotive / supply chain / geopolitics). Good performance on out-of-domain text will likely require additional corpora.

## License

Apache 2.0.

Generated with Anthropic Claude. Users should review Anthropic's [Acceptable Use Policy](https://www.anthropic.com/legal/aup) for downstream applications.

## Citation

```bibtex
@misc{cp500-multilingual-automotive-sparse-2026,
  author    = {Charles P},
  title     = {Multilingual Automotive Sparse Retrieval Corpus},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/cp500/multilingual-automotive-sparse}
}
```
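The intended-use list mentions FLOPS-regularization experiments. As a rough sketch, assuming the definition commonly used in the SPLADE literature (the sum over vocabulary terms of the squared mean absolute activation across a batch), the penalty could be computed as below. The function name `flops_regularizer` and the toy vectors are hypothetical; a real training loop would compute this on model outputs in a tensor library rather than on Python lists.

```python
from typing import Sequence


def flops_regularizer(batch: Sequence[Sequence[float]]) -> float:
    """Sum over vocabulary dimensions of the squared mean absolute activation."""
    if not batch:
        raise ValueError("batch must contain at least one vector")
    n = len(batch)
    vocab_size = len(batch[0])
    total = 0.0
    for j in range(vocab_size):
        # Mean absolute weight of term j across the batch.
        mean_j = sum(abs(vec[j]) for vec in batch) / n
        total += mean_j ** 2
    return total


# Toy sparse vectors over a 4-term vocabulary (made-up values).
same_term = [[2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
different_terms = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]

print(flops_regularizer(same_term))        # 4.0
print(flops_regularizer(different_terms))  # 2.0
```

In this toy example the batch in which both vectors activate the same term is penalized more than the batch whose activations fall on different terms, which is the behavior that pushes term usage to spread out across the vocabulary.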

Provided by:

cp500
