
SIRIS-Lab/impuls-wikidata-kb

Hugging Face | Updated 2025-12-10 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/SIRIS-Lab/impuls-wikidata-kb
Resource description:
---
license: apache-2.0
language:
- ca
- es
- en
- it
task_categories:
- feature-extraction
- text-generation
tags:
- knowledge-base
- wikidata
- multilingual
- R&D
- query-expansion
- semantic-search
- catalan
- spanish
- italian
- AINA
size_categories:
- 1K<n<10K
---

# IMPULS R&D Knowledge Base

A multilingual knowledge base of 4,265 R&D concepts derived from Wikidata, designed for query expansion in scientific and research project search systems.

## Dataset Description

This knowledge base was created as part of the **IMPULS project** (AINA Challenge 2024), a collaboration between [SIRIS Academic](https://sirisacademic.com/) and [Generalitat de Catalunya](https://web.gencat.cat/) to build a multilingual semantic search system for R&D ecosystems.

The KB contains scientific and technological concepts with:

- **Multilingual labels** in Catalan, Spanish, English, and Italian
- **Aliases/synonyms** for each language
- **Definitions** where available
- **Hierarchical relationships** (instance_of, subclass_of) linking to Wikidata

### Use Cases

- **Query Expansion**: Expand search queries with synonyms and related terms across languages
- **Multilingual Search**: Find equivalent terms across CA/ES/EN/IT
- **Concept Navigation**: Traverse hierarchical relationships for broader/narrower terms
- **Named Entity Linking**: Link mentions to Wikidata identifiers

## Dataset Structure

### Format

JSONL (JSON Lines) - one concept per line.

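Because the format is JSON Lines, a local copy of the file can also be read with the standard library alone. The following is a minimal sketch, not the project's own loader: the helper names `parse_jsonl` and `read_jsonl` are hypothetical, and the file name in the usage comment is a placeholder, since the actual file name in the repository is not stated here.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


def read_jsonl(path):
    """Yield one concept per line from a JSON Lines file on disk."""
    with Path(path).open(encoding="utf-8") as handle:
        yield from parse_jsonl(handle)


# Usage (placeholder path, assumed to be a local copy of the KB file):
# concepts = list(read_jsonl("kb.jsonl"))
```

Reading line by line keeps memory use flat and makes it easy to stop early or filter while streaming.
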
### Schema

```json
{
  "keyword": "machine learning",
  "wikidata_id": "Q2539",
  "languages": {
    "ca": {
      "label": "aprenentatge automàtic",
      "description": "branca de la intel·ligència artificial",
      "also_known_as": ["aprenentatge de màquines", "ML"]
    },
    "es": {
      "label": "aprendizaje automático",
      "description": "rama de la inteligencia artificial",
      "also_known_as": ["aprendizaje de máquina", "ML"]
    },
    "en": {
      "label": "machine learning",
      "description": "branch of artificial intelligence",
      "also_known_as": ["ML", "statistical learning"]
    },
    "it": {
      "label": "apprendimento automatico",
      "description": "ramo dell'intelligenza artificiale",
      "also_known_as": []
    }
  },
  "instance_of": [
    {"id": "Q11660", "label": "artificial intelligence"}
  ],
  "subclass_of": [
    {"id": "Q11660", "label": "artificial intelligence"},
    {"id": "Q816264", "label": "computational learning theory"}
  ]
}
```

### Field Descriptions

| Field | Type | Description |
|-------|------|-------------|
| `keyword` | string | Primary English keyword |
| `wikidata_id` | string | Wikidata entity ID (Q-number) |
| `languages` | object | Multilingual labels, descriptions, and aliases |
| `languages.{lang}.label` | string | Primary label in that language |
| `languages.{lang}.description` | string | Short description/definition |
| `languages.{lang}.also_known_as` | array | Alternative names/synonyms |
| `instance_of` | array | Wikidata instance_of relations |
| `subclass_of` | array | Wikidata subclass_of relations (for hierarchy traversal) |

## Statistics

| Metric | Value |
|--------|-------|
| Total concepts | 4,265 |
| With Catalan labels | ~4,200 |
| With Spanish labels | ~4,250 |
| With English labels | 4,265 |
| With Italian labels | ~4,100 |
| With subclass_of relations | ~3,590 |
| Unique parent concepts | ~770 (in KB) |

### Domain Coverage

The KB focuses on R&D-relevant concepts including:

- **Technology**: AI, blockchain, IoT, robotics, quantum computing
- **Science**: biotechnology, nanotechnology, materials science
- **Health**: medical devices, diagnostics, pharmaceuticals
- **Energy**: renewables, hydrogen, energy storage
- **Environment**: climate, sustainability, circular economy
- **Industry**: manufacturing, automation, Industry 4.0

## Examples

### Example 1: Technology Concept

```json
{
  "keyword": "blockchain",
  "wikidata_id": "Q20514253",
  "languages": {
    "ca": {
      "label": "cadena de blocs",
      "description": "estructura de dades distribuïda",
      "also_known_as": ["blockchain"]
    },
    "es": {
      "label": "cadena de bloques",
      "description": "base de datos distribuida",
      "also_known_as": ["blockchain"]
    },
    "en": {
      "label": "blockchain",
      "description": "distributed database technology",
      "also_known_as": ["block chain", "distributed ledger"]
    }
  },
  "subclass_of": [
    {"id": "Q8513", "label": "database"}
  ]
}
```

### Example 2: Health Concept

```json
{
  "keyword": "patient",
  "wikidata_id": "Q181600",
  "languages": {
    "ca": {
      "label": "pacient",
      "description": "",
      "also_known_as": []
    },
    "es": {
      "label": "paciente",
      "description": "persona que recibe tratamiento para un problema de salud",
      "also_known_as": ["pacientes", "enfermo"]
    },
    "en": {
      "label": "patient",
      "description": "person who takes a medical treatment",
      "also_known_as": ["patients", "medical patient", "human patient"]
    }
  },
  "instance_of": [
    {"id": "Q214339", "label": "role"}
  ],
  "subclass_of": [
    {"id": "Q12722854", "label": "sick person"},
    {"id": "Q852835", "label": "customer"}
  ]
}
```

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("SIRIS-Lab/impuls-wikidata-kb")
kb = dataset["train"]
print(f"Loaded {len(kb)} concepts")
```

### Query Expansion Example

```python
def find_concept(kb, query):
    """Find a concept by its keyword or by its label in any supported language."""
    query_lower = query.lower()
    for concept in kb:
        if concept["keyword"].lower() == query_lower:
            return concept
        for lang in ["en", "es", "ca", "it"]:
            # A language entry may be missing or null, so fall back to an empty dict.
            lang_data = concept["languages"].get(lang) or {}
            if (lang_data.get("label") or "").lower() == query_lower:
                return concept
    return None


def get_expansions(concept):
    """Get all labels and aliases for a concept."""
    expansions = set()
    for lang_data in concept["languages"].values():
        if not lang_data:
            continue  # no entry for this language
        if lang_data.get("label"):
            expansions.add(lang_data["label"])
        for alias in lang_data.get("also_known_as") or []:
            expansions.add(alias)
    return expansions


# Example
concept = find_concept(kb, "machine learning")
if concept:
    print(f"Wikidata ID: {concept['wikidata_id']}")
    print(f"Expansions: {get_expansions(concept)}")
    # Expansions include: 'machine learning', 'ML', 'aprenentatge automàtic', 'aprendizaje automático', ...
```

### Building a Lookup Index

```python
def build_kb_index(kb):
    """Build wikidata_id -> concept index for fast parent lookup."""
    return {concept["wikidata_id"]: concept for concept in kb}


kb_index = build_kb_index(kb)

# Get parent concepts
concept = find_concept(kb, "deep learning")
if concept:  # find_concept returns None when nothing matches
    for parent in concept.get("subclass_of") or []:
        parent_concept = kb_index.get(parent["id"])
        if parent_concept:
            print(f"Parent: {parent_concept['keyword']}")
```

## Data Collection

The knowledge base was built by:

1. **Seed Selection**: Identifying R&D-relevant concepts from project databases (RIS3CAT, OpenAIRE, CORDIS)
2. **Wikidata Extraction**: Querying the Wikidata API for each concept's labels, aliases, and relations
3. **Multilingual Enrichment**: Ensuring coverage across CA/ES/EN/IT
4. **Hierarchy Validation**: Filtering subclass_of relations to include only parents present in the KB
5. **Quality Control**: Manual review of key domain concepts

## Integration with IMPULS

This KB is used by the [IMPULS Query Parser](https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser) for:

- **Query Expansion**: Adding multilingual synonyms to search queries
- **Cross-lingual Search**: Finding Spanish-language projects with Catalan queries
- **Concept Navigation**: Broadening searches via parent concepts

## Limitations

- **Domain Focus**: Optimized for R&D/scientific concepts; general vocabulary coverage is limited
- **Language Coverage**: Best coverage in English; some concepts may lack labels in other languages
- **Temporal Snapshot**: Based on Wikidata as of late 2024; may not reflect recent additions
- **Hierarchy Depth**: Only direct parents (subclass_of) are included; the transitive closure is not computed

## Citation

```bibtex
@misc{impuls-wikidata-kb-2024,
  author = {SIRIS Academic},
  title = {IMPULS R\&D Knowledge Base: Multilingual Wikidata Concepts for Query Expansion},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/SIRIS-Lab/impuls-wikidata-kb}}
}
```

## Acknowledgments

- **[Wikidata](https://www.wikidata.org/)** - Source knowledge graph
- **[Barcelona Supercomputing Center (BSC)](https://www.bsc.es/)** - AINA project infrastructure
- **[Generalitat de Catalunya](https://web.gencat.cat/)** - Funding and RIS3-MCAT platform

## License

Apache 2.0

## Related Resources

- **Query Parser Model**: [SIRIS-Lab/impuls-salamandra-7b-query-parser](https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser)
- **Query Parsing Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Project Repository**: [github.com/sirisacademic/aina-impulse](https://github.com/sirisacademic/aina-impulse)
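
The Limitations section notes that only direct `subclass_of` parents are stored. Broader ancestors can be collected at query time by walking those links. The sketch below assumes records follow the schema described in this card and that an index maps `wikidata_id` to concept. `ancestors` is a hypothetical helper, not part of the IMPULS code, and the toy records use made-up IDs rather than real dataset entries.

```python
from collections import deque


def ancestors(kb_index, wikidata_id):
    """Collect every ancestor ID reachable through subclass_of links.

    Breadth-first walk over an index of wikidata_id -> concept. Parents
    that are absent from the index are reported but not expanded further.
    """
    seen = set()
    queue = deque([wikidata_id])
    while queue:
        concept = kb_index.get(queue.popleft())
        if concept is None:
            continue  # parent is not in the KB, nothing to expand
        for parent in concept.get("subclass_of") or []:
            parent_id = parent["id"]
            # The seen set guards against cycles in the hierarchy.
            if parent_id != wikidata_id and parent_id not in seen:
                seen.add(parent_id)
                queue.append(parent_id)
    return seen


# Toy index with made-up IDs, including a cycle back to the start node.
toy_index = {
    "QX1": {"wikidata_id": "QX1", "subclass_of": [{"id": "QX2", "label": "mid"}]},
    "QX2": {"wikidata_id": "QX2", "subclass_of": [{"id": "QX3", "label": "top"}]},
    "QX3": {"wikidata_id": "QX3", "subclass_of": [{"id": "QX1", "label": "leaf"}]},
}
print(sorted(ancestors(toy_index, "QX1")))  # ['QX2', 'QX3']
```

The resulting IDs can be mapped back to labels through the same index to broaden a query beyond its direct parents.
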
Provider:
SIRIS-Lab