Hebbelille/Norwegian-Synthetic-HR-data-v-1

Name: Hebbelille/Norwegian-Synthetic-HR-data-v-1
Creator: Hebbelille
Published: 2025-12-06 21:14:13
License: apache-2.0

Source: Hugging Face
Updated: 2025-12-06
Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/Hebbelille/Norwegian-Synthetic-HR-data-v-1

Description:

---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- 'no'
- nb
tags:
- synthetic
size_categories:
- 1K<n<10K
---

# Synthetic Norwegian public sector HR dataset

## Dataset description

This dataset contains **4,000 rows** of synthetic instructional data focused on Human Resources (HR) topics within the Norwegian public sector.

The license for the dataset follows the license of the LLMs used to generate the data. Users are advised to review the specific terms associated with the source models before use.

The dataset includes Chain of Thought **(CoT)** reasoning traces and was generated using a multi-model approach to ensure stylistic diversity.

* **Language:** Norwegian (Bokmål)
* **Total size:** 4,000 rows
* **Source generation models:**
    * `OSS-GPT:20b` ([openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)) - 2,000 rows
    * `Magistral-Small-2509-Q5` ([mistralai/Magistral-Small-2509-GGUF](https://huggingface.co/mistralai/Magistral-Small-2509-GGUF)) - 2,000 rows
* **Method:**
    * Multi-step synthetic generation with persona injection, language validation, and reasoning extraction.
    * The data generation process was inspired by [this process (on X)](https://x.com/QuixiAI/status/1927046460966641833?s=20) by [Eric Hartford (X)](https://x.com/QuixiAI?s=20).

## Dataset structure

The dataset is provided in **JSON Lines (.jsonl)** format. Each entry represents a unique training sample containing the user's query, the model's internal reasoning, and the final response.

### Data fields

* `id`: (numeric) Unique identifier.
* `topic`: (string) The specific HR sub-theme (e.g., "Endringsledelse").
* `context_generated`: (string) A generated setting description.
* `persona`: (string) The archetype adopted by the user (e.g., "Frustrert mellomleder").
* `ground_truth_scenario`: (string) A generated grounded scenario the model used to generate the query.
* `instruction`: (string) The user's question or request.
* `thought_process`: (string) The Chain of Thought: the model's step-by-step reasoning before answering.
* `output`: (string) The final, helpful response provided to the user.
* `source`: (string) Identifies which model generated the sample row (`OSS-GPT:20b` or `Magistral-Small-2509-Q5`).

## Data distribution & quality

This dataset is explicitly engineered to avoid "model collapse" and stylistic monotony by leveraging two distinct model architectures.

### 1. Stylistic diversity

The dataset offers a mix of response styles, which makes it robust for instruction tuning:

* **Conversational & Supportive (Magistral):** The `Magistral` subset averages 250–300 words and typically adopts a personal, email-style format (e.g., "Hei [Navn]"). It focuses on empathetic, coaching-style advice suited for direct communication.
* **Structured & Analytical (OSS-GPT):** The `OSS-GPT:20b` subset averages ~550 words and favors highly structured outputs using Markdown tables, step-by-step action plans, and bold headers. It provides comprehensive, report-style guidance.
* **Benefit:** Models trained on this dataset learn to handle both "quick fix" queries and complex scenarios requiring nuance.

### 2. Zero topic bias

Analysis of the dataset shows a near-perfect balance in topic distribution. Both source models have generated content across all 40 HR topics (from *Psychological Safety* to *Whistleblowing*) with equal frequency. This ensures the dataset does not favor one model for "hard" topics and another for "soft" topics.

### 3. Some quality metrics

* **Average Words per Response:** 421.5
* **Average LIX Value (readability score):** 45.1 (indicates a professional/factual text level suitable for HR contexts)

About the LIX value: http://www.iva.dk/bh/core%20concepts%20in%20lis/articles%20a-z/readability.htm

### Model comparison

| Source Model | Avg. Word Count | Avg. LIX Score |
| :--- | :---: | :---: |
| **Magistral-Small-2509-Q5** | 280.3 | 43.7 |
| **OSS-GPT:20b** | 562.7 | 46.4 |

![quality_word_count_distribution](https://cdn-uploads.huggingface.co/production/uploads/647e106811084fb5831b3695/fBMvT4nz-DqCSJ11ivXVD.png)

![topic_distribution](https://cdn-uploads.huggingface.co/production/uploads/647e106811084fb5831b3695/Qg_VhkfPm_gP76aNP5VnC.png)

## Content overview

### 1. Persona diversity (voice & tone)

The dataset uses 40 distinct personas, split into two groups, so that the model learns to handle different emotional states and professional levels:

* **Troubleshooting personas (20 personas):** Focus on friction, anxiety, and conflict.
    * *e.g.:* "Cynical Department Leader", "Stressed Middle Manager", "Concerned Union Rep".
* **Growth & excellence personas (20 personas):** Focus on optimization, strategy, and modernization.
    * *e.g.:* "Tech-Optimist (AI focus)", "Visionary Director", "Agile Coach".

### 2. Topic breadth

The dataset covers a modern spectrum of 40 different HR responsibilities:

* **Core HR:** *e.g.:* Conflict management, Sick leave, Onboarding, Internal communication.
* **Leadership:** *e.g.:* Psychological safety, Competency development, Trust leadership.
* **Strategy:** *e.g.:* AI in HR, Digital Transformation, Employer Branding, HR Analytics.

## Generation process

The data was generated using two Python pipelines, one per model, with the following logic:

1. **Taxonomy expansion:** High-level topics are dynamically expanded into specific, non-legal HR challenges.
2. **Contextualization:** Semantic generation of organizational contexts.
3. **Persona injection:** Selection of a persona from the Troubleshooting or Growth lists.
4. **Multi-model generation:** The prompt is sent to the model (`OSS-GPT` or `Magistral`) to generate the instruction and response.
5. **Reasoning extraction:** The model is prompted to "think step-by-step" (`thought_process`) before generating the final answer.
6. **Merging:** The two per-model datasets were merged, and source model attribution was added.

A separate pipeline script was used for each model so that generation could be tailored to that model's specific needs and behavior.

## Intended use

* **Research and educational purposes only:** This dataset is intended only for research and educational use, with the aim of enabling small language models to respond in Norwegian.
* **Fine-tuning reasoning models:** The inclusion of `thought_process` makes this dataset well suited to training models to "think" before they speak in Norwegian.
* **Style robustness:** Training models to be adaptable in length and detail level.

## Limitations & Disclaimer

**This dataset should not be used in production-ready systems.**

**Synthetic data warning:** This dataset is 100% synthetic and generated by LLMs.

* It has **not** been verified by human HR professionals or lawyers.
* **Production use:** This dataset is intended for research and educational purposes only. It is **not** intended for use in production-level systems.
* It may contain factual inaccuracies regarding Norwegian work-life regulations.
* The "advice" provided in the dataset is for **experimental training purposes only** and should not be used as actual HR or legal guidance.
* The generation scripts explicitly instructed the models **not** to generate data pertaining to legal aspects of any kind, as these kinds of questions or requests require professional knowledge and expertise that LLMs are unable to provide. If traces of legal content remain, they should **not** be considered factually correct and must **not** be used as advice in a real-world setting.
* The creator of this synthetic dataset is **not** responsible for any end products, models, or advice produced using this dataset.
* The dataset is provided "as is," without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event shall the creator be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the dataset or the use or other dealings in the dataset.
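
## Example: recomputing the quality metrics

The per-model word counts and LIX scores reported under "Model comparison" can be approximated with a short script. The following is a minimal sketch, not the author's script: it assumes the metrics were computed on the `output` field, uses the standard LIX formula (words per sentence plus the percentage of words longer than six letters), and splits sentences on `.`, `!`, `?` and `:`, which is a simplification (abbreviations such as "f.eks." will inflate the sentence count). The function names and the sample rows are made up for illustration and are not taken from the dataset.

```python
import json
import re
from collections import defaultdict

# Runs of letters only; \w minus digits and underscore, so æ, ø, å are kept.
WORD_RE = re.compile(r"[^\W\d_]+")
# Simplified sentence boundary: one or more of . ! ? :
SENTENCE_END_RE = re.compile(r"[.!?:]+")


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words (long = > 6 letters)."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    n_sentences = max(1, len(sentences))
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / n_sentences + 100.0 * long_words / len(words)


def summarize_by_source(jsonl_lines):
    """Average word count and LIX of `output`, grouped by the `source` field."""
    totals = defaultdict(lambda: {"rows": 0, "words": 0, "lix": 0.0})
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        row = json.loads(line)
        bucket = totals[row["source"]]
        bucket["rows"] += 1
        bucket["words"] += len(WORD_RE.findall(row["output"]))
        bucket["lix"] += lix(row["output"])
    return {
        source: {
            "rows": b["rows"],
            "avg_words": b["words"] / b["rows"],
            "avg_lix": b["lix"] / b["rows"],
        }
        for source, b in totals.items()
    }


# Made-up rows for illustration only; real rows carry all nine fields.
sample_lines = [
    json.dumps({"id": 1, "source": "OSS-GPT:20b",
                "output": "Hei. Dette er en kort test."}),
    "",
    json.dumps({"id": 2, "source": "Magistral-Small-2509-Q5",
                "output": "Medarbeidersamtalen gikk bra."}),
]

stats = summarize_by_source(sample_lines)
for source, values in stats.items():
    print(source, values)
```

To run this against the real data, pass an open file handle instead of `sample_lines`, for example `summarize_by_source(open(path, encoding="utf-8"))`, where `path` points to the downloaded `.jsonl` file (the file name is not given in the card). Results will only match the published figures if the original tokenization and sentence splitting were similar.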
