
ShanzaGull/world_Top_leaders_Dataset

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/ShanzaGull/world_Top_leaders_Dataset
Resource summary:
---
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- history
- world-leaders
- chatml
- synthetic
license: mit
---

# Dataset Card: World Leaders ChatML Dataset

## Dataset Description

The **World Leaders ChatML Dataset** is a synthetically generated dataset comprising approximately 2,000 question-and-answer pairs focused on the lives, accomplishments, and historical significance of 15 highly influential world leaders. This dataset is specifically designed for instruction-tuning and fine-tuning Large Language Models (LLMs), such as TinyLlama, Llama 2, or Mistral, using the standard **ChatML** conversational format.

### Supported Tasks

- **Instruction Fine-Tuning**: Training models to respond to historical and biographical queries.
- **Conversational QA**: Building AI assistants that act as historical experts or educational tutors.

## Dataset Structure

### Format

The dataset is provided in a `.jsonl` (JSON Lines) format. Each line represents a single conversation/turn formatted according to the ChatML standard.

### Data Fields

- `messages`: A list of dictionaries, where each dictionary represents a turn in the conversation.
  - `role`: The speaker's role. Can be either `"user"` (asking the question) or `"assistant"` (providing the answer).
  - `content`: The actual text of the question or the answer.

### Example Entry

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Can you explain what Nelson Mandela did regarding a second presidential term?"
    },
    {
      "role": "assistant",
      "content": "Based on the text, Nelson Mandela was significantly involved with a second presidential term. This section details their specific actions and historical impact."
    }
  ]
}
```

*(Note: The current answers are placeholders/mock data designed to test the pipeline structure. They should be replaced by high-quality LLM-generated answers before production training.)*

## Dataset Creation

### Source Data

The raw text was automatically scraped from the official **Wikipedia** pages of the selected leaders using a custom Python pipeline (`world_leaders_scrap.py`).

### Leaders Included

To maintain a balanced dataset, approximately 133 rows were generated for each of the following 15 leaders:

1. Nelson Mandela
2. Winston Churchill
3. Abraham Lincoln
4. Mahatma Gandhi
5. Franklin D. Roosevelt
6. Margaret Thatcher
7. Martin Luther King Jr.
8. George Washington
9. Alexander the Great
10. Imran Khan
11. Napoleon
12. Genghis Khan
13. Elizabeth I
14. Muhammad Ali Jinnah
15. Mustafa Kemal Atatürk

### Pipeline Process

1. **Scraping**: Wikipedia articles for each leader were scraped.
2. **Cleaning**: Extraneous HTML, whitespace, and formatting were stripped.
3. **Chunking**: Text was segmented into overlapping chunks of 150-200 words to maintain context.
4. **Generation**: An LLM prompt generator dynamically injected the leader's name and chunk-specific keywords to create unique, varied queries (e.g., *"What is the historical significance of..."*, *"Please provide details on..."*).

## Usage

To use this dataset with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load from local JSONL file
dataset = load_dataset("json", data_files="world_leaders_dataset.jsonl", split="train")
print(dataset[0])
```

## Limitations and Future Work

- **Mock Answers**: The current iteration contains placeholder assistant responses to establish the pipeline architecture. To make this dataset viable for actual model training, a real LLM API (like OpenAI or Anthropic) must be integrated into the scraping script to generate factually accurate answers based on the Wikipedia chunks.
- **Context Cutoffs**: Because the context is scraped dynamically, some user questions may contain cut-off sentences or partial phrases from the Wikipedia text (e.g., *"regarding cape province. one of?"*). Improving the chunking regex could result in cleaner questions.
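
The chunking step in the pipeline is described only as "overlapping chunks of 150-200 words". The following is a minimal sketch of one way to produce such chunks. The function name `chunk_words` and the 50-word overlap are assumptions made for illustration; they are not taken from `world_leaders_scrap.py`.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most `chunk_size` words.

    `overlap` (an assumed value, not stated in the dataset card) is the
    number of words shared between consecutive chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Splitting on whitespace keeps the sketch simple, but it can still cut a sentence in half, which is one plausible source of the truncated questions mentioned under Context Cutoffs. Splitting on sentence boundaries first and then grouping sentences up to the word budget may give cleaner chunks.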
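
The card names the ChatML format but does not show what a record looks like once serialized for training. As a rough illustration, one `messages` list can be rendered with the `<|im_start|>` and `<|im_end|>` delimiters commonly associated with ChatML. The helper `to_chatml` is hypothetical, and the exact special tokens depend on the target model's tokenizer, so the model's own chat template should take precedence.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML-style string.

    Assumes the common <|im_start|>/<|im_end|> delimiters; a given model's
    tokenizer may expect different tokens.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts) + "\n"


example = [
    {"role": "user", "content": "Who was Nelson Mandela?"},
    {"role": "assistant", "content": "A South African anti-apartheid leader and president."},
]
print(to_chatml(example))
```

In practice, the `apply_chat_template` method on a Hugging Face `transformers` tokenizer is usually a safer choice, because it applies whatever template the model was trained with.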
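
Because the current answers are placeholders, it may help to screen records before training. The sketch below checks each record against the documented fields and flags assistant turns that still contain the mock wording seen in the example entry. The names `check_record` and `scan_jsonl`, and the `PLACEHOLDER_MARKER` heuristic, are assumptions for illustration; the marker is taken from the single example above and may not match every placeholder in the file.

```python
import json

VALID_ROLES = {"user", "assistant"}
# Assumed heuristic: phrase copied from the card's example placeholder answer.
PLACEHOLDER_MARKER = "This section details their specific actions"


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        content = msg.get("content")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
        elif role == "assistant" and PLACEHOLDER_MARKER in content:
            problems.append(f"message {i}: placeholder answer")
    return problems


def scan_jsonl(path):
    """Map 1-based line numbers to problems for every flawed record in a file."""
    report = {}
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines
            problems = check_record(json.loads(line))
            if problems:
                report[line_no] = problems
    return report
```

Calling `scan_jsonl("world_leaders_dataset.jsonl")` on the file named in the Usage section would return an empty dictionary only if no record trips these checks.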
Provider:
ShanzaGull