ShanzaGull/world_Top_leaders_Dataset
Hugging Face, updated 2026-04-27, indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/ShanzaGull/world_Top_leaders_Dataset
Resource description:
---
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- history
- world-leaders
- chatml
- synthetic
license: mit
---
# Dataset Card: World Leaders ChatML Dataset
## Dataset Description
The **World Leaders ChatML Dataset** is a synthetically generated dataset comprising approximately 2,000 question-and-answer pairs focused on the lives, accomplishments, and historical significance of 15 highly influential world leaders.
This dataset is specifically designed for instruction-tuning and fine-tuning Large Language Models (LLMs), such as TinyLlama, Llama 2, or Mistral, using the standard **ChatML** conversational format.
### Supported Tasks
- **Instruction Fine-Tuning**: Training models to respond to historical and biographical queries.
- **Conversational QA**: Building AI assistants that act as historical experts or educational tutors.
## Dataset Structure
### Format
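The dataset is provided in `.jsonl` (JSON Lines) format. Each line is one JSON object holding a single-turn conversation (one user question and one assistant answer), stored as a `messages` list in the role/content structure used for ChatML-style chat training.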
### Data Fields
- `messages`: A list of dictionaries, where each dictionary represents one turn in the conversation. Each turn has two keys:
  - `role`: The speaker's role, either `"user"` (asking the question) or `"assistant"` (providing the answer).
  - `content`: The text of the question or the answer.
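
The field layout above can be checked mechanically before training. The following is a minimal sketch, not part of the original pipeline: `validate_record` and `ALLOWED_ROLES` are hypothetical names, and the rule that every turn needs non-empty string content is an assumption drawn from the field descriptions.

```python
import json

# Roles named in the Data Fields section; anything else is reported.
ALLOWED_ROLES = {"user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of schema problems for one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["`messages` must be a non-empty list"]

    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {i} has unexpected role {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i} has empty or non-string content")
    return problems
```

Running this over each line of the file and printing any non-empty result gives a quick sanity check on the structure.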
### Example Entry
```json
{
  "messages": [
    {
      "role": "user",
      "content": "Can you explain what Nelson Mandela did regarding a second presidential term?"
    },
    {
      "role": "assistant",
      "content": "Based on the text, Nelson Mandela was significantly involved with a second presidential term. This section details their specific actions and historical impact."
    }
  ]
}
```
*(Note: The current answers are placeholders/mock data designed to test the pipeline structure. They should be replaced by high-quality LLM-generated answers before production training.)*
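
For models whose tokenizer uses ChatML delimiters, an entry like the one above is usually flattened into a single training string. The sketch below shows one way to do that by hand. It assumes the `<|im_start|>` and `<|im_end|>` special tokens, and `to_chatml` is a hypothetical helper, not something shipped with the dataset.

```python
def to_chatml(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Render a list of role/content turns as a ChatML-delimited string."""
    parts = [
        f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n"
        for turn in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Who was Elizabeth I?"},
    {"role": "assistant", "content": "Elizabeth I was Queen of England from 1558 to 1603."},
]
print(to_chatml(example))
```

When fine-tuning with the `transformers` library, the tokenizer's own chat template (`tokenizer.apply_chat_template`) is generally the safer choice, since not every model listed above was trained with ChatML tokens.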
## Dataset Creation
### Source Data
The raw text was automatically scraped from the **Wikipedia** article of each selected leader using a custom Python pipeline (`world_leaders_scrap.py`).
### Leaders Included
To keep the dataset balanced, approximately 133 rows (about 2,000 ÷ 15) were generated for each of the following 15 leaders:
1. Nelson Mandela
2. Winston Churchill
3. Abraham Lincoln
4. Mahatma Gandhi
5. Franklin D. Roosevelt
6. Margaret Thatcher
7. Martin Luther King Jr.
8. George Washington
9. Alexander the Great
10. Imran Khan
11. Napoleon
12. Genghis Khan
13. Elizabeth I
14. Muhammad Ali Jinnah
15. Mustafa Kemal Atatürk
### Pipeline Process
1. **Scraping**: Wikipedia articles for each leader were scraped.
2. **Cleaning**: Extraneous HTML, whitespace, and formatting were stripped.
3. **Chunking**: Text was segmented into overlapping chunks of 150-200 words to maintain context.
4. **Generation**: An LLM prompt generator dynamically injected the leader's name and chunk-specific keywords to create unique, varied queries (e.g., *"What is the historical significance of..."*, *"Please provide details on..."*).
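
The cleaning, chunking, and generation steps can be sketched as follows. This is an illustration under stated assumptions, not the code in `world_leaders_scrap.py`: the function names are hypothetical, the 50-word overlap is a guess (the card only gives the 150-200 word chunk size), the template endings are invented to complete the two quoted openers, and keyword extraction is left to the caller.

```python
import re


def clean_text(raw: str) -> str:
    """Drop leftover HTML tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word chunks of at most `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Openers quoted in the card; the endings are assumed for illustration.
TEMPLATES = [
    "What is the historical significance of {leader} regarding {keyword}?",
    "Please provide details on {leader} and {keyword}.",
]


def make_question(leader: str, keyword: str, index: int) -> str:
    """Fill one template, cycling through TEMPLATES by index for variety."""
    template = TEMPLATES[index % len(TEMPLATES)]
    return template.format(leader=leader, keyword=keyword)
```

Because consecutive chunks share words, a fact that straddles a chunk boundary still appears whole in at least one chunk, which is the usual reason for overlapping windows.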
## Usage
To use this dataset with the Hugging Face `datasets` library:
```python
from datasets import load_dataset
# Load from local JSONL file
dataset = load_dataset("json", data_files="world_leaders_dataset.jsonl", split="train")
print(dataset[0])
```
## Limitations and Future Work
- **Mock Answers**: The current iteration contains placeholder assistant responses to establish the pipeline architecture. To make this dataset viable for actual model training, a real LLM API (like OpenAI or Anthropic) must be integrated into the scraping script to generate factually accurate answers based on the Wikipedia chunks.
- **Context Cutoffs**: Because question keywords are lifted directly from the scraped chunks, some user questions contain cut-off sentences or partial phrases from the Wikipedia text (e.g., *"regarding cape province. one of?"*). Improving the chunking regex could result in cleaner questions.
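
Both limitations can be screened for with simple heuristics before any training run. The sketch below is an assumption-laden illustration: the placeholder markers are taken from the single example entry in this card and may not match every row, the truncation pattern is modeled on the one quoted cut-off question, and all names are hypothetical.

```python
import re

# Phrases seen in the example entry's mock answer (assumed to recur in other rows).
PLACEHOLDER_MARKERS = (
    "Based on the text,",
    "This section details their specific actions",
)

# A sentence break followed by a lowercase fragment that ends the question,
# as in "regarding cape province. one of?".
_TRUNCATED = re.compile(r"[.!?]\s+[a-z][^.!?]*\?$")


def is_placeholder_answer(answer: str) -> bool:
    """True if the answer contains any known mock-answer phrase."""
    return any(marker in answer for marker in PLACEHOLDER_MARKERS)


def looks_truncated(question: str) -> bool:
    """True if the question appears to end in a fragment from a chunk boundary."""
    return bool(_TRUNCATED.search(question.strip()))


def flag_record(record: dict) -> dict:
    """Report both checks for one record in the `messages` format."""
    turns = record["messages"]
    question = next((t["content"] for t in turns if t["role"] == "user"), "")
    answer = next((t["content"] for t in turns if t["role"] == "assistant"), "")
    return {
        "placeholder_answer": is_placeholder_answer(answer),
        "truncated_question": looks_truncated(question),
    }
```

The truncation check is deliberately loose and will also flag questions containing abbreviations such as "Jr." followed by lowercase text, so flagged rows are best reviewed rather than dropped automatically.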
Provider:
ShanzaGull



