filipwx/ted-podcast-finetune
Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/filipwx/ted-podcast-finetune
Description:
---
license: cc-by-4.0
language:
- en
- pt
tags:
- ted-talks
- podcasts
- fine-tuning
- llm
- conversation
- technology
- leadership
- business
- science
- lex-fridman
- joe-rogan
size_categories:
- 1K<n<10K
---
# LLM Fine-tuning Dataset: TED Talks + Podcasts
A structured dataset of transcripts from popular TED Talks and podcasts (Lex Fridman Podcast, Joe Rogan Experience), formatted for LLM fine-tuning.
## Dataset Summary
| Property | Value |
|---|---|
| **Total chunks** | 2,036 |
| **Unique episodes/talks** | 48 |
| **Train split** | 1,831 records |
| **Validation split** | 204 records |
| **Approx. total words** | Not reported |
| **Languages** | English (primary), Portuguese (some TED talks) |
| **Format** | Chat / Instruction-response |
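The split sizes can be checked against the files themselves. The reported splits sum to 2,035 records (1,831 + 204), one fewer than the 2,036 total chunks, and the table does not explain the difference. The sketch below assumes each JSONL file holds one record per non-empty line; `count_records` and `split_gap` are hypothetical helpers, not part of the dataset's tooling.

```python
import json
from pathlib import Path


def count_records(path):
    """Count the JSON records in a JSONL file, skipping blank lines."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises ValueError if the line is not valid JSON
                count += 1
    return count


def split_gap(total, train, validation):
    """Return how many records in `total` are not covered by the two splits."""
    return total - (train + validation)


# Figures reported in the summary table.
print(split_gap(2036, 1831, 204))  # 1
```

Running `count_records("dataset_train.jsonl")` and `count_records("dataset_validation.jsonl")` on the downloaded files would show whether the files match the reported figures.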
## Sources
| Source | Chunks | Topics |
|---|---|---|
| TED Talks | 170 | Leadership, Technology, Business, Science, Psychology |
| Lex Fridman Podcast | 1,809 | AI, Technology, Science, Philosophy, Business |
| Joe Rogan Experience | 57 | Technology, Science, Business, Society |
## Format
### OpenAI Fine-tuning (Chat format)
Files: `dataset_train.jsonl`, `dataset_validation.jsonl`
```json
{
  "messages": [
    {"role": "system", "content": "You are a knowledgeable expert..."},
    {"role": "user", "content": "What are the key ideas discussed here?"},
    {"role": "assistant", "content": "The core argument is..."}
  ]
}
```
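For adding records of your own in the same layout, a record can be assembled and serialized as below. This is a minimal sketch that only mirrors the three-message structure shown above; `make_chat_record` and `to_jsonl_line` are hypothetical helper names, not functions shipped with the dataset.

```python
import json


def make_chat_record(system, user, assistant):
    """Build one record in the system/user/assistant layout shown above."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


def to_jsonl_line(record):
    """Serialize a record as one JSONL line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False) + "\n"


line = to_jsonl_line(
    make_chat_record(
        "You are a knowledgeable expert...",
        "What are the key ideas discussed here?",
        "The core argument is...",
    )
)
print(json.loads(line)["messages"][1]["role"])  # user
```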
### Hugging Face Format
File: `dataset_huggingface.jsonl`
```json
{
  "id": "ted_qp0HIF3SfI4_chunk0_qa",
  "source": "TED",
  "language": "en",
  "category": "leadership",
  "type": "qa_pair",
  "text": "...",
  "instruction": "What is the main argument of this talk?",
  "system": "You are a knowledgeable assistant...",
  "messages": [...],
  "split": "train",
  "word_count": 512
}
```
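The metadata fields make it possible to pull out a subset, for example only English TED records. The following is a sketch under the assumption that every record carries the `source`, `language`, and `split` keys shown above; `iter_records` and `select` are hypothetical helpers, and the two sample records are made up for illustration. Only the `"TED"` source value appears in the example above, so the values used for the podcast sources would need to be checked in the file.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def select(records, source=None, language=None, split=None):
    """Keep records whose metadata matches every filter that is not None."""
    wanted = {"source": source, "language": language, "split": split}
    for rec in records:
        if all(v is None or rec.get(k) == v for k, v in wanted.items()):
            yield rec


# Two made-up records that follow the schema shown above.
sample = [
    '{"id": "a", "source": "TED", "language": "en", "split": "train"}',
    '{"id": "b", "source": "TED", "language": "pt", "split": "train"}',
]
english_ted = list(select(iter_records(sample), source="TED", language="en"))
print([rec["id"] for rec in english_ted])  # ['a']
```

To run this over the real data, pass an open handle on `dataset_huggingface.jsonl` to `iter_records` in place of `sample`.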
### Text Completion Format
File: `dataset_text_completion.jsonl`
```json
{
  "id": "ted_qp0HIF3SfI4_chunk0",
  "source": "TED",
  "title": "How great leaders inspire action",
  "language": "en",
  "text": "The core argument is..."
}
```
## Data Cleaning
All transcripts were processed through:
1. **Timestamp removal**: `[00:01:23]`, `(00:01)`, `0:01:23`
2. **Speaker label removal**: `Lex Fridman:`, `SPEAKER_00:`, `Host:`
3. **Noise annotation removal**: `[applause]`, `[laughter]`, `(Music)`
4. **Boilerplate removal**: Podcast intros, sponsor messages, contact info
5. **URL/email removal**
6. **Unicode normalization**: Smart quotes → straight quotes
7. **Whitespace normalization**
8. **Minimum length filter**: Chunks with <30 words removed
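The steps above can be approximated with a few regular expressions. This is an illustrative sketch, not the pipeline actually used to build the dataset: every pattern and both function names (`clean_transcript`, `long_enough`) are assumptions, the speaker-label pattern is a heuristic that can also strip ordinary lines starting with a capitalized word and a colon, and the source-specific boilerplate removal (step 4) is left out.

```python
import re

# Hypothetical patterns, one per cleaning step listed above.
TIMESTAMP = re.compile(r"[\[(]?\b\d{1,2}:\d{2}(?::\d{2})?\b[\])]?")
SPEAKER = re.compile(r"^[ \t]*[A-Z][\w.'-]*(?: [A-Z][\w.'-]*){0,2}:[ \t]*", re.MULTILINE)
NOISE = re.compile(r"[\[(](?:applause|laughter|music)[\])]", re.IGNORECASE)
URL_OR_EMAIL = re.compile(r"https?://\S+|www\.\S+|\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_transcript(text):
    """Apply steps 1-3 and 5-7 in order and return the cleaned text."""
    text = TIMESTAMP.sub(" ", text)      # 1. timestamps
    text = SPEAKER.sub("", text)         # 2. speaker labels at line starts
    text = NOISE.sub(" ", text)          # 3. noise annotations
    text = URL_OR_EMAIL.sub(" ", text)   # 5. URLs and email addresses
    text = text.translate(SMART_QUOTES)  # 6. smart quotes to straight quotes
    return " ".join(text.split())        # 7. collapse all whitespace


def long_enough(chunk, min_words=30):
    """Step 8: keep only chunks with at least `min_words` words."""
    return len(chunk.split()) >= min_words


raw = "[00:01:23] Lex Fridman: \u201cWelcome\u201d back [laughter] see https://example.com now"
print(clean_transcript(raw))  # "Welcome" back see now
```

Timestamps are removed before speaker labels so that a label following a timestamp ends up at the start of its line, where the label pattern can match it.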
## Usage
### Load with Hugging Face Datasets
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "dataset_train.jsonl",
    "validation": "dataset_validation.jsonl",
})
```
### OpenAI Fine-tuning
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file, then start a fine-tuning job on it.
with open("dataset_train.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini",
)
```
### Validate before upload
```python
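import json

# Check that every record has "messages" and that each message has a role and content.
count = 0
with open("dataset_train.jsonl", encoding="utf-8") as f:
    for count, line in enumerate(f, start=1):
        rec = json.loads(line)
        assert "messages" in rec, f"line {count}: missing 'messages'"
        for msg in rec["messages"]:
            assert "role" in msg and "content" in msg, f"line {count}: malformed message"
print(f"All {count} records valid!")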
```
## Topics Covered
**TED Talks:** Leadership & Management, Motivation & Productivity, Psychology & Behavior, Technology & Innovation, Creativity & Education, Philosophy & Ethics, Science & Neuroscience, Communication
**Lex Fridman Podcast:** Artificial Intelligence, Machine Learning, Software Engineering, Neuroscience, Physics & Mathematics, Geopolitics, Entrepreneurship
**Joe Rogan Experience:** Technology, Science, Health & Fitness, Philosophy
## Files
| File | Description |
|---|---|
| `dataset_train.jsonl` | Training split — OpenAI/HF chat format |
| `dataset_validation.jsonl` | Validation split — OpenAI/HF chat format |
| `dataset_huggingface.jsonl` | Full dataset with metadata |
| `dataset_text_completion.jsonl` | Plain text completion format |
| `dataset_full.csv` | CSV with all chunks |
| `dataset_episodes.csv` | Episode-level summary |
## License
This dataset is released under CC-BY 4.0. Transcripts are derived from publicly available content. TED transcripts © TED Conferences LLC (used for research/educational purposes). Podcast transcripts are from publicly available sources.
## Citation
```
@dataset{ted_podcast_finetune_2026,
  title={TED Talks + Podcasts LLM Fine-tuning Dataset},
  year={2026},
  publisher={Filipe Machado / Bit Pag LTDA},
  sources={TED.com, lexfridman.com, joerogan.com},
  format={JSONL / CSV},
  records={2,036}
}
```
Provider: filipwx



