
filipwx/ted-podcast-finetune

Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/filipwx/ted-podcast-finetune
Description:
---
license: cc-by-4.0
language:
- en
- pt
tags:
- ted-talks
- podcasts
- fine-tuning
- llm
- conversation
- technology
- leadership
- business
- science
- lex-fridman
- joe-rogan
size_categories:
- 1K<n<10K
---

# LLM Fine-tuning Dataset: TED Talks + Podcasts

A structured dataset of transcripts from popular TED Talks and podcasts (Lex Fridman Podcast, Joe Rogan Experience), formatted for LLM fine-tuning.

## Dataset Summary

| Property | Value |
|---|---|
| **Total chunks** | 2,036 |
| **Unique episodes/talks** | 48 |
| **Train split** | 1,831 records |
| **Validation split** | 204 records |
| **Approx. total words** | 0 |
| **Languages** | English (primary), Portuguese (some TED) |
| **Format** | Chat / Instruction-response |

## Sources

| Source | Chunks | Topics |
|---|---|---|
| TED Talks | 170 | Leadership, Technology, Business, Science, Psychology |
| Lex Fridman Podcast | 1,809 | AI, Technology, Science, Philosophy, Business |
| Joe Rogan Experience | 57 | Technology, Science, Business, Society |

## Format

### OpenAI Fine-tuning (Chat format)

Files: `dataset_train.jsonl`, `dataset_validation.jsonl`

```json
{
  "messages": [
    {"role": "system", "content": "You are a knowledgeable expert..."},
    {"role": "user", "content": "What are the key ideas discussed here?"},
    {"role": "assistant", "content": "The core argument is..."}
  ]
}
```

### Hugging Face Format

File: `dataset_huggingface.jsonl`

```json
{
  "id": "ted_qp0HIF3SfI4_chunk0_qa",
  "source": "TED",
  "language": "en",
  "category": "leadership",
  "type": "qa_pair",
  "text": "...",
  "instruction": "What is the main argument of this talk?",
  "system": "You are a knowledgeable assistant...",
  "messages": [...],
  "split": "train",
  "word_count": 512
}
```

### Text Completion Format

File: `dataset_text_completion.jsonl`

```json
{
  "id": "ted_qp0HIF3SfI4_chunk0",
  "source": "TED",
  "title": "How great leaders inspire action",
  "language": "en",
  "text": "The core argument is..."
}
```

## Data Cleaning

All transcripts were processed through the following steps:

1. **Timestamp removal**: `[00:01:23]`, `(00:01)`, `0:01:23`
2. **Speaker label removal**: `Lex Fridman:`, `SPEAKER_00:`, `Host:`
3. **Noise annotation removal**: `[applause]`, `[laughter]`, `(Music)`
4. **Boilerplate removal**: Podcast intros, sponsor messages, contact info
5. **URL/email removal**
6. **Unicode normalization**: Smart quotes → straight quotes
7. **Whitespace normalization**
8. **Minimum length filter**: Chunks with <30 words removed

## Usage

### Load with Hugging Face Datasets

```python
from datasets import load_dataset

# Load the local JSONL files as named splits.
dataset = load_dataset("json", data_files={
    "train": "dataset_train.jsonl",
    "validation": "dataset_validation.jsonl"
})
```

### OpenAI Fine-tuning

```python
from openai import OpenAI

client = OpenAI()

# Upload the training file, then start a fine-tuning job on it.
with open("dataset_train.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini"
)
```

### Validate before upload

```python
import json

count = 0  # stays 0 for an empty file instead of raising NameError
with open("dataset_train.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        # Every record needs a "messages" list of role/content pairs.
        assert "messages" in rec
        for msg in rec["messages"]:
            assert "role" in msg and "content" in msg
        count += 1

print(f"All {count} records valid!")
```

## Topics Covered

**TED Talks:** Leadership & Management, Motivation & Productivity, Psychology & Behavior, Technology & Innovation, Creativity & Education, Philosophy & Ethics, Science & Neuroscience, Communication

**Lex Fridman Podcast:** Artificial Intelligence, Machine Learning, Software Engineering, Neuroscience, Physics & Mathematics, Geopolitics, Entrepreneurship

**Joe Rogan Experience:** Technology, Science, Health & Fitness, Philosophy

## Files

| File | Description |
|---|---|
| `dataset_train.jsonl` | Training split (OpenAI/HF chat format) |
| `dataset_validation.jsonl` | Validation split (OpenAI/HF chat format) |
| `dataset_huggingface.jsonl` | Full dataset with metadata |
| `dataset_text_completion.jsonl` | Plain text completion format |
| `dataset_full.csv` | CSV with all chunks |
| `dataset_episodes.csv` | Episode-level summary |

## License

This dataset is released under CC-BY 4.0. Transcripts are derived from publicly available content. TED transcripts © TED Conferences LLC (used for research/educational purposes). Podcast transcripts are from publicly available sources.

## Citation

```
@dataset{ted_podcast_finetune_2026,
  title={TED Talks + Podcasts LLM Fine-tuning Dataset},
  year={2026},
  publisher={Filipe Machado / Bit Pag LTDA},
  sources={TED.com, lexfridman.com, joerogan.com},
  format={JSONL / CSV},
  records={2,036}
}
```
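
## Appendix: Cleaning Sketch

The dataset card lists its cleaning steps but does not publish the code behind them. The block below is a minimal sketch of how steps 1 to 3 and 5 to 8 might look, not the author's actual pipeline. Every regular expression, the `clean_transcript` and `keep_chunk` names, and the processing order are assumptions made for illustration. Step 4 (boilerplate removal) is omitted because it depends on source-specific rules that the card does not describe.

```python
import re
import unicodedata

# NOTE: all patterns below are illustrative assumptions, not the dataset's real rules.

# Step 1: timestamps such as [00:01:23], (00:01), 0:01:23
TIMESTAMP = re.compile(r"[\[(]?\b\d{1,2}:\d{2}(?::\d{2})?\b[\])]?")
# Step 2: line-initial speaker labels such as "Lex Fridman:", "SPEAKER_00:", "Host:"
SPEAKER = re.compile(
    r"^[ \t]*[A-Z][\w.'-]*(?: [A-Z][\w.'-]*){0,2}:\s*", re.MULTILINE
)
# Step 3: noise annotations such as [applause], [laughter], (Music)
NOISE = re.compile(r"[\[(](?:applause|laughter|music)[\])]", re.IGNORECASE)
# Step 5: URLs and email addresses
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://\S+|www\.\S+")
# Step 6: smart quotes mapped to straight quotes
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})
# Step 7: any run of whitespace collapses to one space
WHITESPACE = re.compile(r"\s+")


def clean_transcript(raw: str) -> str:
    """Apply the assumed cleaning steps to one transcript string."""
    text = unicodedata.normalize("NFC", raw).translate(SMART_QUOTES)
    text = TIMESTAMP.sub(" ", text)
    # Speaker labels are line-anchored, so strip them before collapsing newlines.
    text = SPEAKER.sub("", text)
    text = NOISE.sub(" ", text)
    text = EMAIL.sub(" ", text)
    text = URL.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def keep_chunk(text: str, min_words: int = 30) -> bool:
    """Step 8: drop chunks with fewer than `min_words` words."""
    return len(text.split()) >= min_words
```

A chunk would then pass through `clean_transcript` first and be kept only if `keep_chunk` returns `True`. The timestamp pattern is deliberately loose and would also strip clock times mentioned in speech (for example "10:30"), so a real pipeline would likely need something stricter.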
Provided by:
filipwx