ada-flo/nlp-hack-debate

Name: ada-flo/nlp-hack-debate
Creator: ada-flo
Published: 2026-04-27 05:51:14
License: No description provided

Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/ada-flo/nlp-hack-debate


Resource description:

---
language:
- en
- ko
license: cc-by-4.0
task_categories:
- text2text-generation
pretty_name: Debate-Themed Dialogue Generation Dataset
size_categories:
- 10K<n<100K
---

# ada-flo/nlp-hack-debate

Bilingual (English + Korean) training data for an LSTM-based seq2seq debate chatbot. Each record is a `(topic, input_context, target_output)` triple plus precomputed `encoder_input` / `decoder_input` / `decoder_target` fields ready for seq2seq training.

## Schema

```json
{
  "id": "ibm_argq_30k_8b4b12caccad",
  "lang": "en",
  "source": "ibm_argq_30k",
  "is_synthetic": false,
  "input_stance": "pro",
  "target_stance": "con",
  "topic": "We should abandon marriage",
  "input_context": "abandoning marriage allows for people to grow as themselves...",
  "target_output": "committment and stability are important in the lives of children...",
  "encoder_input": "We should abandon marriage <SEP> abandoning marriage allows...",
  "decoder_input": "<SOS> committment and stability are important...",
  "decoder_target": "committment and stability are important... <EOS>",
  "meta": {
    "source_record_ids": [],
    "quality_input_WA": 1.0,
    "...": "..."
  }
}
```

Top-level fields filterable in the HF dataset viewer:

| Field | Values |
|---|---|
| `lang` | `en`, `ko` |
| `source` | `ibm_argq_30k`, `mc_conversation`, `isotonic_conversation`, `casual_conversation`, `ko_debate_synth`, `korean_petitions` |
| `is_synthetic` | `true`, `false` |
| `input_stance` | `pro`, `con`, `petition_position`, `supportive`, `oppositional`, … |
| `target_stance` | `pro`, `con`, `opposition`, … |

## Splits

| Split | Records | EN | KO |
|---|---|---|---|
| train | 40,006 | 27,093 | 12,913 |
| validation | 5,050 | 3,811 | 1,239 |
| test | 4,429 | 2,760 | 1,669 |

Splits are **topic-level** for debate-shaped sources: motion-grouped records all land in one split, so there is no leakage. Casual chat and topic-seeded synth use a row-wise split because they share placeholder topics.
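The topic-level splitting described above can be illustrated with a minimal sketch. This is not the repository's actual split code: the function name `topic_level_split`, the 80/10/10 fractions, the seed, and the toy records are all assumptions made for illustration. The idea is simply to assign whole topics, rather than individual rows, to a split.

```python
import random
from collections import defaultdict


def topic_level_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole topics to train/validation/test (hypothetical helper).

    The fractions and seed are illustrative assumptions, not values
    taken from the dataset's real pipeline.
    """
    # Group records by their motion so a topic is never divided.
    by_topic = defaultdict(list)
    for rec in records:
        by_topic[rec["topic"]].append(rec)

    # Sort before shuffling so the result is reproducible for a given seed.
    topics = sorted(by_topic)
    random.Random(seed).shuffle(topics)

    n_train = int(len(topics) * fractions[0])
    n_val = int(len(topics) * fractions[1])
    assignment = {
        "train": topics[:n_train],
        "validation": topics[n_train:n_train + n_val],
        "test": topics[n_train + n_val:],
    }
    return {
        name: [rec for topic in split_topics for rec in by_topic[topic]]
        for name, split_topics in assignment.items()
    }


# Toy data (made up): 10 motions with 3 arguments each.
records = [
    {"topic": f"motion {i}", "input_context": f"argument {j}"}
    for i in range(10)
    for j in range(3)
]
splits = topic_level_split(records)
topic_sets = {name: {r["topic"] for r in rows} for name, rows in splits.items()}
```

Because every record of a motion follows its topic into one split, the three topic sets are disjoint by construction. A row-wise split, as used for the sources with placeholder topics, would shuffle the individual records instead.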
## Sources (train split)

| Source | Records |
|---|---|
| ibm_argq_30k | 24,126 |
| korean_petitions | 8,203 |
| ko_debate_synth | 4,710 |
| isotonic_conversation | 1,186 |
| mc_conversation | 971 |
| casual_conversation | 810 |

## Source descriptions

- **ibm_argq_30k**: [IBM Argument Quality Ranking 30K](https://huggingface.co/datasets/ibm-research/argument_quality_ranking_30k). Real human pro/con stance pairs over ~70 motions.
- **mc_conversation**: [mc-ai/conversation_dataset](https://huggingface.co/datasets/mc-ai/conversation_dataset), filtered to `corpus_id=persuasionforgood`. Real persuasion-themed multi-turn dialogue (Persuasion-for-Good corpus).
- **isotonic_conversation**: [Isotonic/human_assistant_conversation](https://huggingface.co/datasets/Isotonic/human_assistant_conversation), filtered to single-turn rows without dialog markers or code-task content.
- **casual_conversation**: [SohamGhadge/casual-conversation](https://huggingface.co/datasets/SohamGhadge/casual-conversation). Casual greeting-style exchanges for conversational fluency.
- **ko_debate_synth**: Topic-seeded debate-pair synthesis (Korean). 98 curated debate motions × 30 LLM-generated PRO/CON pairs each. Uses Qwen3-235B-A22B-Instruct via vLLM at temperature 0.9. Both directions per pair.
- **korean_petitions**: Korean Petitions corpus (청와대 국민청원 2017–2019, via Korpora). Petition title = motion, body (truncated to 280 chars) = `input_context`, vLLM-synthesized counter-argument = `target_output`.

## Synthetic data

Records with `meta.is_synthetic=true` were generated by Qwen3-235B-A22B-Instruct served via vLLM. Synthesis prompt versions are recorded in `meta.synthesis_prompt_version`.

| Prompt version | Used by |
|---|---|
| v1 (counterargument) | korean_petitions |
| v1 (debate_pair) | ko_debate_synth |

Prompts: see `src/synth/prompts.py` in the source repository.

## License

CC BY 4.0. Source corpora retain their original licenses; consult each source link above for redistribution terms before commercial use.

## Repository

Generated by https://github.com/ada-flo/nlp-hack. See that repo for the full preprocessing pipeline, source adapters, and synth client code.
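The schema example suggests how the precomputed seq2seq fields relate to the `(topic, input_context, target_output)` triple. The sketch below reproduces that layout as an illustration only: `build_seq2seq_fields` is a hypothetical helper, and the single-space joins around `<SEP>`, `<SOS>`, and `<EOS>` are inferred from the one sample record, not taken from the preprocessing pipeline itself.

```python
SEP, SOS, EOS = "<SEP>", "<SOS>", "<EOS>"


def build_seq2seq_fields(topic, input_context, target_output):
    """Derive encoder/decoder strings from one triple (hypothetical helper).

    The single-space separators are an assumption based on the
    schema example in the dataset card.
    """
    return {
        # Encoder sees the motion and the opposing argument, joined by <SEP>.
        "encoder_input": f"{topic} {SEP} {input_context}",
        # Decoder input is the target prefixed with the start token
        # (teacher forcing).
        "decoder_input": f"{SOS} {target_output}",
        # Decoder target is the same text followed by the end token.
        "decoder_target": f"{target_output} {EOS}",
    }


# Strings copied from the schema example, including its "..." truncation.
fields = build_seq2seq_fields(
    topic="We should abandon marriage",
    input_context="abandoning marriage allows...",
    target_output="committment and stability are important...",
)
```

Since the dataset already ships these fields, a helper like this is mainly useful for sanity-checking records or for formatting new inputs at inference time in the same way.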

Provider:

ada-flo
