
OpenVoiceOS/yes_no_answers

Source: Hugging Face. Updated 2026-04-24. Indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/OpenVoiceOS/yes_no_answers
Description:
---
language:
- en
- de
- fr
- es
- it
- pt
- ru
- uk
- pl
- nl
- sv
- da
- fi
- nb
- nn
- cs
- sk
- ro
- hr
- sl
- hu
- bg
- el
- ca
- lt
- lv
- et
- eu
- gl
- is
- an
- ja
- ko
- zh
- ar
- he
- fa
- tr
- id
- ms
- fil
- vi
- th
license: apache-2.0
task_categories:
- text-classification
task_ids:
- intent-classification
tags:
- yes-no
- dialogue
- multilingual
- agreement
- intent
size_categories:
- 10K<n<100K
---

# Yes/No Multilingual Answers Dataset

A dataset of **10,709** conversational utterances for classifying yes/no/ambiguous responses across **43 languages**.

## Dataset Description

Each sample is a natural language utterance a person might say in response to a yes/no question. The dataset covers three classes:

| Label | Description |
|-------|-------------|
| `yes` | Affirmation, agreement, or confirmation |
| `no` | Negation, refusal, or disagreement |
| `None` | Genuinely ambiguous; cannot be resolved without context |

### Schema

```
utterance,agreement,subtype,language
"ja","yes","Y1","de"
"absolument pas","no","N2","fr"
"peut-être","None","C1","fr"
```

## Statistics

| Metric | Value |
|--------|-------|
| Total samples | 10,709 |
| Languages | 43 |
| Samples per language | 224–290 (avg 249) |
| Label: yes | 3,873 (36.2%) |
| Label: no | 3,826 (35.7%) |
| Label: None | 3,010 (28.1%) |
| Semantic subtypes | 28 |
| Min samples per subtype per language | 8 |

## Languages

**European:** English · German · French · Spanish · Italian · Portuguese · Russian · Ukrainian · Polish · Dutch · Swedish · Danish · Finnish · Norwegian Bokmål · Norwegian Nynorsk · Czech · Slovak · Romanian · Croatian · Slovenian · Hungarian · Bulgarian · Greek · Catalan · Lithuanian · Latvian · Estonian · Basque · Galician · Icelandic · Aragonese

**Asian & Middle Eastern:** Japanese · Korean · Chinese · Arabic · Hebrew · Persian · Turkish · Indonesian · Malay · Filipino · Vietnamese · Thai

## Semantic Subtypes

### YES (Y1–Y10)

| ID | Description | English Examples |
|----|-------------|-----------------|
| Y1 | Direct affirmation | yes, yeah, yep, aye |
| Y2 | Emphatic affirmation | absolutely, definitely, without a doubt |
| Y3 | Polite/soft affirmation | of course, gladly, with pleasure |
| Y4 | Colloquial/slang affirmation | you bet, totally, hell yeah |
| Y5 | Agreement with proposition | I agree, exactly, spot on |
| Y6 | Preference/willingness | I'd love to, I'm in, sounds good |
| Y7 | Paradox resolving to yes | I can't say no, I don't disagree |
| Y8 | Rhetorical confirmation | is the sky blue?, does a bear live in the woods? |
| Y9 | Non-verbal/gestural description | *nods*, *thumbs up* |
| Y10 | Contextual indirect yes | let's do it, that works for me |

### NO (N1–N10)

| ID | Description | English Examples |
|----|-------------|-----------------|
| N1 | Direct negation | no, nope, nay, nah |
| N2 | Emphatic negation | absolutely not, never, no way |
| N3 | Polite/soft negation | I'd rather not, I'm afraid not |
| N4 | Colloquial/slang negation | hard pass, not happening, fat chance |
| N5 | Disagreement with proposition | I disagree, you're wrong, that's incorrect |
| N6 | Refusal/aversion | I refuse, count me out, I won't |
| N7 | Paradox resolving to no | yes but actually no, yes yes yes but no |
| N8 | Rhetorical denial | when pigs fly, not in a million years |
| N9 | Non-verbal/gestural description | *shakes head*, *thumbs down* |
| N10 | Contextual indirect no | I'll pass, no thank you, I'm good |

### NONE / Ambiguous (C1–C8)

| ID | Description | English Examples |
|----|-------------|-----------------|
| C1 | Pure uncertainty | maybe, perhaps, I'm not sure |
| C2 | Conditional yes | only if, depends on the price |
| C3 | Conditional no | unless you can prove it, not if it costs money |
| C4 | Deferral / time-based | later, not now, ask me again |
| C5 | Processing / thinking | let me think, I'm considering it |
| C6 | Ambiguous both-sides | it depends, I have mixed feelings |
| C7 | Redirection / clarification | why do you ask?, what do you mean? |
| C8 | Partial agreement | sort of, kind of, more or less |

## Files

| File | Description |
|------|-------------|
| `yesno_multilingual.csv` | Main dataset (10,709 rows) |
| `taxonomy.md` | Full taxonomy, subtype definitions, and golden rules |

## Usage

```python
from datasets import load_dataset

# Repository ID as shown in this listing's title and download link.
ds = load_dataset("OpenVoiceOS/yes_no_answers")
```

### Filter by language

```python
en = ds["train"].filter(lambda x: x["language"] == "en")
```

### Filter by label

```python
yes_only = ds["train"].filter(lambda x: x["agreement"] == "yes")
```

## How the Data Was Generated

All utterances were generated directly by a large language model (Claude) acting as a multilingual conversational AI. No machine translation was used; each utterance was composed idiomatically in its target language from scratch.

The generation process followed a strict per-language protocol:

1. **Taxonomy-first**: Each language block was generated by iterating over all 28 semantic subtypes (Y1–Y10, N1–N10, C1–C8) and producing multiple idiomatic examples per subtype.
2. **Register coverage**: Examples span formal, neutral, and casual registers. Languages with a formal/informal T–V distinction (German du/Sie, French tu/vous, Spanish tú/usted, Japanese plain/polite forms, Korean formal/informal, etc.) include both.
3. **Golden rules enforcement**: Each utterance was checked against validation rules covering label integrity, no label leaking, length ≤ 75 characters, naturalism, and uniqueness.
4. **Cultural authenticity**: Rhetorical forms (Y8, N8) use idioms native to each language's culture rather than translated English expressions.
5. **Deduplication**: A global deduplication pass ensures no utterance appears twice across the entire dataset.
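
The mechanically checkable parts of steps 3 and 5 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' tooling: `validate_rows` and its problem messages are hypothetical names, the label check assumes that the subtype prefix (Y, N, C) determines the label (`yes`, `no`, `None`) as the taxonomy tables imply, and deduplication is treated as case-insensitive. Rules such as "no label leaking" and "naturalism" need human or model judgment and are not covered. The last two sample rows are invented to trigger the checks and are not taken from the dataset.

```python
import csv
import io

MAX_LEN = 75  # length limit stated in the card
PREFIX_TO_LABEL = {"Y": "yes", "N": "no", "C": "None"}  # assumed from the taxonomy
VALID_SUBTYPES = (
    {f"Y{i}" for i in range(1, 11)}
    | {f"N{i}" for i in range(1, 11)}
    | {f"C{i}" for i in range(1, 9)}
)


def validate_rows(rows):
    """Split rows into (kept, problems); problems are (utterance, reason) pairs."""
    seen = set()  # case-folded utterances already accepted, across all languages
    kept, problems = [], []
    for row in rows:
        text, subtype = row["utterance"], row["subtype"]
        if subtype not in VALID_SUBTYPES:
            problems.append((text, "unknown subtype"))
        elif PREFIX_TO_LABEL[subtype[0]] != row["agreement"]:
            problems.append((text, "label does not match subtype"))
        elif len(text) > MAX_LEN:
            problems.append((text, "too long"))
        elif text.casefold() in seen:
            problems.append((text, "duplicate"))
        else:
            seen.add(text.casefold())
            kept.append(row)
    return kept, problems


# First three rows come from the schema example; the last two are hypothetical.
SAMPLE = '''utterance,agreement,subtype,language
"ja","yes","Y1","de"
"absolument pas","no","N2","fr"
"peut-être","None","C1","fr"
"Ja","yes","Y1","nl"
"maybe","yes","C1","en"
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept, problems = validate_rows(rows)
for text, reason in problems:
    print(f"{text!r}: {reason}")
```

The standard-library `csv` module is used here because it reads the literal string `None` as text; some CSV readers treat that string as a missing value by default, which would silently drop the ambiguous class.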

The language set aligns with the [OVOS localize](https://github.com/OpenVoiceOS/ovos-localize) classification dataset, covering European, Middle Eastern, and Asian languages including minority and regional languages (Basque, Catalan, Galician, Aragonese, Norwegian Nynorsk, Icelandic).

## Quality Guarantees

- **No machine translation**: all utterances are idiomatically authentic per language
- **≥ 8 samples per subtype per language**: every (language × subtype) cell is covered
- **Zero duplicates**: global case-insensitive deduplication across all 43 languages
- **Zero overlength entries**: all utterances ≤ 75 characters
- **Register diversity**: formal, neutral, and casual speech per language
- **Paradox handling**: utterances like "yes but actually no" are labeled by final resolution

## License

Apache 2.0
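
The coverage claim under Quality Guarantees (at least 8 samples in every language × subtype cell) can be checked with a simple count. Note that 28 subtypes × 8 samples gives 224, which matches the minimum samples per language in the Statistics table. The sketch below is illustrative only: `coverage_gaps` is a hypothetical helper, and the demo rows are synthetic placeholders rather than dataset content.

```python
from collections import Counter
from itertools import product

# The 28 subtype IDs from the taxonomy: Y1–Y10, N1–N10, C1–C8.
SUBTYPES = (
    [f"Y{i}" for i in range(1, 11)]
    + [f"N{i}" for i in range(1, 11)]
    + [f"C{i}" for i in range(1, 9)]
)


def coverage_gaps(rows, languages, min_per_cell=8):
    """Return {(language, subtype): count} for every cell below the minimum."""
    counts = Counter((r["language"], r["subtype"]) for r in rows)
    return {
        (lang, sub): counts[(lang, sub)]
        for lang, sub in product(languages, SUBTYPES)
        if counts[(lang, sub)] < min_per_cell
    }


# Synthetic demo: two languages with 8 rows per cell, except one short cell.
demo_rows = []
for lang, sub in product(["en", "fr"], SUBTYPES):
    n = 3 if (lang, sub) == ("fr", "C8") else 8
    demo_rows += [
        {"language": lang, "subtype": sub, "utterance": f"placeholder {lang} {sub} {i}"}
        for i in range(n)
    ]

gaps = coverage_gaps(demo_rows, ["en", "fr"])
print(gaps)
```

Run against the full `yesno_multilingual.csv` with all 43 language codes, an empty result would be consistent with the stated guarantee.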
Provider:
OpenVoiceOS