TigreGotico/sentence-types-multilingual

Name: TigreGotico/sentence-types-multilingual
Creator: TigreGotico
Published: 2026-04-02 00:06:35
License: Apache 2.0

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/TigreGotico/sentence-types-multilingual


Resource description:

---
language:
- en
- es
- fr
- de
- it
- pt
- nl
license: apache-2.0
task_categories:
- text-classification
pretty_name: Little Questions - Multilingual Sentence Types
size_categories:
- 10K<n<100K
tags:
- multilingual
- sentence-classification
- question-classification
---

# Little Questions: Multilingual Sentence Types Dataset

A multilingual dataset of 69,300 labeled sentences (9,900 per language) across 6 sentence type categories and 7 languages. Designed for training and evaluating sentence-type classifiers in multilingual contexts.

## Dataset Details

- **Total entries**: 69,300 (9,900 × 7 languages)
- **Languages**: English (EN), Spanish (ES), French (FR), German (DE), Italian (IT), Portuguese (PT), Dutch (NL)
- **Class distribution**: 11,550 entries per label (1,650 per label per language × 7 languages; perfectly balanced)
- **Format**: CSV (UTF-8)

## Label Taxonomy

Each sentence is classified into one of six mutually exclusive categories:

| Label | Definition | Examples |
|-------|-----------|----------|
| `command` | Direct imperative with no polite framing. Verb-initial constructions ordering action. | "Close the door", "Stop talking", "Send me the file" |
| `exclamation` | Expressive, emphatic sentences conveying emotion or emphasis, often with "What a…!" or "How…!" constructions. | "What a beautiful sunset!", "How wonderful!", "That's incredible!" |
| `polar_question` | Yes/no questions seeking binary affirmation or negation, typically via auxiliary inversion or modal forms. | "Do you like coffee?", "Can you help me?", "Is it raining?" |
| `request` | Polite ask using conditional or modal forms ("Could you", "Would you", "Can you", "May I", "Might I"). Frames the action as an option rather than a command. | "Could you pass the salt?", "Would you mind closing the window?", "May I borrow your pen?" |
| `statement` | Declarative sentences reporting facts, states, or observations with no interrogative or imperative structure. | "The Earth orbits the Sun", "I live in Paris", "She is a doctor" |
| `wh_question` | Open-ended information-seeking questions using wh-words (Who, What, When, Where, Why, How). Expects a substantive answer, not a binary response. | "Where are you from?", "What time is it?", "How does photosynthesis work?" |

## Generation Process

**English Source (9,900 entries):**

1. Started with a base corpus of 3,001 sentences across 6 classes
2. Applied rule-based validation and correction to fix label drift (e.g., "Could you…?" → `request`, not `polar_question`)
3. Hand-authored additional entries to achieve the target balance of 1,650 per class
4. Final English dataset spans diverse registers (formal, casual, technical, conversational) and contexts (workplace, social, travel, services, household, academic)

**Multilingual Translation:**

1. Translated the English dataset into the six other languages using **Tower-Plus-2B-GGUF** (1.71 GB Q4_K_M quantization)
2. Ran locally via `llama-cpp-python` with Gemma2 chat tokens for accurate instruction following
3. Used a checkpoint/resume pattern for fault tolerance during long-running translation jobs
4. All labels preserved verbatim during translation (no label drift)

**Data Quality:**

- No exact duplicates within or across languages
- Balanced class distribution: 11,550 entries per label (1,650 per label per language)
- Translations spot-checked for coherence, encoding, and semantic preservation
- All text UTF-8 encoded with diacritical marks preserved

## Format

CSV with three columns:

```
language,label,text
EN,command,Close the door
ES,command,Cierra la puerta
FR,command,Ferme la porte
```

- `language`: BCP 47 language code (en, es, fr, de, it, pt, nl); the sample rows above store it in uppercase (EN, ES, FR)
- `label`: One of {command, exclamation, polar_question, request, statement, wh_question}
- `text`: Sentence in the target language (UTF-8)

## Usage

Load with pandas:

```python
import pandas as pd

df = pd.read_csv('sentence_types_multilingual.csv')

# Filter by language:
df[df['language'] == 'EN']

# Filter by label:
df[df['label'] == 'request']

# Check balance:
df['label'].value_counts()
```

## Source & Attribution

Part of the **little-questions** project, a lightweight multilingual question classification library. Translations generated using the **Tower-Plus-2B-GGUF** quantized LLM (Unbabel/Tower-Plus-2B) via llama-cpp-python.

## License

Apache 2.0
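
The balance claims in the card can be checked against a local copy of the CSV. The following is a minimal sketch using only the standard library, assuming the file carries the `language,label,text` header shown in the card. `SAMPLE`, `count_by`, and `is_balanced` are hypothetical names introduced here for illustration, and the three in-memory rows stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real CSV: the three sample rows from the card.
SAMPLE = """language,label,text
EN,command,Close the door
ES,command,Cierra la puerta
FR,command,Ferme la porte
"""


def count_by(rows, *columns):
    """Count rows grouped by the values of the given columns."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def is_balanced(counts):
    """Return True when every group holds the same number of rows."""
    return len(set(counts.values())) <= 1


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

per_label = count_by(rows, "label")             # rows per label
per_cell = count_by(rows, "language", "label")  # rows per (language, label)

print(per_label)
print("balanced per (language, label):", is_balanced(per_cell))
```

To run the same check on the full dataset, replace `io.StringIO(SAMPLE)` with `open(path, newline="", encoding="utf-8")` for your local file. If the card's figures hold, every (language, label) cell should contain 1,650 rows.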

Provider:

TigreGotico
