TigreGotico/sentence-types-multilingual
Source: Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/TigreGotico/sentence-types-multilingual
Description:
---
language:
- en
- es
- fr
- de
- it
- pt
- nl
license: apache-2.0
task_categories:
- text-classification
pretty_name: Little Questions - Multilingual Sentence Types
size_categories:
- 10K<n<100K
tags:
- multilingual
- sentence-classification
- question-classification
---
# Little Questions: Multilingual Sentence Types Dataset
A multilingual dataset of 69,300 labeled sentences (9,900 per language) across 6 sentence type categories and 7 languages. Designed for training and evaluating sentence-type classifiers in multilingual contexts.
## Dataset Details
- **Total entries**: 69,300 (9,900 × 7 languages)
- **Languages**: English (EN), Spanish (ES), French (FR), German (DE), Italian (IT), Portuguese (PT), Dutch (NL)
- **Class distribution**: 11,550 entries per label (1,650 per label per language × 7 languages; perfectly balanced)
- **Format**: CSV (UTF-8)
## Label Taxonomy
Each sentence is classified into one of six mutually exclusive categories:
| Label | Definition | Examples |
|-------|-----------|----------|
| `command` | Direct imperative with no polite framing. Verb-initial constructions ordering action. | "Close the door", "Stop talking", "Send me the file" |
| `exclamation` | Expressive, emphatic sentences conveying emotion or emphasis, often with "What a…!" or "How…!" constructions. | "What a beautiful sunset!", "How wonderful!", "That's incredible!" |
| `polar_question` | Yes/no questions seeking binary affirmation or negation, typically via auxiliary inversion or modal forms. | "Do you like coffee?", "Can you help me?", "Is it raining?" |
| `request` | Polite ask using conditional or modal forms ("Could you", "Would you", "Can you", "May I", "Might I"). Frames action as option rather than command. | "Could you pass the salt?", "Would you mind closing the window?", "May I borrow your pen?" |
| `statement` | Declarative sentences reporting facts, states, or observations with no interrogative or imperative structure. | "The Earth orbits the Sun", "I live in Paris", "She is a doctor" |
| `wh_question` | Open-ended information-seeking questions using wh-words (Who, What, When, Where, Why, How). Expects substantive answer, not binary response. | "Where are you from?", "What time is it?", "How does photosynthesis work?" |
## Generation Process
**English Source (9,900 entries):**
1. Started with a base corpus of 3,001 sentences across 6 classes
2. Applied rule-based validation and correction to fix label drift (e.g., "Could you…?" → `request`, not `polar_question`)
3. Hand-authored additional entries to achieve target balance of 1,650 per class
4. Final English dataset spans diverse registers (formal, casual, technical, conversational) and contexts (workplace, social, travel, services, household, academic)
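The rule-based correction in step 2 could look roughly like the sketch below. The openers come from the `request` definition in the label taxonomy. The function name `correct_label` and the choice to relabel only `polar_question` rows are assumptions for illustration, not the project's actual rules.

```python
import re

# Polite-modal openers taken from the `request` definition in the taxonomy.
# "Can you" is left out on purpose: the taxonomy table also lists
# "Can you help me?" as a polar_question example, so it is ambiguous.
REQUEST_PATTERN = re.compile(
    r"^(could you|would you|may i|might i)\b", re.IGNORECASE
)

def correct_label(text: str, label: str) -> str:
    """Hypothetical rule: relabel polite modal questions as `request`."""
    if label == "polar_question" and REQUEST_PATTERN.match(text.strip()):
        return "request"
    return label

rows = [
    ("Could you pass the salt?", "polar_question"),
    ("Is it raining?", "polar_question"),
    ("Close the door", "command"),
]
corrected = [(text, correct_label(text, label)) for text, label in rows]
```

The `\b` word boundary keeps a sentence such as "May it rain tomorrow?" from matching the "May I" opener.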
**Multilingual Translation:**
1. Translated the English dataset into the 6 other languages (ES, FR, DE, IT, PT, NL) using **Tower-Plus-2B-GGUF** (1.71 GB Q4_K_M quantization)
2. Ran locally via `llama-cpp-python` with Gemma2 chat tokens for accurate instruction following
3. Used checkpoint/resume pattern for fault tolerance during long-running translation jobs
4. All labels preserved verbatim during translation (no label drift)
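A checkpoint/resume loop like the one in step 3 might be structured as in the sketch below. It is not the project's code: `translate` is a placeholder standing in for the real model call, and the JSON Lines checkpoint format and all function names are assumptions.

```python
import json
import os
import tempfile

def translate(text: str, target_lang: str) -> str:
    """Placeholder for the model call; the real pipeline would query the LLM here."""
    return f"[{target_lang}] {text}"

def translate_with_checkpoint(rows, target_lang, checkpoint_path):
    """Translate (label, text) rows, appending each result to a JSONL checkpoint.

    On restart, rows already recorded in the checkpoint are skipped.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = sum(1 for line in f if line.strip())

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for label, text in rows[done:]:
            record = {
                "language": target_lang,
                "label": label,  # copied verbatim, never translated
                "text": translate(text, target_lang),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # hand each record to the OS so little work is lost on a crash

    with open(checkpoint_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

rows = [("command", "Close the door"), ("statement", "I live in Paris")]
checkpoint = os.path.join(tempfile.mkdtemp(), "es.jsonl")
first_run = translate_with_checkpoint(rows[:1], "ES", checkpoint)  # simulated interruption
resumed = translate_with_checkpoint(rows, "ES", checkpoint)        # continues at row 2
```

Because the checkpoint is append-only and rows are processed in order, the number of lines already written is enough to know where to resume.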
**Data Quality:**
- No exact duplicates within or across languages
- Balanced class distribution: 11,550 entries per label (1,650 per label per language)
- Translations spot-checked for coherence, encoding, and semantic preservation
- All text UTF-8 encoded with proper diacritical marks preserved
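The duplicate and balance checks above could be reproduced along these lines. This is a minimal sketch using only the standard library on a tiny in-memory sample. The helper name `quality_report` is hypothetical, and the column names follow the three-column CSV layout this card describes.

```python
import csv
import io
from collections import Counter

SAMPLE = """language,label,text
EN,command,Close the door
ES,command,Cierra la puerta
EN,statement,I live in Paris
ES,statement,Vivo en París
"""

def quality_report(csv_file):
    """Return duplicated texts and per-(language, label) counts for a CSV stream."""
    rows = list(csv.DictReader(csv_file))
    text_counts = Counter(row["text"] for row in rows)
    duplicates = [text for text, n in text_counts.items() if n > 1]
    per_cell = Counter((row["language"], row["label"]) for row in rows)
    return duplicates, per_cell

duplicates, per_cell = quality_report(io.StringIO(SAMPLE))
# Balanced means every (language, label) cell has the same number of rows.
balanced = len(set(per_cell.values())) == 1
print(duplicates, balanced)
```

To run it on the real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` sample. If the counts stated on this card hold, every cell should contain 1,650 rows.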
## Format
CSV with three columns:
```
language,label,text
EN,command,Close the door
ES,command,Cierra la puerta
FR,command,Ferme la porte
```
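- `language`: Two-letter language code, written in uppercase in the sample rows above (EN, ES, FR, DE, IT, PT, NL); these correspond to the BCP 47 codes en, es, fr, de, it, pt, nl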
- `label`: One of {command, exclamation, polar_question, request, statement, wh_question}
- `text`: Sentence in target language (UTF-8)
## Usage
Load with pandas:
```python
import pandas as pd

df = pd.read_csv('sentence_types_multilingual.csv')

english = df[df['language'] == 'EN']      # filter by language
requests = df[df['label'] == 'request']   # filter by label
print(df['label'].value_counts())         # check class balance
```
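For training and evaluating a classifier, a split that keeps every language/label combination equally represented may be useful. The sketch below uses only the standard library. The helper `stratified_split`, the 80/20 ratio, and the fixed seed are assumptions, not part of the dataset or the little-questions library.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split rows so each (language, label) group gives the same share to the test set."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["language"], row["label"])].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy rows: 10 per (language, label) group, standing in for the real CSV rows.
rows = [
    {"language": lang, "label": label, "text": f"{lang}-{label}-{i}"}
    for lang in ("EN", "ES")
    for label in ("command", "request")
    for i in range(10)
]
train, test = stratified_split(rows)
```

With the pandas frame loaded above, `df.to_dict("records")` produces rows in the shape this helper expects.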
## Source & Attribution
Part of the **little-questions** project — a lightweight multilingual question classification library.
Translations generated using **Tower-Plus-2B-GGUF** quantized LLM (Unbabel/Tower-Plus-2B) via llama-cpp-python.
## License
Apache 2.0
Provider:
TigreGotico



