
NNEngine/Sentiment-Analysis-Complex

Hugging Face | Updated 2026-01-22 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/NNEngine/Sentiment-Analysis-Complex
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- classification
- sentiment-analysis
- binary-classification
- complex-text
- jsonl
size_categories:
- 100K<n<1M
---

# Sentiment-Analysis-Complex

## 🧠 Overview

**Sentiment-Analysis-Complex** is a large-scale synthetic sentiment analysis dataset designed for benchmarking modern NLP models under long-context, noisy, and semi-structured text conditions.

The dataset contains **10 million labeled samples** with:

* Long text sequences (**20–40 tokens per sample**)
* Grammar-based sentence construction
* Internet slang and hashtags
* Rich vocabulary diversity
* Balanced binary sentiment labels

It is optimized for:

* Transformer benchmarking
* Tokenizer stress testing
* Long-context modeling
* Robustness evaluation
* Large-scale NLP pipelines

---

## 📦 Dataset Structure

```
Sentiment-Analysis-Complex/
├── train.jsonl   (8,000,000 samples)
├── test.jsonl    (2,000,000 samples)
└── README.md
```

Split ratio:

* **Train:** 80%
* **Test:** 20%

---

## 🧾 Data Format

Each line is a JSON object:

```json
{
  "id": 123456,
  "text": "I really love how this system consistently delivers smooth reliable performance and scalable architecture with intuitive workflow and strong documentation lol #innovation",
  "label": "positive"
}
```

### Fields

| Field   | Type    | Description                              |
| ------- | ------- | ---------------------------------------- |
| `id`    | Integer | Unique sample identifier                 |
| `text`  | String  | Input sentence (20–40 tokens)            |
| `label` | String  | Sentiment class (`positive`, `negative`) |

Encoding: UTF-8 (emoji and special characters supported)

---

## 📊 Dataset Characteristics

* ✔️ Total samples: **10,000,000**
* ✔️ Classes: **positive / negative (balanced)**
* ✔️ Sequence length: **20–40 tokens**
* ✔️ Vocabulary size: 300+ words
* ✔️ Includes slang and hashtags
* ✔️ Grammar-driven generation
* ✔️ Streaming-friendly JSONL format

---

## 🔬 Intended Use

This dataset is suitable for:

* Sentiment classification benchmarking
* Large-scale training pipelines
* Tokenization analysis
* Long-context modeling experiments
* Data loading stress tests
* Distributed training validation
* Synthetic NLP research

---

## ⚠️ Limitations

* Synthetic text: not reflective of the natural distribution of human language.
* Limited semantic depth and discourse structure.
* No real-world bias modeling.
* No multilingual coverage (English only).
* No sarcasm or pragmatic reasoning.

Not recommended for production sentiment systems.

---

## 🤗 How to Load

```python
from datasets import load_dataset

# Download both splits and print a summary of the resulting DatasetDict
dataset = load_dataset("NNEngine/Sentiment-Analysis-Complex")
print(dataset)
```

Streaming mode:

```python
from datasets import load_dataset

# Iterate over samples lazily instead of downloading the full files first
dataset = load_dataset(
    "NNEngine/Sentiment-Analysis-Complex",
    streaming=True
)
```

---

## 🏷️ Tags

```
sentiment-analysis
nlp
synthetic-data
large-scale
text-classification
benchmark
huggingface-dataset
long-context
```

---

## 📜 License

MIT License. Free for research, education, and experimentation.

---

## ✨ Author

Created by **NNEngine** for large-scale NLP benchmarking and experimentation.

---
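The record layout described under Data Format (`id`, `text`, `label`, one JSON object per line) can be checked with a few lines of standard-library Python. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `parse_record` and `label_counts` and the two sample records are hypothetical and exist only for illustration.

```python
import json
from collections import Counter

# Two hypothetical records in the card's JSONL layout (not taken from the dataset).
SAMPLE_LINES = [
    '{"id": 1, "text": "I really love how this system delivers smooth performance lol #innovation", "label": "positive"}',
    '{"id": 2, "text": "honestly this release keeps crashing and the docs are useless smh #fail", "label": "negative"}',
]

# The two sentiment classes listed in the Fields table.
ALLOWED_LABELS = {"positive", "negative"}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it against the documented fields."""
    record = json.loads(line)
    if not isinstance(record.get("id"), int):
        raise ValueError("id must be an integer")
    if not isinstance(record.get("text"), str):
        raise ValueError("text must be a string")
    if record.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {record.get('label')!r}")
    return record


def label_counts(lines) -> Counter:
    """Count labels over an iterable of JSONL lines, skipping blank lines."""
    return Counter(parse_record(line)["label"] for line in lines if line.strip())


counts = label_counts(SAMPLE_LINES)
print(counts)
```

Because iterating over an open text file yields one line at a time, the same function could be applied to a locally downloaded `train.jsonl` by passing `open("train.jsonl", encoding="utf-8")`, which keeps memory use flat regardless of file size.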
Provider:
NNEngine