
shtosti/CATS

Source: Hugging Face | Updated: 2026-03-31 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/shtosti/CATS
Dataset description:
---
license: cc-by-4.0
language:
- en
size_categories:
- 10K<n<100K
---

# 🐾 Taming CATS: Dataset Collection

This repository serves as a **central landing page** for the datasets used in the study:

> **Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens**

🔗 **Code**: https://github.com/shtosti/taming-CATS

📄 **Paper**: TODO (add link once available)

---

## Overview

We provide a **multi-domain collection of datasets** for controllable automatic text simplification (CATS), covering:

- 🏥 medical text
- 🏛️ public administration
- 📚 encyclopedic text

All datasets have been:

- cleaned and filtered
- transformed into a **unified JSON schema**
- enriched with **precomputed control attributes**, including:
  - readability metrics (FKGL, ARI, Dale–Chall)
  - compression ratios (character and word level)

The unified format enables consistent **training, evaluation, and cross-domain comparison**.

---

## Available Datasets

The following datasets are publicly available on Hugging Face:

### 🏥 Med-EASi (Medical Domain)

- 🔗 https://huggingface.co/datasets/shtosti/Med-EASi
- 📌 `shtosti/Med-EASi`

### 🏛️ SimPA (Public Administration Domain)

- 🔗 https://huggingface.co/datasets/shtosti/SimPA
- 📌 `shtosti/SimPA`

### 📚 WikiLarge (Encyclopedic Domain)

- 🔗 https://huggingface.co/datasets/shtosti/WikiLarge_ori_splitwise
- 📌 `shtosti/WikiLarge_ori_splitwise`

Each dataset includes:

- standardized JSONL format
- precomputed readability and compression metrics
- train / validation / test splits
- preprocessing and filtering as described in the paper

## Newsela Dataset

The **Newsela** dataset was also used in this study but **cannot be redistributed** due to licensing restrictions. Researchers can obtain access through the official Newsela data release.


---

## Data Format

All datasets follow a unified structure, including:

- `source_text`
- `simplification_text`
- `source_metrics`
- `target_metrics`
- metadata (domain, dataset, annotation type, etc.)

This schema enables **direct use for controllable generation tasks** without additional preprocessing.

---

## License

This repository serves as an **index of datasets** and does not contain the datasets themselves.

Each dataset is distributed under its **original license**. Please refer to the individual dataset pages for detailed licensing information.

---

## Citation

If you use these datasets, please cite:

TODO add once available
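As a rough illustration of the unified structure listed under Data Format, the sketch below parses one JSONL line, checks for the four documented fields, and builds a control-token prompt. Only the four top-level field names come from the description above. The `fkgl` key inside the metrics objects, the `domain` key, the `<fkgl=...>` token syntax, the helper names, and the sample values are all hypothetical; the dataset pages and the linked code repository are the authority on the actual layout.

```python
import json

# Top-level fields documented in the Data Format section.
REQUIRED_FIELDS = (
    "source_text",
    "simplification_text",
    "source_metrics",
    "target_metrics",
)


def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify the documented fields are present."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


def build_prompt(record: dict, metric: str = "fkgl") -> str:
    """Prefix the source text with a control token for the target metric.

    Hypothetical: both the metric key and the token syntax are
    placeholders, not the format used in the paper.
    """
    target_value = record["target_metrics"][metric]
    return f"<{metric}={target_value:.1f}> {record['source_text']}"


# Made-up record for illustration only; the values are not real data.
sample_line = json.dumps({
    "source_text": "The physician administered the medication intravenously.",
    "simplification_text": "The doctor gave the drug through a vein.",
    "source_metrics": {"fkgl": 17.0},
    "target_metrics": {"fkgl": 3.0},
    "domain": "medical",
})

if __name__ == "__main__":
    record = parse_record(sample_line)
    print(build_prompt(record))
```

Validating the fields up front keeps a malformed line from surfacing later as an opaque `KeyError` in the middle of a training run.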
Provider:
shtosti