
LorenzoVentrone/SentenceSplitter-dataset

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/LorenzoVentrone/SentenceSplitter-dataset
Resource description:
---
license: mit
language:
- it
- en
tags:
- sentence-boundary-detection
- token-classification
- legal-nlp
- multilingual
task_categories:
- token-classification
pretty_name: SentenceSplitter Dataset
size_categories:
- 1K<n<10K
---

# SentenceSplitter Dataset

## Dataset Description

This dataset is designed for Sentence Boundary Disambiguation (SBD) as a token classification task.

Each sample uses the schema:

- `tokens`: list of token strings
- `ner_tags`: list of integer labels aligned with `tokens`
  - `0` = not end of sentence
  - `1` = end of sentence

The dataset is intended for multilingual SBD, with focus on Italian and English, and includes both domain-specific and adversarial patterns.

## Data Sources

The training corpus is created by merging:

1. Professor corpus from `sent_split_data.tar.gz`
2. MultiLegalSBD legal JSONL corpora
3. Wikipedia (`20231101.it`, `20231101.en`)

Current filtering rules used in data preparation:

- Only professor files ending with `-train.sent_split`
- Only legal files ending with `*train.jsonl`

These filters are used to avoid dev/test leakage from source corpora.

## Dataset Splits

Published splits in this dataset repo:

- `train`: 1591 rows
- `validation`: 177 rows
- `test_adversarial`: 59 rows

All splits use the same features:

- `tokens`
- `ner_tags`

## How Splits Are Built

- `train` and `validation` are derived from `unified_training_dataset` with `train_test_split(test_size=0.1, seed=42)`.
- `test_adversarial` is loaded from `comprehensive_test_dataset` generated by the project testset pipeline.

## Intended Uses

- Training and evaluating SBD models for legal/academic/general text.
- Robustness checks on punctuation-heavy and abbreviation-heavy inputs.
- Benchmarking token-classification approaches for sentence segmentation.

## Limitations

- The adversarial split is intentionally difficult and may not represent natural document frequency.
- Source corpora come from different domains and annotation strategies.
- Performance can vary on domains not represented by legal, academic, or encyclopedic text.

## Reproducibility Notes

Core preprocessing choices:

- Sliding window size: 128
- Stride: 100
- Whitespace tokenization at dataset construction stage
- Label alignment to token-level EOS boundaries

Recommended practice:

- Use `validation` for tuning
- Keep `test_adversarial` for final robustness evaluation
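
The preprocessing choices listed above (whitespace tokenization, EOS labels aligned to tokens, a 128-token window with stride 100) can be illustrated with a minimal sketch. This is not the project's actual pipeline: the function names `label_sentences`, `sliding_windows`, and `decode_sentences` are hypothetical, and two details are assumptions not stated in the card, namely that the `1` label sits on the last whitespace token of each sentence and that a shorter final window is kept rather than dropped or padded.

```python
from typing import Iterator, List, Tuple


def label_sentences(sentences: List[str]) -> Tuple[List[str], List[int]]:
    """Whitespace-tokenize gold sentences and mark each sentence-final token with 1.

    Assumption: the EOS label is placed on the last token of every sentence.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def sliding_windows(
    tokens: List[str], labels: List[int], size: int = 128, stride: int = 100
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield overlapping (tokens, ner_tags) windows over one token stream.

    Assumption: the final, possibly shorter, window is emitted as-is.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    if not tokens:
        return
    start = 0
    while True:
        end = start + size
        yield tokens[start:end], labels[start:end]
        if end >= len(tokens):
            break
        start += stride


def decode_sentences(tokens: List[str], labels: List[int]) -> List[str]:
    """Rebuild sentence strings from a tokens / ner_tags pair."""
    sentences: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens with no EOS label (e.g. a cut-off window)
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    toks, tags = label_sentences(
        ["Il sig. Rossi è arrivato.", "See art. 5 of the Act."]
    )
    for window_tokens, window_tags in sliding_windows(toks, tags):
        print(decode_sentences(window_tokens, window_tags))
```

With a stride of 100 and a window of 128, consecutive windows share 28 tokens, so a sentence boundary near the edge of one window also appears with more context in the next one.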
Provider:
LorenzoVentrone