LorenzoVentrone/SentenceSplitter-dataset

Name: LorenzoVentrone/SentenceSplitter-dataset
Creator: LorenzoVentrone
Published: 2026-03-31 11:09:39
License: MIT (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/LorenzoVentrone/SentenceSplitter-dataset

Description:

---
license: mit
language:
- it
- en
tags:
- sentence-boundary-detection
- token-classification
- legal-nlp
- multilingual
task_categories:
- token-classification
pretty_name: SentenceSplitter Dataset
size_categories:
- 1K<n<10K
---

# SentenceSplitter Dataset

## Dataset Description

This dataset is designed for Sentence Boundary Disambiguation (SBD) as a token classification task.

Each sample uses the schema:

- `tokens`: list of token strings
- `ner_tags`: list of integer labels aligned with `tokens`
  - `0` = not end of sentence
  - `1` = end of sentence

The dataset is intended for multilingual SBD, with focus on Italian and English, and includes both domain-specific and adversarial patterns.

## Data Sources

The training corpus is created by merging:

1. Professor corpus from `sent_split_data.tar.gz`
2. MultiLegalSBD legal JSONL corpora
3. Wikipedia (`20231101.it`, `20231101.en`)

Current filtering rules used in data preparation:

- Only professor files ending with `-train.sent_split`
- Only legal files ending with `*train.jsonl`

These filters are used to avoid dev/test leakage from source corpora.

## Dataset Splits

Published splits in this dataset repo:

- `train`: 1591 rows
- `validation`: 177 rows
- `test_adversarial`: 59 rows

All splits use the same features:

- `tokens`
- `ner_tags`

## How Splits Are Built

- `train` and `validation` are derived from `unified_training_dataset` with `train_test_split(test_size=0.1, seed=42)`.
- `test_adversarial` is loaded from `comprehensive_test_dataset` generated by the project testset pipeline.

## Intended Uses

- Training and evaluating SBD models for legal/academic/general text.
- Robustness checks on punctuation-heavy and abbreviation-heavy inputs.
- Benchmarking token-classification approaches for sentence segmentation.

## Limitations

- The adversarial split is intentionally difficult and may not represent natural document frequency.
- Source corpora come from different domains and annotation strategies.
- Performance can vary on domains not represented by legal, academic, or encyclopedic text.

## Reproducibility Notes

Core preprocessing choices:

- Sliding window size: 128
- Stride: 100
- Whitespace tokenization at dataset construction stage
- Label alignment to token-level EOS boundaries

Recommended practice:

- Use `validation` for tuning
- Keep `test_adversarial` for final robustness evaluation
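
The `tokens`/`ner_tags` schema and the windowing parameters listed above (window size 128, stride 100) can be illustrated with a minimal Python sketch. It is not the project's actual preprocessing code: the function names `split_sentences` and `sliding_windows`, the sample tokens, and the way the final partial window is handled are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def split_sentences(tokens: List[str], ner_tags: List[int]) -> List[str]:
    """Rebuild sentences from one sample: a tag of 1 closes the current sentence."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")
    sentences: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, ner_tags):
        current.append(token)
        if tag == 1:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing tokens with no closing end-of-sentence label.
        sentences.append(" ".join(current))
    return sentences


def sliding_windows(
    tokens: List[str],
    ner_tags: List[int],
    size: int = 128,
    stride: int = 100,
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield overlapping (tokens, ner_tags) windows.

    Assumption: the last window is kept even when it is shorter than `size`;
    the dataset card does not say how the original pipeline treats it.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")
    if not tokens:
        return
    start = 0
    while True:
        yield tokens[start:start + size], ner_tags[start:start + size]
        if start + size >= len(tokens):
            break
        start += stride


# Hypothetical whitespace-tokenized sample with abbreviations that are not boundaries.
sample_tokens = ["Il", "sig.", "Rossi", "firma.", "Dr.", "Smith", "agreed."]
sample_tags = [0, 0, 0, 1, 0, 0, 1]
print(split_sentences(sample_tokens, sample_tags))
# ['Il sig. Rossi firma.', 'Dr. Smith agreed.']

# A 250-token document yields windows starting at offsets 0, 100, and 200.
long_tokens = [f"t{i}" for i in range(250)]
long_tags = [0] * 250
print([len(w) for w, _ in sliding_windows(long_tokens, long_tags)])
# [128, 128, 50]
```

With a stride smaller than the window size, consecutive windows share 28 tokens, so a sentence boundary that falls near the edge of one window also appears with more context in the next.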
