dicta-il/hebrew_suffix_verbal_forms
Source: Hugging Face. Updated 2024-09-29; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/dicta-il/hebrew_suffix_verbal_forms
Resource description:
---
license: cc-by-4.0
language:
- he
configs:
- config_name: default
  data_files:
  - split: test
    path: data.jsonl
---
# Suffixed Verbal Forms Detection Dataset for Modern Hebrew
## Dataset Summary
This dataset contains annotated Hebrew sentences, each with a verbal form that is ambiguous as to whether it includes a pronominal suffix (e.g., the Hebrew word spelled lamed-yod-mem-daled-vav can be read as either "he taught him" or "they taught"). The dataset is intended to support the identification and disambiguation of verbs with pronominal suffixes in Hebrew literature and other texts.
The dataset was used for the development of the [OtoBERT](https://huggingface.co/dicta-il/otobert) model released [here](tbd), which aims to identify and simplify these suffixed verb forms in Hebrew texts.
### Example Usage:
```python
from datasets import load_dataset
dataset = load_dataset("dicta-il/hebrew_suffix_verbal_forms")
# Display an example
print(dataset['test'][0])
```
## Dataset Structure
The dataset is provided in the following format:
- `text`: The full sentence containing the ambiguous Hebrew verb.
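- `startOffset`: The start position of the ambiguous verb in the sentence (a zero-based character offset in the sample entry below).
- `endOffset`: The end position of the ambiguous verb in the sentence (exclusive in the sample entry below, so `text[startOffset:endOffset]` yields the verb).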
- `label`: Whether the ambiguous verb contains a pronominal suffix (`With_Suffix`) or does not contain a suffix (`No_Suffix`).
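As a rough illustration of the fields listed above, the following sketch checks that a record has all four fields and that its offsets and label are plausible. The helper `validate_record` and the toy record are hypothetical and not part of the dataset or its tooling; the bounds check assumes that the offsets index characters of `text`.

```python
# The two label values documented in this card.
ALLOWED_LABELS = {"With_Suffix", "No_Suffix"}
REQUIRED_FIELDS = ("text", "startOffset", "endOffset", "label")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    if problems:
        return problems

    start, end = record["startOffset"], record["endOffset"]
    # Assumption: offsets are character indices into `text`.
    if not (0 <= start < end <= len(record["text"])):
        problems.append("offsets out of range")
    if record["label"] not in ALLOWED_LABELS:
        problems.append(f"unknown label: {record['label']}")
    return problems


# Hypothetical toy record, used only to exercise the checks.
toy = {"text": "abc defg hi", "startOffset": 4, "endOffset": 8,
       "label": "No_Suffix"}
print(validate_record(toy))
```

A check like this can be run over every row after loading, before the offsets are used to slice the text.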
### Sample Entry:
```json
{
"text": "בעולם הישיבות לא היה נהוג ללמוד תנ\"ך באופן רשמי וחידשנו לימוד פרק יומי בבוקר אחרי התפילה כחלק מתכנית היום.",
"startOffset": 48,
"endOffset": 55,
"label": "No_Suffix"
}
```
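The sketch below shows how the offsets in the sample entry can be used to recover the ambiguous verb, and one possible way to mark it in the sentence before passing it to a classifier. It treats the offsets as zero-based character indices with an exclusive end, which holds for this sample; whether every row follows the same convention is an assumption. The helper names and the `[V]`/`[/V]` markers are hypothetical and are not taken from OtoBERT or the dataset tooling.

```python
# The sample entry from this card, copied as a Python dict.
sample = {
    "text": 'בעולם הישיבות לא היה נהוג ללמוד תנ"ך באופן רשמי וחידשנו לימוד פרק יומי בבוקר אחרי התפילה כחלק מתכנית היום.',
    "startOffset": 48,
    "endOffset": 55,
    "label": "No_Suffix",
}


def extract_verb(example: dict) -> str:
    """Slice the ambiguous verb out of the sentence using the offsets."""
    return example["text"][example["startOffset"]:example["endOffset"]]


def mark_verb(example: dict, open_tag: str = "[V]", close_tag: str = "[/V]") -> str:
    """Wrap the ambiguous verb in marker strings (hypothetical input format)."""
    text = example["text"]
    start, end = example["startOffset"], example["endOffset"]
    return text[:start] + open_tag + text[start:end] + close_tag + text[end:]


verb = extract_verb(sample)
print(verb)
print(mark_verb(sample))
```

For this entry the slice covers the tenth whitespace-separated word of the sentence, which is the verb the `No_Suffix` label refers to.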
## Annotations
The dataset has been manually annotated to indicate whether the ambiguous verb in each sentence includes a pronominal suffix or not. The labels are:
- `With_Suffix`: The ambiguous verb contains a pronominal suffix.
- `No_Suffix`: The ambiguous verb does not contain a pronominal suffix.
### Split Information:
- **Test**: Contains 2,589 examples with the `No_Suffix` label and 264 examples with the `With_Suffix` label (2,853 in total).
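Because the two labels are heavily imbalanced, plain accuracy can be misleading on this split. The following sketch uses only the counts stated above to compute the accuracy of a trivial classifier that always predicts the majority label, and its F1 score on the minority label. The function `f1_for_label` is an illustrative helper, not part of any evaluation script released with the dataset.

```python
# Label counts as stated in the split information of this card.
counts = {"No_Suffix": 2589, "With_Suffix": 264}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / total


def f1_for_label(gold: list, pred: list, label: str) -> float:
    """F1 score for one label, computed from parallel gold/predicted lists."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Simulate the always-majority baseline over a label list with these counts.
gold = [label for label, n in counts.items() for _ in range(n)]
pred = [majority_label] * total

print(total)
print(round(majority_accuracy, 3))
print(f1_for_label(gold, pred, "With_Suffix"))
```

The baseline reaches roughly 91% accuracy while never identifying a suffixed verb, so per-label precision, recall, and F1 for `With_Suffix` are likely more informative than accuracy when reporting results.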
## Use Cases
This dataset is useful for tasks related to:
- Hebrew Natural Language Processing (NLP)
- Morphological analysis in Semitic languages
- Training models for disambiguation of suffixed verbs in Hebrew literature
- Complex Word Identification
## Dataset Creation
The dataset was created by manually annotating naturally occurring Hebrew sentences, primarily sourced from Hebrew literature and newspapers. Each ambiguous verb was labeled according to its correct morphological form (suffixed or not suffixed).
## Citation
If you use this dataset in your research, please cite:
```bibtex
tbd
```
## License
Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].
[![CC BY 4.0][cc-by-image]][cc-by]
[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
Provider:
dicta-il



