ziaddddd/arabic-legal-embedding-dataset-nli
Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ziaddddd/arabic-legal-embedding-dataset-nli
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: sentence_a
    dtype: string
  - name: sentence_b
    dtype: string
  - name: score
    dtype: float64
  - name: relation
    dtype: string
  - name: passive
    dtype: string
  - name: hard_negative
    dtype: string
  splits:
  - name: train
    num_bytes: 2445963
    num_examples: 8214
  download_size: 705219
  dataset_size: 2445963
---
# Arabic Legal Embedding Dataset (NLI & Semantic Similarity)
## Overview
The **Arabic Legal Embedding Dataset** provides 8,214 Arabic sentence pairs in a single `train` split for **semantic similarity**, **Natural Language Inference (NLI)**, and **sentence embedding** tasks. It covers **legal, administrative, and general-purpose text**; each pair is annotated with a semantic relation and a similarity score. It is intended for training and evaluating models that must distinguish closely related meanings in Arabic legal and administrative contexts.
## Key Features
- **Semantic Annotations**: Each sentence pair includes a `relation` indicating its semantic type:
  - `summary` – concise summary of the first sentence
  - `synonym` – paraphrases or semantically equivalent sentences
  - `paraphrase` – sentences reworded differently but retaining the same meaning
  - `contradiction` – sentences with opposing meanings
  - `weak` – loosely related or contextually weak sentences
  - `hard_negative` – sentences appearing similar but semantically unrelated
  - `entailment` – one sentence logically follows from the other
  - `passive` – rephrasing in passive voice
- **Similarity Scores**: Each pair is scored between 0 and 1, quantifying semantic similarity.
- **Legal & Administrative Focus**: Ideal for training **embeddings, NLI models, or semantic search systems** for Arabic legal documents, contracts, administrative texts, and general-purpose applications.
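
For three-way NLI training, one option is to collapse the eight `relation` values into coarse classes. The sketch below is an illustration only: the `COARSE_LABEL` mapping and the helper names are hypothetical and not defined by the dataset, and treating `weak` and `hard_negative` as `neutral` is an assumption you may want to change.

```python
from collections import Counter

# ASSUMPTION: this grouping is illustrative and is not defined by the dataset.
COARSE_LABEL = {
    "summary": "entailment",
    "synonym": "entailment",
    "paraphrase": "entailment",
    "entailment": "entailment",
    "passive": "entailment",
    "contradiction": "contradiction",
    "weak": "neutral",
    "hard_negative": "neutral",
}


def to_coarse_label(relation: str) -> str:
    """Map a fine-grained relation to a hypothetical three-way NLI label."""
    try:
        return COARSE_LABEL[relation]
    except KeyError:
        raise ValueError(f"unexpected relation: {relation!r}") from None


def label_counts(rows):
    """Count coarse labels over an iterable of dict rows with a 'relation' key."""
    return Counter(to_coarse_label(row["relation"]) for row in rows)


# Relations and scores from this card's example entries (sentence text omitted).
rows = [
    {"relation": "summary", "score": 0.93},
    {"relation": "hard_negative", "score": 0.58},
    {"relation": "passive", "score": 0.85},
    {"relation": "entailment", "score": 0.82},
    {"relation": "paraphrase", "score": 0.82},
]
print(label_counts(rows))
```

Raising on an unknown relation, rather than silently defaulting to `neutral`, makes it easier to notice labels that this card does not list.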
## Dataset Structure
| Field | Type | Description |
|-------|------|-------------|
| `sentence_a` | str | First sentence |
| `sentence_b` | str | Second sentence |
| `score` | float | Semantic similarity score (0–1) |
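| `relation` | str | Type of relation (`summary`, `synonym`, `paraphrase`, `contradiction`, `weak`, `hard_negative`, `entailment`, `passive`) |
| `passive` | str | Declared as a string column in the dataset metadata; not described in this card |
| `hard_negative` | str | Declared as a string column in the dataset metadata; not described in this card |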
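
A minimal sketch of checking one row against the fields described above. The `validate_row` helper and the `RELATIONS` set are hypothetical names introduced here for illustration; the sample row is copied from this card's example entries.

```python
import math

# Relation labels listed in this card; the set name is illustrative.
RELATIONS = {
    "summary", "synonym", "paraphrase", "contradiction",
    "weak", "hard_negative", "entailment", "passive",
}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (an empty list means OK)."""
    problems = []
    # Both sentences must be non-empty strings.
    for field in ("sentence_a", "sentence_b"):
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} must be a non-empty string")
    # The score must be a real number in the closed interval [0, 1].
    score = row.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)) or math.isnan(score):
        problems.append("score must be a number")
    elif not 0.0 <= score <= 1.0:
        problems.append("score must lie in [0, 1]")
    # The relation must be one of the labels listed in the card.
    if row.get("relation") not in RELATIONS:
        problems.append(f"unknown relation: {row.get('relation')!r}")
    return problems


sample = {
    "sentence_a": "إحنا محتاجين مترجم محلف عشان نترجم المستندات دي.",
    "sentence_b": "إحنا محتاجين خبير فني يترجم الرموز اللي موجودة في كشوف الحسابات.",
    "score": 0.58,
    "relation": "hard_negative",
}
print(validate_row(sample))  # prints []
```

Rows to validate could come from the Hugging Face `datasets` library, for example `load_dataset("ziaddddd/arabic-legal-embedding-dataset-nli", split="train")`, assuming the repository id matches the download link above.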
## Example Entries
```json
[
  {
    "sentence_a": "يرجى التوجه إلى كاتب الجلسة لوضع توقيعك في السجل الرسمي الذي يثبت حضورك للدفاع اليوم.",
    "sentence_b": "يجب التوقيع على إثبات حضور الجلسة في المحضر الرسمي لضمان الحقوق الإجرائية.",
    "score": 0.93,
    "relation": "summary"
  },
  {
    "sentence_a": "إحنا محتاجين مترجم محلف عشان نترجم المستندات دي.",
    "sentence_b": "إحنا محتاجين خبير فني يترجم الرموز اللي موجودة في كشوف الحسابات.",
    "score": 0.58,
    "relation": "hard_negative"
  },
  {
    "sentence_a": "الجلسة كانت زحمة جداً وما قدرناش نتكلم غير كلمتين.",
    "sentence_b": "شُهد ازدحام شديد في الجلسة ولم يُسمح إلا بحديث مقتضب.",
    "score": 0.85,
    "relation": "passive"
  },
  {
    "sentence_a": "عم نحاول نتواصل مع المندوب بس تلفونه مغلق.",
    "sentence_b": "هناك صعوبة مؤقتة في الوصول لجهة التوصيل المسؤولة.",
    "score": 0.82,
    "relation": "entailment"
  },
  {
    "sentence_a": "عم نحاول نتواصل مع المندوب بس تلفونه مغلق.",
    "sentence_b": "هناك صعوبة مؤقتة في الوصول لجهة التواصل المسؤولة.",
    "score": 0.82,
    "relation": "paraphrase"
  }
]
```
Provider:
ziaddddd



