ziaddddd/arabic-legal-embedding-dataset-nli

Name: ziaddddd/arabic-legal-embedding-dataset-nli
Creator: ziaddddd
Published: 2026-03-31 14:25:22
License: No description available

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/ziaddddd/arabic-legal-embedding-dataset-nli

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: sentence_a
    dtype: string
  - name: sentence_b
    dtype: string
  - name: score
    dtype: float64
  - name: relation
    dtype: string
  - name: passive
    dtype: string
  - name: hard_negative
    dtype: string
  splits:
  - name: train
    num_bytes: 2445963
    num_examples: 8214
  download_size: 705219
  dataset_size: 2445963
---

# Arabic Legal Embedding Dataset (NLI & Semantic Similarity)

## Overview

The **Arabic Legal Embedding Dataset** is a high-quality dataset designed for **semantic similarity**, **Natural Language Inference (NLI)**, and **sentence embedding tasks** in Arabic. It focuses on **legal, administrative, and general-purpose text**, providing a rich set of sentence pairs annotated with semantic relationships.

This dataset enables building **robust AI models** capable of understanding nuanced meanings in Arabic legal and administrative contexts.

## Key Features

- **Semantic Annotations**: Each sentence pair includes a `relation` indicating its semantic type:
  - `summary` – concise summary of the first sentence
  - `synonym` – paraphrases or semantically equivalent sentences
  - `paraphrase` – sentences reworded differently but retaining the same meaning
  - `contradiction` – sentences with opposing meanings
  - `weak` – loosely related or contextually weak sentences
  - `hard_negative` – sentences appearing similar but semantically unrelated
  - `entailment` – one sentence logically follows from the other
  - `passive` – rephrasing in passive voice
- **Similarity Scores**: Each pair is scored between 0 and 1, quantifying semantic similarity.
- **Legal & Administrative Focus**: Ideal for training **embeddings, NLI models, or semantic search systems** for Arabic legal documents, contracts, administrative texts, and general-purpose applications.

## Dataset Structure

| Field | Type | Description |
|-------|------|-------------|
| `sentence_a` | str | First sentence |
| `sentence_b` | str | Second sentence |
| `score` | float | Semantic similarity score (0–1) |
| `relation` | str | Type of relation (`summary`, `synonym`, `paraphrase`, `contradiction`, `weak`, `hard_negative`, `entailment`, `passive`) |

## Example Entries

```json
{
  "sentence_a": "يرجى التوجه إلى كاتب الجلسة لوضع توقيعك في السجل الرسمي الذي يثبت حضورك للدفاع اليوم.",
  "sentence_b": "يجب التوقيع على إثبات حضور الجلسة في المحضر الرسمي لضمان الحقوق الإجرائية.",
  "score": 0.93,
  "relation": "summary"
}

{
  "sentence_a": "إحنا محتاجين مترجم محلف عشان نترجم المستندات دي.",
  "sentence_b": "إحنا محتاجين خبير فني يترجم الرموز اللي موجودة في كشوف الحسابات.",
  "score": 0.58,
  "relation": "hard_negative"
}

{
  "sentence_a": "الجلسة كانت زحمة جداً وما قدرناش نتكلم غير كلمتين.",
  "sentence_b": "شُهد ازدحام شديد في الجلسة ولم يُسمح إلا بحديث مقتضب.",
  "score": 0.85,
  "relation": "passive"
}

{
  "sentence_a": "عم نحاول نتواصل مع المندوب بس تلفونه مغلق.",
  "sentence_b": "هناك صعوبة مؤقتة في الوصول لجهة التوصيل المسؤولة.",
  "score": 0.82,
  "relation": "entailment"
}

{
  "sentence_a": "عم نحاول نتواصل مع المندوب بس تلفونه مغلق.",
  "sentence_b": "هناك صعوبة مؤقتة في الوصول لجهة التواصل المسؤولة.",
  "score": 0.82,
  "relation": "paraphrase"
}
```
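
The field table and relation list above suggest a simple sanity check before training. The following is a minimal sketch, not part of the original dataset card: `validate_row`, `relation_counts`, and `load_train_rows` are hypothetical helper names, and the loader assumes the Hugging Face `datasets` package is installed and that the repository is reachable under the ID shown on this page.

```python
from collections import Counter

# Relation labels as documented in the dataset card.
RELATIONS = {
    "summary", "synonym", "paraphrase", "contradiction",
    "weak", "hard_negative", "entailment", "passive",
}


def validate_row(row):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []
    for key in ("sentence_a", "sentence_b"):
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} is missing or empty")
    score = row.get("score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        problems.append("score is not a number in [0, 1]")
    if row.get("relation") not in RELATIONS:
        problems.append("relation is not one of the documented labels")
    return problems


def relation_counts(rows):
    """Count documented relation labels over the rows that pass validation."""
    return Counter(row["relation"] for row in rows if not validate_row(row))


def load_train_rows():
    """Hypothetical loader: fetch the train split from the Hugging Face Hub.

    Assumes `pip install datasets` and network access; not called on import.
    """
    from datasets import load_dataset

    return load_dataset(
        "ziaddddd/arabic-legal-embedding-dataset-nli", split="train"
    )
```

Calling `relation_counts(load_train_rows())` would give a per-label tally of the valid rows. Records that fail `validate_row` are skipped rather than repaired, which keeps the check read-only.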

Provided by:

ziaddddd
