stjiris/IRIS_sts

Name: stjiris/IRIS_sts
Creator: stjiris
Published: 2024-04-17 09:02:17
License: MIT

Source: Hugging Face. Updated 2024-04-17; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/stjiris/IRIS_sts

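Summary:

The IRIS Legal Dataset contains pairs of legal sentences from the Portuguese Supreme Court of Justice, intended for semantic textual similarity (STS) tasks. The dataset is in Portuguese, is released under the MIT license, and has more than 10K entries. Sentence pairs fall into three groups by score: 0-1 marks random sentence pairs, 2-4 marks sentence pairs taken from the same summary, and 4-5 marks sentence pairs generated with OpenAI's text-davinci-003. The dataset was developed as part of the IRIS project, which aims to build a semantic search system for the Portuguese Supreme Court of Justice.

Provider:

stjiris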

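Original information summary

Dataset overview

Name: IRIS Legal Dataset
Language: Portuguese (pt)
License: MIT
Multilinguality: monolingual
Size category: more than 10K
Source data: original
Task category: text classification
Task IDs:
- text scoring
- semantic similarity scoring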

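Dataset contents

Purpose: semantic textual similarity analysis
Data structure:
- Values 0-1: random sentence pairs across documents
- Values 2-4: sentence pairs from the same summary (implying some level of entailment)
- Values 4-5: sentence pairs generated with OpenAI's text-davinci-003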
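
The score ranges above can be sketched as a small lookup. This is only an illustration: the function name `describe_score`, the `SCORE_RANGES` table, and the choice of inclusive bounds are assumptions, not part of the dataset. Because the listed ranges overlap at 4 and leave gaps (for example between 1 and 2), the sketch returns every matching label and an empty list when no range applies.

```python
from typing import List, Tuple

# Score ranges as stated in the dataset description.
# Inclusive bounds are an assumption; the listing does not say how
# boundary values are handled.
SCORE_RANGES: List[Tuple[float, float, str]] = [
    (0.0, 1.0, "random sentence pair across documents"),
    (2.0, 4.0, "sentence pair from the same summary"),
    (4.0, 5.0, "sentence pair generated with text-davinci-003"),
]


def describe_score(score: float) -> List[str]:
    """Return every origin label whose stated range contains the score."""
    return [label for low, high, label in SCORE_RANGES if low <= score <= high]


if __name__ == "__main__":
    for s in (0.5, 1.5, 3.0, 4.0, 5.0):
        print(s, describe_score(s))
```

A score of 4 matches two ranges here, so the score alone does not identify how a pair was produced at that boundary.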

Contributors

@rufimelo99

Citation

@InProceedings{MeloSemantic,
  author="Melo, Rui and Santos, Pedro A. and Dias, Jo{\~a}o",
  editor="Moniz, Nuno and Vale, Zita and Cascalho, Jos{\'e} and Silva, Catarina and Sebasti{\~a}o, Raquel",
  title="A Semantic Search System for the Supremo Tribunal de Justi{\c{c}}a",
  booktitle="Progress in Artificial Intelligence",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="142--154",
  abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justi{\c{c}}a (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
  isbn="978-3-031-49011-8"
}
