PITTI/MicRou
Source: Hugging Face. Updated 2024-03-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/PITTI/MicRou
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- fr
tags:
- legal
- finance
pretty_name: MicRou
size_categories:
- n<1K
---
# MicRou
## Introduction
The documents in this dataset were gathered for a RAG project in memory of [Michel Rouger](https://www.pitti.io/articles/michel-rouger). They come from his personal archives and include his own work as well as work produced by other authors during projects he ran.
## Datasets
The [MicRou repository](https://github.com/pappitti/MicRou) includes 2 datasets in French:
1. microu
This dataset includes approximately 850 documents in French (books, articles, minutes of debates) produced between 1998 and 2020. It covers justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics, among other topics. Overall it represents between 1.5 million and 2 million tokens, depending on the tokenizer you use.
In many cases, a document is one part of a larger source that was split because its parts can be read independently (e.g. the chapters of a book or the articles of a newsletter). It is nonetheless possible to recombine the entire source: within a "dossier", group by date and, within each group, order by index. Documents that do not come from a larger source have an index of 0 by default.
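
The recombination described above can be sketched in plain Python. This is a minimal illustration, not code from the MicRou repository: the field names `dossier`, `date`, `index` and `text` are assumptions based on the description, and the actual column names in the dataset may differ.

```python
from collections import defaultdict


def recombine_sources(records, text_key="text"):
    """Rebuild full sources from their parts.

    Assumes each record is a dict with "dossier", "date" and "index" keys
    (hypothetical names) plus a text field named by `text_key`.
    """
    groups = defaultdict(list)
    for rec in records:
        # Parts of the same source share a dossier and a date.
        groups[(rec["dossier"], rec["date"])].append(rec)

    combined = {}
    for key, parts in groups.items():
        # Within a group, the index gives the reading order.
        ordered = sorted(parts, key=lambda r: r["index"])
        combined[key] = "\n\n".join(r[text_key] for r in ordered)
    return combined


# Toy rows standing in for real dataset records.
sample = [
    {"dossier": "A", "date": "2005-03-01", "index": 2, "text": "Chapter 2"},
    {"dossier": "A", "date": "2005-03-01", "index": 1, "text": "Chapter 1"},
    {"dossier": "B", "date": "2010-06-15", "index": 0, "text": "Standalone article"},
]
combined = recombine_sources(sample)
```

Note that standalone documents (index 0) that happen to share a dossier and a date would be merged by this sketch, so a real pipeline may need to keep them apart.
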
2. microu-chunked
As part of the RAG project, we used an embeddings model, [Solon](https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1), which has a context window of 512 tokens, so we had to split the MicRou dataset into chunks. This is the resulting dataset.
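
To illustrate why a 512-token window forces chunking, here is a minimal sketch of fixed-size splitting with optional overlap. It is not the method used to build microu-chunked, which the card does not describe. Whitespace-separated words stand in for real tokens, and the function names and the `overlap` parameter are hypothetical; in practice the counts would come from the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=0):
    """Split a token list into windows of at most `max_tokens` items.

    Consecutive windows share `overlap` tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the input.
        if start + max_tokens >= len(tokens):
            break
    return chunks


def chunk_text(text, max_tokens=512, overlap=0):
    """Chunk text using whitespace words as a rough proxy for tokens."""
    tokens = text.split()
    return [" ".join(c) for c in chunk_tokens(tokens, max_tokens, overlap)]


# A 1200-word toy document split into windows of at most 512 words.
demo_chunks = chunk_text("mot " * 1200, max_tokens=512, overlap=32)
```

Because subword tokenizers usually produce more tokens than there are words, a word-based limit of 512 would overshoot the real window; a production version would count tokens with the model's tokenizer instead.
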
## License (CC-BY-NC-SA-4.0)
The dataset is currently under a restrictive license. We plan to convert it to an open license once we have finalized the review with the rights holders. Some documents may be excluded following the review, but we also plan to add others over time.
Provider:
PITTI
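Summary of original information
MicRou dataset overview
Introduction
The MicRou dataset consists of documents from Michel Rouger's personal archives, including his own work as well as work produced by other authors during projects he ran.
Dataset contents
MicRou comprises two French-language datasets:
- microu
  - Contains approximately 850 documents in French (books, articles, minutes of debates) spanning 1998 to 2020.
  - Covers justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics, among other fields.
  - Totals between 1.5 million and 2 million tokens, depending on the tokenizer used.
  - Many documents come from a larger source that was split into parts that can be read independently. The full source can be recombined using the date and the index.
- microu-chunked
  - As part of the RAG project, the Solon embeddings model, which has a context window of 512 tokens, was used, so the MicRou dataset was split into chunks.
License
The dataset is currently under a restrictive license (CC-BY-NC-SA-4.0). The plan is to convert it to an open license once the review with the rights holders is complete. Some documents may be excluded during the review, but others are also expected to be added over time.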



