
PITTI/MicRou

Hugging Face, updated 2024-03-12, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/PITTI/MicRou
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- fr
tags:
- legal
- finance
pretty_name: MicRou
size_categories:
- n<1K
---

# MicRou

## Introduction

The documents that constitute the dataset were gathered for a RAG project in memory of [Michel Rouger](https://www.pitti.io/articles/michel-rouger): the documents were part of his personal archives and include his own work as well as work produced by other authors during projects he ran.

## Datasets

The [MicRou repository](https://github.com/pappitti/MicRou) includes 2 datasets in French:

1. microu

   This dataset includes approximately 850 documents in French (books, articles, minutes of debates) produced between 1998 and 2020. It covers justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics, among other topics. Overall it represents between 1.5 million and 2 million tokens, depending on the tokenizer you use.

   In many cases, the documents stem from a larger source that was broken down because its parts could be considered independently (e.g. different chapters of a book or different articles of a newsletter). It is nonetheless possible to recombine the entire source: within a "dossier", you can group by date and, within each group, order by index. Documents that do not come from a larger source have an index of 0 by default.

2. microu-chunked

   As part of the RAG project, we used an embeddings model, [Solon](https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1), with a context window of 512 tokens, so we had to split the MicRou dataset into chunks. This is the resulting dataset.

## License (CC-BY-NC-SA-4.0)

The dataset is currently under a restrictive license. We plan to convert it to an open license once we have finalized the review of the right holders. Some documents may be excluded following the review, but we also plan to add others over time.
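
The recombination step described above (group by date within a "dossier", then order by index) could be sketched as follows. This is only an illustration under assumptions: the field names `dossier`, `date`, `index` and `text` are hypothetical and may not match the dataset's actual column names, and the sample rows are invented for the example.

```python
from itertools import groupby
from operator import itemgetter


def recombine_sources(records):
    """Rebuild larger sources from their parts.

    Parts are grouped by (dossier, date) and ordered by index within
    each group. The keys "dossier", "date", "index" and "text" are
    assumed names, not confirmed column names of the dataset.
    """
    # Sort first so that groupby sees each (dossier, date) group contiguously
    # and the parts inside a group are already ordered by index.
    ordered = sorted(records, key=itemgetter("dossier", "date", "index"))
    return {
        group_key: [part["text"] for part in parts]
        for group_key, parts in groupby(ordered, key=itemgetter("dossier", "date"))
    }


# Hypothetical sample rows, for illustration only.
records = [
    {"dossier": "A", "date": "2005-01-01", "index": 2, "text": "chapter 2"},
    {"dossier": "A", "date": "2005-01-01", "index": 1, "text": "chapter 1"},
    {"dossier": "B", "date": "2010-06-15", "index": 0, "text": "standalone article"},
]
sources = recombine_sources(records)
```

Each key of `sources` identifies one original source, and each value lists its parts in reading order. A document with an index of 0 simply ends up in a group of its own.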
Provider:
PITTI
Summary of original information

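MicRou dataset overview

Introduction

The MicRou dataset consists of documents from the personal archives of Michel Rouger, including his own work as well as work produced by other authors during projects he ran.

Dataset contents

The MicRou repository includes two datasets in French:

  1. microu

    • Includes approximately 850 documents in French (books, articles, minutes of debates) produced between 1998 and 2020.
    • Covers justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics, among other topics.
    • Represents between 1.5 million and 2 million tokens in total, depending on the tokenizer used.
    • Many documents stem from a larger source that was split into parts that can be considered independently. The full source can be recombined: within a "dossier", group by date, then order by index within each group.
  2. microu-chunked

    • As part of the RAG project, an embeddings model, Solon, with a context window of 512 tokens was used, so the MicRou dataset was split into chunks. This is the resulting dataset.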
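
To illustrate why a 512-token context window forces chunking, here is a minimal sketch of fixed-size splitting. It is not the method used to build microu-chunked, which the source does not describe: whitespace splitting stands in for the embedding model's real tokenizer, and the function names are hypothetical.

```python
def chunk_tokens(tokens, max_tokens=512):
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def chunk_text(text, max_tokens=512):
    """Chunk a text using whitespace tokens.

    Assumption: whitespace splitting is only a rough proxy. A real
    pipeline would count tokens with the embedding model's own tokenizer.
    """
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), max_tokens)]


# Invented input: 1200 whitespace-separated tokens.
chunks = chunk_text("mot " * 1200)
```

With this input the text is split into three chunks, none longer than 512 whitespace tokens. Real token counts depend on the tokenizer, so a production splitter would likely yield different boundaries.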

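License

The dataset is currently under a restrictive license (CC-BY-NC-SA-4.0). The plan is to convert it to an open license once the review of the right holders is finalized. Some documents may be excluded following the review, and others are expected to be added over time.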
