ufca-llms/quati

Source: Hugging Face. Updated 2026-03-30. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ufca-llms/quati
Description:
---
license: cc-by-4.0
language:
- pt
multilinguality: monolingual
task_categories:
- text-retrieval
task_ids: []
config_names:
- default
- corpus
- queries
tags:
- text
pretty_name: Quati 1M Jua-like
size_categories:
- 1M<n<10M
source_datasets:
- unicamp-dl/quati
dataset_info:
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: int64
  splits:
  - name: test
    num_examples: 1933
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 1000000
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 200
configs:
- config_name: default
  data_files:
  - split: test
    path: qrels/test.jsonl
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.jsonl
- config_name: queries
  data_files:
  - split: queries
    path: queries.jsonl
---

# Quati 1M Jua-like

This dataset is a structural conversion of [unicamp-dl/quati](https://huggingface.co/datasets/unicamp-dl/quati) into a layout compatible with the repository organization used by [ufca-llms/jua](https://huggingface.co/datasets/ufca-llms/jua).

It uses the Quati 1M document collection and preserves the source evaluation setup instead of creating synthetic supervised training labels.

## Dataset Summary

- `corpus.jsonl`: 1,000,000 passages in JSONL format with fields `_id`, `title`, and `text`
- `queries.jsonl`: 200 topics in JSONL format with fields `_id` and `text`
- `qrels/test.tsv`: 1,933 evaluation judgments in TSV format
- `qrels/test.jsonl`: 1,933 evaluation judgments in JSONL format

The source Quati dataset card states that only validation qrels are currently available. For that reason, this conversion includes `test` qrels only and does not create `train` qrels.
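
As a minimal sketch of how the evaluation judgments might be consumed, the snippet below reads a JSONL file and groups the records by query. It assumes a local copy of `qrels/test.jsonl` with the `query-id`, `corpus-id`, and `score` fields listed in the metadata above. The helper names `read_jsonl` and `build_qrels` are hypothetical and not part of the dataset or of any library.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def build_qrels(records: Iterable[dict]) -> Dict[str, Dict[str, int]]:
    """Group judgments as {query-id: {corpus-id: score}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for record in records:
        qrels[record["query-id"]][record["corpus-id"]] = int(record["score"])
    return dict(qrels)
```

Calling `build_qrels(read_jsonl("qrels/test.jsonl"))` on a downloaded copy should produce one entry per judged query. The same `read_jsonl` helper also works for `corpus.jsonl` and `queries.jsonl`, streaming one record at a time so the 1,000,000-passage corpus does not have to be held in memory.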

## Data Structure

### corpus.jsonl

Each line is a JSON object with the following fields:

- `_id`: original Quati passage identifier
- `title`: empty string placeholder for compatibility with the Jua layout
- `text`: passage text

Example:

```json
{"_id":"clueweb22-pt0000-00-00003_1","title":"","text":"Se você precisar de ajuda..."}
```

### queries.jsonl

Each line is a JSON object with the following fields:

- `_id`: query identifier in the form `QUATI-<query_id>-q`
- `text`: query text

Example:

```json
{"_id":"QUATI-1-q","text":"Qual a maior característica da fauna brasileira?"}
```

### qrels/test.tsv

Tab-separated file with header:

```tsv
query-id	corpus-id	score
```

### qrels/test.jsonl

Each line is a JSON object with the following fields:

- `query-id`
- `corpus-id`
- `score`

## Source Mapping

The conversion is based on these files from the original Quati dataset:

- `quati_1M.tsv`
- `topics/quati_all_topics.tsv`
- `topics/quati_test_topics.tsv`
- `qrels/quati_1M_qrels.txt`

Mapping rules:

- Quati `passage_id` -> `_id`
- Quati `passage` -> `text`
- Quati `query_id` -> `QUATI-<query_id>-q`
- Quati qrels -> `qrels/test.tsv` and `qrels/test.jsonl`

## Limitations

- This is a format conversion, not a new annotation effort.
- `title` is empty because the Quati 1M passage file does not provide titles.
- No `qrels/train.*` files are included because the source dataset does not publish supervised training qrels for the 1M collection.

## Citation

If you use this dataset, please cite the original Quati dataset:

```bibtex
@misc{bueno2024quati,
  title={Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers},
  author={Mirelle Bueno and Eduardo Seiti de Oliveira and Rodrigo Nogueira and Roberto A. Lotufo and Jayr Alencar Pereira},
  year={2024},
  eprint={2404.06976},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
```
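
The mapping rules listed under Source Mapping can be illustrated with a short sketch. This is not the conversion script used to build the dataset. The function names are hypothetical, and the `query` parameter stands in for whatever column holds the topic text in the source topics files, which this card does not name.

```python
from typing import Dict


def to_jua_query_id(query_id) -> str:
    """Apply the rule: Quati `query_id` -> `QUATI-<query_id>-q`."""
    return f"QUATI-{query_id}-q"


def to_jua_passage(passage_id: str, passage: str) -> Dict[str, str]:
    """Apply the rules: `passage_id` -> `_id`, `passage` -> `text`.

    `title` is an empty placeholder because the Quati 1M passage file
    does not provide titles.
    """
    return {"_id": passage_id, "title": "", "text": passage}


def to_jua_query(query_id, query: str) -> Dict[str, str]:
    """Build a queries.jsonl record (the `query` argument is an assumed input)."""
    return {"_id": to_jua_query_id(query_id), "text": query}
```

For example, `to_jua_query_id(1)` returns `"QUATI-1-q"`, matching the `_id` shown in the `queries.jsonl` example.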
Provider: ufca-llms