
lighton-ms-marco-mini

ModelScope Community. Updated 2025-11-07. Indexed 2025-01-18.
Download link:
https://modelscope.cn/datasets/lightonai/lighton-ms-marco-mini
Resource overview:
# ms-marco-mini

This dataset gathers a small number of samples from [MS MARCO](https://microsoft.github.io/msmarco/) to provide an example of how triplet-based and knowledge distillation datasets are formatted.

#### `triplet` subset

The `triplet` file is all that is needed to fine-tune a model with a contrastive loss.

* Columns: "query", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:

```python
{
    "query": "what are the liberal arts?",
    "positive": 'liberal arts. 1. the academic course of instruction at a college intended to provide general knowledge and comprising the arts, humanities, natural sciences, and social sciences, as opposed to professional or technical subjects.',
    "negative": 'The New York State Education Department requires 60 Liberal Arts credits in a Bachelor of Science program and 90 Liberal Arts credits in a Bachelor of Arts program. In the list of course descriptions, courses which are liberal arts for all students are identified by (Liberal Arts) after the course number.'
}
```

* Datasets

```python
from datasets import load_dataset

dataset = load_dataset("lightonai/lighton-ms-marco-mini", "triplet", split="train")
```

#### `knowledge distillation` subset

To fine-tune a model with a knowledge distillation loss, three distinct files are needed:

* Datasets

```python
from datasets import load_dataset

train = load_dataset(
    "lightonai/lighton-ms-marco-mini",
    "train",
    split="train",
)

queries = load_dataset(
    "lightonai/lighton-ms-marco-mini",
    "queries",
    split="train",
)

documents = load_dataset(
    "lightonai/lighton-ms-marco-mini",
    "documents",
    split="train",
)
```

Where:

- `train` contains three columns: `['query_id', 'document_ids', 'scores']`

```python
{
    "query_id": 54528,
    "document_ids": [
        6862419,
        335116,
        339186,
        7509316,
        7361291,
        7416534,
        5789936,
        5645247,
    ],
    "scores": [
        0.4546215673141326,
        0.6575686537173476,
        0.26825184192900203,
        0.5256195579370395,
        0.879939718687207,
        0.7894968184862693,
        0.6450100468854655,
        0.5823844608171467,
    ],
}
```

The `document_ids` and `scores` lists must have the same length: each score belongs to the document ID at the same position.

- `queries` contains two columns: `['query_id', 'text']`

```python
{"query_id": 749480, "text": "what is function of magnesium in human body"}
```

- `documents` contains two columns: `['document_id', 'text']`

```python
{
    "document_id": 136062,
    "text": "2. Also called tan .a fundamental trigonometric function that, in a right triangle, is expressed as the ratio of the side opposite an acute angle to the side adjacent to that angle. 3. in immediate physical contact; touching; abutting. 4. a. touching at a single point, as a tangent in relation to a curve or surface.lso called tan .a fundamental trigonometric function that, in a right triangle, is expressed as the ratio of the side opposite an acute angle to the side adjacent to that angle. 3. in immediate physical contact; touching; abutting. 4. a. touching at a single point, as a tangent in relation to a curve or surface.",
}
```
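
The three knowledge distillation files reference each other by ID, so a training example has to be assembled by looking up the query and each document. The sketch below shows one possible way to do that, together with the length check on `document_ids` and `scores`. It runs on small in-memory toy rows rather than the real dataset. The helper names `validate_row` and `resolve_row`, the toy IDs, and the toy passages are hypothetical. The document key is assumed to be `document_id`, as in the example record.

```python
def validate_row(row):
    """Raise if the row does not have exactly one score per document ID."""
    n_docs, n_scores = len(row["document_ids"]), len(row["scores"])
    if n_docs != n_scores:
        raise ValueError(
            f"query {row['query_id']}: {n_docs} document_ids but {n_scores} scores"
        )


def resolve_row(row, queries_by_id, documents_by_id):
    """Replace the IDs in a `train` row with the corresponding texts."""
    validate_row(row)
    return {
        "query": queries_by_id[row["query_id"]],
        "documents": [documents_by_id[doc_id] for doc_id in row["document_ids"]],
        "scores": list(row["scores"]),
    }


# Toy rows that mimic the three files (hypothetical IDs and passages).
queries = [{"query_id": 1, "text": "what is function of magnesium in human body"}]
documents = [
    {"document_id": 10, "text": "toy passage A"},
    {"document_id": 11, "text": "toy passage B"},
]
train = [{"query_id": 1, "document_ids": [10, 11], "scores": [0.9, 0.2]}]

# Build lookup tables once, then resolve each training row.
queries_by_id = {q["query_id"]: q["text"] for q in queries}
documents_by_id = {d["document_id"]: d["text"] for d in documents}

example = resolve_row(train[0], queries_by_id, documents_by_id)
print(example)
```

The same functions should work on rows from `load_dataset`, since iterating over a loaded split yields one dictionary per row; building the two lookup dictionaries first avoids scanning the query and document files for every training row.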

Provider:
maas
Created:
2025-01-17