lighton-ms-marco-mini
Source: ModelScope community. Updated 2025-11-07; indexed 2025-01-18.
Download link:
https://modelscope.cn/datasets/lightonai/lighton-ms-marco-mini
Description:
# ms-marco-mini
This dataset gathers a small number of samples from [MS MARCO](https://microsoft.github.io/msmarco/) to illustrate how to format a dataset for triplet-based training or for knowledge distillation.
#### `triplet` subset
The `triplet` file is all we need to fine-tune a model with a contrastive loss.
* Columns: "query", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
{
"query": "what are the liberal arts?",
"positive": 'liberal arts. 1. the academic course of instruction at a college intended to provide general knowledge and comprising the arts, humanities, natural sciences, and social sciences, as opposed to professional or technical subjects.',
"negative": 'The New York State Education Department requires 60 Liberal Arts credits in a Bachelor of Science program and 90 Liberal Arts credits in a Bachelor of Arts program. In the list of course descriptions, courses which are liberal arts for all students are identified by (Liberal Arts) after the course number.'
}
```
* Loading with `datasets`:
```python
from datasets import load_dataset
dataset = load_dataset("lightonai/lighton-ms-marco-mini", "triplet", split="train")
```
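To make the link between a triplet row and a contrastive loss concrete, here is a minimal sketch. It is an assumption, not the method used by the dataset authors: the source does not say which contrastive loss is intended, so this uses one common form, a softmax cross-entropy over the query's similarity to the positive and to the negative. The function names, the temperature value, and the toy embedding vectors are all hypothetical; in practice the vectors would come from an encoder applied to the `query`, `positive`, and `negative` strings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_contrastive_loss(query_vec, positive_vec, negative_vec, temperature=0.05):
    """Cross-entropy over two candidates, with the positive as the target.

    Assumed loss form (not stated in the source): -log softmax(pos) where the
    softmax runs over the temperature-scaled similarities to both passages.
    """
    pos = cosine(query_vec, positive_vec) / temperature
    neg = cosine(query_vec, negative_vec) / temperature
    # Numerically stable log-sum-exp over the two candidates.
    m = max(pos, neg)
    log_z = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_z - pos


# Toy embeddings standing in for encoder outputs (hypothetical values).
query_vec = [1.0, 0.0]
good = triplet_contrastive_loss(query_vec, [0.9, 0.1], [0.0, 1.0])  # positive is closer
bad = triplet_contrastive_loss(query_vec, [0.0, 1.0], [0.9, 0.1])   # negative is closer
print(good < bad)
```

The loss is small when the query is more similar to the positive than to the negative, and large in the opposite case, which is the signal a fine-tuning loop would minimize.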
#### `knowledge distillation` subset
To fine-tune a model with a knowledge distillation loss, we need three distinct files: `train`, `queries`, and `documents`.
* Loading with `datasets`:
```python
from datasets import load_dataset
train = load_dataset(
    "lightonai/lighton-ms-marco-mini",
    "train",
    split="train",
)
queries = load_dataset(
    "lightonai/lighton-ms-marco-mini",
    "queries",
    split="train",
)
documents = load_dataset(
    "lightonai/lighton-ms-marco-mini",
    "documents",
    split="train",
)
```
Where:
- `train` contains three distinct columns: `['query_id', 'document_ids', 'scores']`
```python
{
    "query_id": 54528,
    "document_ids": [
        6862419,
        335116,
        339186,
        7509316,
        7361291,
        7416534,
        5789936,
        5645247,
    ],
    "scores": [
        0.4546215673141326,
        0.6575686537173476,
        0.26825184192900203,
        0.5256195579370395,
        0.879939718687207,
        0.7894968184862693,
        0.6450100468854655,
        0.5823844608171467,
    ],
}
```
Make sure that `document_ids` and `scores` have the same length: each score corresponds to the document at the same position.
- `queries` contains two distinct columns: `['query_id', 'text']`
```python
{"query_id": 749480, "text": "what is function of magnesium in human body"}
```
- `documents` contains two distinct columns: `['document_id', 'text']`
```python
{
"document_id": 136062,
"text": "2. Also called tan .a fundamental trigonometric function that, in a right triangle, is expressed as the ratio of the side opposite an acute angle to the side adjacent to that angle. 3. in immediate physical contact; touching; abutting. 4. a. touching at a single point, as a tangent in relation to a curve or surface.lso called tan .a fundamental trigonometric function that, in a right triangle, is expressed as the ratio of the side opposite an acute angle to the side adjacent to that angle. 3. in immediate physical contact; touching; abutting. 4. a. touching at a single point, as a tangent in relation to a curve or surface.",
}
```
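The three files reference each other by id, so a training step typically has to resolve a `train` row back into text. The following is a minimal sketch of that join, not code from the dataset authors. It assumes the document key is `document_id`, as in the example record, and it uses small in-memory rows with made-up ids and texts in place of the loaded datasets. The helper names `build_index` and `resolve_row` are hypothetical.

```python
def build_index(rows, key):
    """Map each row's id (stored under `key`) to its text."""
    return {row[key]: row["text"] for row in rows}


def resolve_row(row, query_index, document_index):
    """Replace the ids of one `train` row with the corresponding texts."""
    # Each score belongs to the document at the same position.
    if len(row["document_ids"]) != len(row["scores"]):
        raise ValueError("document_ids and scores must have the same length")
    return {
        "query": query_index[row["query_id"]],
        "documents": [document_index[doc_id] for doc_id in row["document_ids"]],
        "scores": list(row["scores"]),
    }


# Toy rows with hypothetical ids and texts, shaped like the three files.
queries = [{"query_id": 1, "text": "what is a tangent"}]
documents = [
    {"document_id": 10, "text": "A tangent touches a curve at a single point."},
    {"document_id": 11, "text": "Magnesium supports muscle function."},
]
train_row = {"query_id": 1, "document_ids": [10, 11], "scores": [0.9, 0.2]}

resolved = resolve_row(
    train_row,
    build_index(queries, "query_id"),
    build_index(documents, "document_id"),
)
print(resolved["query"], len(resolved["documents"]))
```

Building the two indexes once and reusing them for every row avoids scanning `queries` and `documents` repeatedly. The same functions should work on the loaded datasets, since iterating over them yields one dictionary per row.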
Provider: maas
Created: 2025-01-17



