garbage_competition
Source: ModelScope community (魔搭社区). Updated 2026-04-23; indexed 2025-02-22.
## Dataset
Dataset page: https://modelscope.cn/datasets/swift/garbage_competition
Load the dataset:
```python
from modelscope import MsDataset
dataset = MsDataset.load('swift/garbage_competition', split='train')
test_dataset = MsDataset.load('swift/garbage_competition', split='test')
print(dataset)
print(test_dataset)
"""
Dataset({
    features: ['images', 'label', 'label_name'],
    num_rows: 100000
})
Dataset({
    features: ['images', 'label', 'label_name'],
    num_rows: 2650
})
"""
print(dataset[0])
print(test_dataset[0])
"""
{'images': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=750x643 at 0x7F03D82937C0>, 'label': 169, 'label_name': '可回收物-电话'}
{'images': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=720x540 at 0x7F03D81FCD00>, 'label': None, 'label_name': None}
"""
```
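The training record printed above carries a `label_name` of `可回收物-电话` (roughly "recyclable waste - telephone"), which suggests a `category-item` naming pattern. The following is a minimal sketch, assuming every label name follows that pattern (the source shows only this one example). `split_label_name` is a hypothetical helper and not part of modelscope.

```python
from typing import Optional, Tuple


def split_label_name(label_name: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a 'category-item' label name into (category, item).

    Assumption: the first hyphen separates the coarse category from the item.
    Test rows have label_name=None, which maps to (None, None).
    """
    if label_name is None:
        return None, None
    category, sep, item = label_name.partition('-')
    if not sep:  # no hyphen found: treat the whole string as the category
        return label_name, None
    return category, item


print(split_label_name('可回收物-电话'))  # ('可回收物', '电话')
```

Applied over the training split, such a helper could be used to count how many of the fine-grained labels fall under each coarse category.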
## Baseline
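This baseline uses the ms-swift large-model training framework to fine-tune Qwen2.5-VL-3B-Instruct on this dataset with LoRA.
- ms-swift GitHub: https://github.com/modelscope/ms-swift
- The baseline needs 14 GB of GPU memory and runs on the free A10 compute offered by ModelScope.
- Baseline accuracy: 0.8528301886792453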
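The reported accuracy can be related to the size of the test split. The count of correct predictions below is inferred from the two published numbers (accuracy and 2650 test rows); the source does not state it directly.

```python
# Inferred, not stated in the source: the number of correct predictions
# implied by the reported accuracy on the 2650-row test split.
reported_acc = 0.8528301886792453
num_test = 2650

implied_correct = round(reported_acc * num_test)
print(implied_correct)  # 2260

# The implied fraction reproduces the reported accuracy up to float precision.
print(abs(implied_correct / num_test - reported_acc) < 1e-12)  # True
```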
Environment setup:
```shell
pip install ms-swift -U
```
Single-GPU training:
```python
# GPU Memory: 14GB
import os
from typing import Dict, Any

# Environment variables are set before swift is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = str(1280 * 28 * 28)

from swift.llm import (
    TrainArguments, sft_main, register_dataset, DatasetMeta, ResponsePreprocessor, SubsetDataset
)


class CustomPreprocessor(ResponsePreprocessor):

    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Every sample gets the same fixed query.
        row['query'] = 'Task: Sorting Waste.'
        return super().preprocess(row)


# Register the dataset with its 'train' and 'test' subsets.
register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/garbage_competition',
        preprocess_func=CustomPreprocessor(),
        subsets=[SubsetDataset('train', split=['train']), SubsetDataset('test', split=['test'])]
    ))

if __name__ == '__main__':
    sft_main(TrainArguments(
        model='Qwen/Qwen2.5-VL-3B-Instruct',
        dataset=['swift/garbage_competition:train#20000'],  # sample only 20000 rows to save time
        train_type='lora',
        torch_dtype='bfloat16',
        num_train_epochs=1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        learning_rate=1e-4,
        lora_rank=8,
        lora_alpha=32,
        target_modules=['all-linear'],
        freeze_vit=True,
        gradient_accumulation_steps=16,
        eval_steps=50,
        save_steps=50,
        save_total_limit=2,
        logging_steps=5,
        max_length=2048,
        output_dir='output',
        warmup_ratio=0.05,
        dataset_num_proc=4,
        dataloader_num_workers=4,
        num_labels=265,
        task_type='seq_cls',
        use_chat_template=False
    ))
```
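The hyperparameters above fix the rough shape of the training run. The sketch below works out the effective batch size and approximate step counts. It assumes that all 20000 sampled rows are used for training (any rows the framework holds out for validation would lower the totals) and that warmup steps are rounded up; exact rounding depends on the trainer.

```python
import math

# Values copied from the TrainArguments above.
num_samples = 20000                  # 'train#20000'
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_train_epochs = 1
warmup_ratio = 0.05
save_steps = 50
num_gpus = 1                         # CUDA_VISIBLE_DEVICES = '0'

# Assumption: no rows are held out for validation.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
num_saves = total_steps // save_steps

print(effective_batch, total_steps, warmup_steps, num_saves)  # 16 1250 63 25
```

Under these assumptions the run takes about 1250 optimizer steps and reaches a save point about 25 times, of which `save_total_limit=2` keeps only two checkpoints on disk.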
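## Submitting results
We provide an inference script. Submit the jsonl file that it writes to the `infer_result` directory. The competition page only accepts files with a `.json` extension, so rename the file to `result.json` without changing its contents.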
```python
# GPU Memory: 14GB
import os
from typing import Dict, Any

# Environment variables are set before swift is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = str(1280 * 28 * 28)

from swift.llm import (
    InferArguments, infer_main, register_dataset, DatasetMeta, ResponsePreprocessor, SubsetDataset
)


class CustomPreprocessor(ResponsePreprocessor):

    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Same fixed query as in training.
        row['query'] = 'Task: Sorting Waste.'
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/garbage_competition',
        preprocess_func=CustomPreprocessor(),
        subsets=[SubsetDataset('train', split=['train']), SubsetDataset('test', split=['test'])]
    ))

ckpt_dir = 'output/vx-xxx/checkpoint-xxx'  # last_checkpoint
result = infer_main(InferArguments(
    adapters=[ckpt_dir],
    temperature=0,
    val_dataset="swift/garbage_competition:test",
    infer_backend='pt'))
# The results are saved to `{ckpt_dir}/infer_result/xxx-xxx.jsonl`; submit that file.
# (The competition page only accepts files with a json extension,
# so rename it to `result.json` without changing its contents.)
```
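The rename step can be scripted. The following is a minimal sketch, assuming the inference output sits under `{ckpt_dir}/infer_result/` as a `.jsonl` file as described in the script comments. `export_submission` is a hypothetical helper and not part of ms-swift; it copies the most recently modified file byte for byte, so the content stays unchanged.

```python
import glob
import os
import shutil


def export_submission(ckpt_dir: str, out_path: str = 'result.json') -> str:
    """Copy the newest jsonl under {ckpt_dir}/infer_result to out_path.

    Returns the path of the source file that was copied.
    """
    candidates = glob.glob(os.path.join(ckpt_dir, 'infer_result', '*.jsonl'))
    if not candidates:
        raise FileNotFoundError(f'no jsonl files found under {ckpt_dir}/infer_result')
    # If several inference runs exist, pick the most recently modified one.
    newest = max(candidates, key=os.path.getmtime)
    shutil.copyfile(newest, out_path)
    return newest
```

It would be called with the same `ckpt_dir` value used for inference, for example `export_submission(ckpt_dir)`.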
The submitted jsonl file has the following format. Rows must follow the order of `test_dataset`, 2650 rows in total:
```
{"response": "25"}
{"response": "xxx"}
{"response": "xxx"}
```
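Before uploading, the file can be checked against the format above. The following is a minimal sketch: `validate_submission` is a hypothetical helper, and the valid label range 0 to 264 is an assumption inferred from `num_labels=265` in the training script, not something the source states.

```python
import json
from typing import List


def validate_submission(path: str, expected_rows: int = 2650, num_labels: int = 265) -> List[str]:
    """Return a list of problems found in a jsonl submission (empty if none)."""
    problems = []
    with open(path, encoding='utf-8') as f:
        lines = [line for line in f if line.strip()]
    if len(lines) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(lines)}')
    for i, line in enumerate(lines):
        try:
            label = int(json.loads(line)['response'])
        except (ValueError, KeyError, TypeError):
            problems.append(f'row {i}: missing or non-integer "response"')
            continue
        # Assumed range: label ids run from 0 to num_labels - 1.
        if not 0 <= label < num_labels:
            problems.append(f'row {i}: label {label} outside [0, {num_labels})')
    return problems
```

An empty return value means the row count and every `response` value passed the checks; it says nothing about whether the predictions are correct.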
Scoring script:
```python
from swift.utils import read_from_jsonl
from datasets import load_dataset

# Ground-truth labels for the test split.
labels = load_dataset('parquet', data_files='test_with_labels.parquet', split='train')['label']
# Submitted predictions, one JSON object per line.
results = read_from_jsonl('result.jsonl')

count = 0
for res, label in zip(results, labels):
    if int(res['response']) == label:
        count += 1
# Accuracy: fraction of submitted rows whose predicted label matches.
print(f'acc: {count / len(results)}')
```
Provider: maas
Created: 2025-02-15



