
garbage_competition

ModelScope community. Updated 2026-04-23. Indexed 2025-02-22.
Download link:
https://modelscope.cn/datasets/swift/garbage_competition
Description:
## Dataset

Dataset URL: https://modelscope.cn/datasets/swift/garbage_competition

Download the dataset:

```python
from modelscope import MsDataset

dataset = MsDataset.load('swift/garbage_competition', split='train')
test_dataset = MsDataset.load('swift/garbage_competition', split='test')
print(dataset)
print(test_dataset)
"""
Dataset({
    features: ['images', 'label', 'label_name'],
    num_rows: 100000
})
Dataset({
    features: ['images', 'label', 'label_name'],
    num_rows: 2650
})
"""
print(dataset[0])
print(test_dataset[0])
"""
{'images': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=750x643 at 0x7F03D82937C0>, 'label': 169, 'label_name': '可回收物-电话'}
{'images': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=720x540 at 0x7F03D81FCD00>, 'label': None, 'label_name': None}
"""
```

The train split has 100000 labeled rows. The test split has 2650 rows and ships without labels: `label` and `label_name` are both `None`.

## Baseline

This section describes a baseline that uses the ms-swift large-model training framework to fine-tune Qwen2.5-VL-3B-Instruct with LoRA on this dataset.

- ms-swift GitHub: https://github.com/modelscope/ms-swift
- The baseline needs 14 GB of GPU memory and runs on the free A10 compute offered by ModelScope.
- Baseline accuracy: 0.8528301886792453

Environment setup:

```shell
pip install ms-swift -U
```

Single-GPU training:

```python
# GPU Memory: 14GB
import os
from typing import Dict, Any

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = str(1280 * 28 * 28)

from swift.llm import (
    TrainArguments, sft_main, register_dataset, DatasetMeta, ResponsePreprocessor, SubsetDataset
)


class CustomPreprocessor(ResponsePreprocessor):
    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Every row gets the same fixed query text.
        row['query'] = 'Task: Sorting Waste.'
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/garbage_competition',
        preprocess_func=CustomPreprocessor(),
        subsets=[SubsetDataset('train', split=['train']), SubsetDataset('test', split=['test'])]
    ))

if __name__ == '__main__':
    sft_main(TrainArguments(
        model='Qwen/Qwen2.5-VL-3B-Instruct',
        dataset=['swift/garbage_competition:train#20000'],  # to save time, use only 20000 training rows
        train_type='lora',
        torch_dtype='bfloat16',
        num_train_epochs=1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        learning_rate=1e-4,
        lora_rank=8,
        lora_alpha=32,
        target_modules=['all-linear'],
        freeze_vit=True,
        gradient_accumulation_steps=16,
        eval_steps=50,
        save_steps=50,
        save_total_limit=2,
        logging_steps=5,
        max_length=2048,
        output_dir='output',
        warmup_ratio=0.05,
        dataset_num_proc=4,
        dataloader_num_workers=4,
        num_labels=265,
        task_type='seq_cls',
        use_chat_template=False
    ))
```

## Submitting results

We provide an inference script. Submit the jsonl file that the script writes to the `infer_result` directory. (The competition page only accepts files with a `.json` suffix, so rename the file to `result.json`; do not change its contents.)

```python
# GPU Memory: 14GB
import os
from typing import Dict, Any

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = str(1280 * 28 * 28)

from swift.llm import (
    InferArguments, infer_main, register_dataset, DatasetMeta, ResponsePreprocessor, SubsetDataset
)


class CustomPreprocessor(ResponsePreprocessor):
    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Must match the query used during training.
        row['query'] = 'Task: Sorting Waste.'
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/garbage_competition',
        preprocess_func=CustomPreprocessor(),
        subsets=[SubsetDataset('train', split=['train']), SubsetDataset('test', split=['test'])]
    ))

ckpt_dir = 'output/vx-xxx/checkpoint-xxx'  # last_checkpoint
result = infer_main(InferArguments(
    adapters=[ckpt_dir],
    temperature=0,
    val_dataset="swift/garbage_competition:test",
    infer_backend='pt'))
# The result is saved to `{ckpt_dir}/infer_result/xxx-xxx.jsonl`; submit that file.
# (The competition page only accepts files with a json suffix, so rename it to
# `result.json`; do not change its contents.)
```

The submitted jsonl file has the following format. Rows follow the same order as `test_dataset`, 2650 rows in total:

```
{"response": "25"}
{"response": "xxx"}
{"response": "xxx"}
```

Scoring script:

```python
from swift.utils import read_from_jsonl
from datasets import load_dataset

# Gold labels for the test split.
labels = load_dataset('parquet', data_files='test_with_labels.parquet', split='train')['label']
# Submitted predictions, one JSON object per line.
results = read_from_jsonl('result.jsonl')
count = 0
for res, label in zip(results, labels):
    if int(res['response']) == label:
        count += 1
print(f'acc: {count / len(results)}')
```
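
Because the submission must contain exactly 2650 rows in test-set order and each `response` is compared as an integer, it can help to check the file before renaming it. The following is a minimal sketch, not part of the official tooling: the function name `validate_submission` is hypothetical, and the check that labels fall in `[0, 265)` is an assumption based on `num_labels=265` in the baseline.

```python
import json
import shutil
from pathlib import Path

NUM_TEST_ROWS = 2650  # size of the test split stated in the dataset card
NUM_LABELS = 265      # num_labels used by the baseline; the label range is an assumption


def validate_submission(src, dst="result.json",
                        num_rows=NUM_TEST_ROWS, num_labels=NUM_LABELS):
    """Check a jsonl prediction file; copy it unchanged to dst if it looks valid.

    Returns a list of human-readable problems (empty when the file passes).
    """
    problems = []
    text = Path(src).read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) != num_rows:
        problems.append(f"expected {num_rows} rows, found {len(lines)}")
    for idx, line in enumerate(lines):
        try:
            label = int(json.loads(line)["response"])
        except (ValueError, KeyError, TypeError):
            # Covers invalid JSON, a missing key, and non-integer responses.
            problems.append(f"row {idx}: no integer 'response'")
            continue
        if not 0 <= label < num_labels:
            problems.append(f"row {idx}: label {label} outside [0, {num_labels})")
    if not problems:
        # Byte-for-byte copy: only the file name changes, not the contents.
        shutil.copyfile(src, dst)
    return problems
```

Calling `validate_submission('path/to/predictions.jsonl')` with the file produced by the inference script returns an empty list and writes `result.json` when every row passes; otherwise nothing is written and the list names the offending rows.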

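
The scoring loop calls `int()` on every `response`, so one non-numeric value raises `ValueError`, and `zip` silently truncates if the two lists differ in length. The sketch below shows one way to score defensively when testing locally. The function name `accuracy` is hypothetical, and counting malformed rows as wrong is an assumption, not the documented behavior of the official scorer.

```python
def accuracy(responses, labels):
    """Fraction of rows whose 'response' parses to the gold label.

    responses: list of dicts such as {"response": "25"}
    labels:    list of integer gold labels in the same order
    Rows that are missing 'response' or are not integers count as wrong.
    """
    if len(responses) != len(labels):
        raise ValueError(
            f"got {len(responses)} responses for {len(labels)} labels")
    correct = 0
    for res, label in zip(responses, labels):
        try:
            pred = int(res["response"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed row: treated as an incorrect prediction
        if pred == label:
            correct += 1
    return correct / len(labels) if labels else 0.0
```

For example, `accuracy([{"response": "25"}, {"response": "xxx"}], [25, 3])` scores the first row as correct and the second as wrong instead of raising.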
Provider:
maas
Created:
2025-02-15