End-to-end example-based sim-to-real RL policy transfer based on neural stylisation with application to robotic cutting

Name: End-to-end example-based sim-to-real RL policy transfer based on neural stylisation with application to robotic cutting
Creator: figshare
Published: 2025-11-09 00:10:09
License: not specified

DataCite Commons. Updated: 2025-11-09. Indexed: 2026-02-09.

Download link:

https://figshare.com/articles/dataset/End-to-end_example-based_sim-to-real_RL_policy_transfer_based_on_neural_stylisation_with_application_to_robotic_cutting/28983659/1


Resource description:

This repository contains the source datasets used to generate the variational autoencoder (VAE) model and the style-transfer datasets for the paper "End-to-end example-based sim-to-real RL policy transfer based on neural stylisation with application to robotic cutting" submitted to Nature Scientific Reports.

Summary of contents of this repository:

vae.zip

This archive contains CSV files split into training (train) and validation (val) datasets. They were used to train the variational autoencoder and its conditional variant, as well as the CycleGAN generator and discriminator networks. The train and val folders are each split into simulated (sim) and real datasets. The contents are as follows:

train:
  sim: 680 items
  real: 118 items
val:
  sim: 20 items
  real: 30 items

Each item contains episodic trajectories from contact cutting experiments using a rotary slitting saw tool with different process parameter selection strategies. In simulation, these were taken with a pitch angle of 0.126 rad, a radius of 25 mm, a cutter width of 0.0005 m (0.5 mm), 50 cutting elements (flutes) and a spindle speed of 1000 rpm, with variable radial depth of cut and feed rate, material mechanistic constants and geometries. In the real world, the same conditions were used, except with 100 cutting elements and a spindle speed of 500 rpm.
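As a quick sanity check after extracting vae.zip, the split sizes listed above can be compared against what is on disk. The following is a minimal sketch, assuming the archive extracts to a root folder containing train/ and val/, each with sim/ and real/ subfolders of .csv files. The root path "vae" and the helper name count_items are hypothetical and not part of the dataset.

```python
from pathlib import Path

# Item counts as stated in the dataset description.
EXPECTED = {
    ("train", "sim"): 680,
    ("train", "real"): 118,
    ("val", "sim"): 20,
    ("val", "real"): 30,
}


def count_items(root):
    """Count CSV files per (split, domain) folder under `root`.

    ASSUMPTION: a <root>/<split>/<domain>/*.csv layout. Folders that
    do not exist simply yield a count of 0.
    """
    root = Path(root)
    return {
        (split, domain): len(list((root / split / domain).glob("*.csv")))
        for split, domain in EXPECTED
    }


if __name__ == "__main__":
    counts = count_items("vae")  # hypothetical extraction folder
    for (split, domain), expected in EXPECTED.items():
        found = counts[(split, domain)]
        status = "ok" if found == expected else "MISMATCH"
        print(f"{split}/{domain}: found {found}, expected {expected} ({status})")
```

A mismatch here would most likely point to a different extraction layout than the one assumed, rather than to missing data.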
The contents of each item are as follows:

ee_pos_X - XYZ end-effector position (located at cutter tip)
ee_quat_X - WXYZ (scalar-first) quaternion representing end-effector orientation
time_reward - reward for time elapsed prior to task completion
dev_reward - reward for deviation from reference path
productivity_reward - reward heuristic for material removal volume
force_reward - reward for tool load, measured by force-torque sensor
obs_0 - component of velocity parallel to reference path
obs_1-3 - error relative to reference path
obs_4-6 - end-effector velocity
obs_7-9 - measured force at force-torque sensor
obs_10-12 - measured torque at force-torque sensor
obs_13 - time offset to nominal path (represented as time-parameterised B-spline)
obs_14 - depth of cut offset to nominal path
obs_15-17 - operational space control stiffness
action_0 - feed rate adjustment (0.1x to 2x, normalised to the range [0,1]) relative to the nominal feed rate (1.5 m/min for sim, 0.75 m/min for real)
action_1 - time derivative of depth of cut offset
action_2-4 - operational space control stiffness (normalised to the range [0,1])
terminal_observation_X - terminal observation for obs_X at end of episode

Item suffixes denote the strategy used to collect the trajectory.

Real item suffixes:
_origpolicy / _policy - taken with expert policy trained in simulation environment
_identif_xxx - taken with fixed process parameters, variable feed rate
_bc / _dagger - taken with expert policy adapted with GP force model, trained with BC / DAgger

Sim item suffixes:
rand_baseline - taken with baseline (fixed 1 mm depth of cut, 1.5 m/min feed rate)
rand_dummy - taken with random process parameters, fixed throughout trial
rand_policy - taken with expert policy
rand_randX - taken with random actions every timestep
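To make the column layout concrete, the sketch below reads one item with Python's standard csv module, picks out the observation columns obs_0 to obs_12, and converts action_0 back to a feed rate. It assumes the CSV headers match the column names listed above exactly, and that action_0 maps linearly from [0,1] onto the 0.1x to 2x multiplier range. The description does not state the mapping, so feed_rate_m_per_min should be read as illustrative. The helper names and the file name in the usage comment are hypothetical.

```python
import csv

OBS_COLUMNS = [f"obs_{i}" for i in range(13)]  # obs_0 .. obs_12
NOMINAL_FEED = {"sim": 1.5, "real": 0.75}      # m/min, from the description


def feed_rate_m_per_min(action_0, domain):
    """Map normalised action_0 in [0, 1] to a feed rate in m/min.

    ASSUMPTION: a linear mapping onto the 0.1x-2x multiplier range.
    """
    multiplier = 0.1 + action_0 * (2.0 - 0.1)
    return multiplier * NOMINAL_FEED[domain]


def load_item(path, domain):
    """Return (observations, feed_rates) for one CSV item.

    observations: one 13-element list of floats per timestep.
    feed_rates:   one feed rate in m/min per timestep.
    Raises ValueError if a required cell is empty or non-numeric.
    """
    observations, feed_rates = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            observations.append([float(row[c]) for c in OBS_COLUMNS])
            feed_rates.append(feed_rate_m_per_min(float(row["action_0"]), domain))
    return observations, feed_rates


# Usage (hypothetical file name):
# obs, feeds = load_item("vae/train/sim/example_rand_policy.csv", "sim")
```

Under the linear assumption, action_0 = 0 corresponds to 0.1 times the nominal feed rate and action_0 = 1 to twice the nominal feed rate.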
Columns obs_0 through obs_12 were used for VAE training.

NOTE: The reward columns in the real folders do not contain meaningful data!

policy/

This folder contains pickled trajectories, in the form of a Python list. The list's elements are TrajWithRew dataclass objects from the Imitation Python library (https://imitation.readthedocs.io/en/latest/).

TrajWithRew contains the following main fields:
obs - the (unnormalised) observations, in the form of a [WINDOW_LENGTH x NUM_CHANNELS] array
acts - the actions, in the form of a [(WINDOW_LENGTH - 1) x NUM_ACTS] array
infos - the info values at each timestep, as a [WINDOW_LENGTH - 1] array of dicts
terminals - boolean indicating if that trajectory segment is a terminal segment
rews - the rewards, as a [WINDOW_LENGTH - 1] array

Each TrajWithRew does not represent a full episodic trajectory, as is usually the case with Imitation. Instead, each represents a segment of a full episodic trajectory, of length WINDOW_LENGTH. The observations are of length WINDOW_LENGTH; the remaining fields are of length WINDOW_LENGTH - 1. This allows a next observation (s') to be given for every transition: each trajectory can be further decomposed into an array of transitions, which are simply the Markov Decision Process (s,a,r,s') tuples.

The filename prefix contains the name of the model used to perform style transfer:
st - neural style transfer
cvae - conditional variational autoencoder
gan - CycleGAN

Style transfer is only carried out on the first 13 observations (obs_0 through obs_12). The last 5 observations (13:18) are action observations and are left unmodified. These are zeroed during policy re-training to avoid over-fitting. While the observations are "style transferred", the actions are those from the original policy, as rolled out in the simulation environment in which it was trained.

rollout_trajectories_x50_denormed contains the raw episodic trajectories from 50 simulation rollouts, containing unnormalised observations, prior to windowing and style transfer.
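The relationship between a windowed segment and its transitions can be sketched as follows. This is a minimal illustration using NumPy and a hypothetical Segment dataclass that only mirrors the obs, acts and rews fields described above. It is not the Imitation library's own class, and the toy array values are made up. The sketch pairs each observation with its successor to form (s, a, r, s') tuples and, optionally, zeroes the action observations at indices 13:18 as described for policy re-training.

```python
from dataclasses import dataclass

import numpy as np

ACTION_OBS = slice(13, 18)  # the 5 action observations (13:18)


@dataclass
class Segment:
    """Hypothetical stand-in for one windowed trajectory segment."""
    obs: np.ndarray   # shape [WINDOW_LENGTH, NUM_CHANNELS]
    acts: np.ndarray  # shape [WINDOW_LENGTH - 1, NUM_ACTS]
    rews: np.ndarray  # shape [WINDOW_LENGTH - 1]


def to_transitions(seg, zero_action_obs=True):
    """Split one segment into aligned (s, a, r, s') arrays.

    The input segment is left unmodified; zeroing is done on a copy.
    """
    obs = seg.obs.copy()
    if zero_action_obs:
        obs[:, ACTION_OBS] = 0.0
    s, s_next = obs[:-1], obs[1:]  # each step is paired with its successor
    assert len(s) == len(seg.acts) == len(seg.rews)
    return s, seg.acts, seg.rews, s_next


# Toy segment: window of 4 steps, 18 observation channels, 5 actions.
window, channels, n_acts = 4, 18, 5
seg = Segment(
    obs=np.arange(window * channels, dtype=float).reshape(window, channels),
    acts=np.zeros((window - 1, n_acts)),
    rews=np.ones(window - 1),
)
s, a, r, s_next = to_transitions(seg)
print(s.shape, a.shape, r.shape, s_next.shape)  # (3, 18) (3, 5) (3,) (3, 18)
```

Because obs holds one more entry than acts and rews, every action in the window has both a preceding and a following observation, which is why no transition is lost at the window boundary. Loading the actual pickled lists with Python's pickle module would additionally require the Imitation library to be installed, since unpickling needs the original class to be importable.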


Provider:

figshare

Created:

2025-05-09