MDSplusML Project Progress and Revised Plan
Source: DataONE. Updated: 2025-09-22. Indexed: 2025-11-01.
Download link:
https://search.dataone.org/view/sha256:915120b38cf4d0ba5e44b28928f111cacc7b82797b64c2e5c3f0f555375ee9a6
Description:
The MDSplusML project set out to modernize fusion-experiment data access by improving performance, usability, and compliance with the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. Initial benchmarks on our on-premises systems compared the legacy distributed-client, thin-client, and direct HDF5/HSDS access methods using a representative machine-learning workload of ten thousand shots. We found that network transaction latency, not expression-evaluation complexity, dominated data retrieval times. Adopting the thin-client protocol reduced a multi-hour bulk download to tens of minutes, and raw HDF5 reads matched local-disk speeds; HSDS underperformed, which motivates ongoing optimization. Guided by these findings, we have refined our roadmap. On-site users will use enhanced thin-client APIs (getMany, getManyMany) to batch requests efficiently. For cloud distribution, we propose a “frozen-signals” service: precomputed, read-only snapshots of user-tagged data, with unnecessary nodes pruned to control file size and with no run-time evaluation overhead. Prototypes in Python and C demonstrate that, once the files are downloaded from S3, analyses on frozen HDF5 proceed at native speed, irrespective of network latency. Finally, to support cross-machine collaboration, we are aligning our curated datasets with IMAS standards and evaluating repository platforms such as InvenioRDM for scalable, FAIR-compliant archiving. Together, these efforts establish a high-performance, user-centered MDSplus ecosystem that meets both current and emerging needs in fusion data science.
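The claim that per-transaction latency, rather than data volume, dominates retrieval can be illustrated with a simple cost model. The sketch below is not the project's benchmark: the function `retrieval_time` and every number in it (signals per shot, signal size, latency, bandwidth) are hypothetical values chosen only to show why batching requests, as getMany-style calls do, reduces total time.

```python
from math import ceil


def retrieval_time(n_signals, bytes_per_signal, latency_s, bandwidth_bps, batch_size=1):
    """Estimate wall time (seconds) to fetch n_signals over a network.

    Each round trip pays a fixed latency; batch_size signals share one round trip.
    The transfer term depends only on total bytes, so batching does not change it.
    """
    round_trips = ceil(n_signals / batch_size)
    transfer_s = n_signals * bytes_per_signal / bandwidth_bps
    return round_trips * latency_s + transfer_s


# Hypothetical workload: 10,000 shots with an assumed 20 signals per shot.
shots = 10_000
signals_per_shot = 20          # assumption, not from the source
n_signals = shots * signals_per_shot
bytes_per_signal = 80_000      # assumption: about 10k float64 samples
latency_s = 0.02               # assumption: 20 ms per request
bandwidth_bps = 100e6          # assumption: 100 MB/s, in bytes per second

one_by_one = retrieval_time(n_signals, bytes_per_signal, latency_s, bandwidth_bps)
per_shot_batches = retrieval_time(
    n_signals, bytes_per_signal, latency_s, bandwidth_bps, batch_size=signals_per_shot
)

print(f"one request per signal: {one_by_one:.0f} s")
print(f"one request per shot:   {per_shot_batches:.0f} s")
```

Under these assumed numbers the transfer term is identical in both cases, so the whole difference comes from the number of round trips. That is the effect the batched thin-client calls are meant to exploit.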
Created:
2025-10-28



