Coded computing: Mitigating fundamental bottlenecks in large-scale data analytics
收藏Mendeley Data2024-01-31 更新2024-06-29 收录
下载链接:
https://digitallibrary.usc.edu/asset-management/2A3BF1W4488I
下载链接
链接失效反馈官方服务:
资源简介:
In this dissertation, we introduce the concept of ""coded computing"", which is a novel distributed computing paradigm that utilizes coding theory to smartly inject and leverage data/computation redundancy into distributed computing systems, achieving optimal tradeoffs between key resources like computation, bandwidth and storage, and significantly mitigating the fundamental performance bottlenecks for running large-scale data analytics. ❧ We consider the widely used MapReduce distributed computing framework, for which the input files distributedly stored on computing nodes are first mapped into intermediate values, then the computed intermediate values are shuffled between the nodes to be reduced to final output results. For such a framework, we characterize a fundamental inversely proportional tradeoff between computation load in the Map phase and the communication load in the Shuffle phase. We propose a coded scheme, named ""coded distributed computing'' (CDC), to achieve this tradeoff. CDC performs redundant Map computations across r nodes following a particular structure, enabling coding opportunities to create coded packets that are simultaneously useful for r nodes, and hence reducing the communication load by r times. For computation tasks with particular algebraic structures, e.g., multi-stage computations and computations with linear aggregations, we further optimize the CDC scheme to achieve additional reduction in the bandwidth consumption. We illustrate the practical impact of CDC by developing and evaluating a novel distributed sorting algorithm, named CodedTeraSort, by integrating the coding techniques of CDC into a commonly used Hadoop sorting benchmark TeraSort. On Amazon EC2 clusters, we empirically demonstrate a 3.39x speedup over the benchmark, by optimally trading extra computations for bandwidth consumption using CodedTeraSort. ❧ Beyond cloud computing systems, we extend the idea of CDC into the mobile edge computing environment, where many resource-poor mobile users scattered at the network edge collaborate to meet their computational needs by locally processing partial data and shuffling intermediate computation results via a wireless access point. For this edge computing model, we apply the CDC scheme to develop a scalable wireless distributed computing platform that can accommodate an unlimited number of mobile users with constant bandwidth requirement, via communicating coded packets that are simultaneously useful for multiple users. This platform provides a promising solution to enable efficient collaborative edge/fog computing for Internet-of-Things (IoT) devices, by substantially alleviating the heavy load of communication between a huge number of devices exerted on the underlying wireless channels with limited bandwidth. For another mobile edge computing model in which mobile users offload their computation tasks to the edge servers and receive the computation results from the servers, we propose a universal coded edge computing (UCEC) architecture to simultaneously minimize the load of computation at the edge servers, and maximize the physical-layer communication efficiency towards the mobile users. In the proposed UCEC architecture, edge servers create coded inputs of the users, from which they compute coded output results. Then, the edge servers utilize the computed coded results to create communication messages that zero-force all the interference signals over the air at each user. Specifically, the proposed scheme is universal since the coded computations performed at the edge nodes are oblivious of the channel states during the communication process from the edge nodes to the users. ❧ Finally, motivated by the idea of utilizing error correcting codes to handle the performance bottleneck caused by straggling servers, whose computation and/or communication is much slower than the other servers, we study the problem of designing optimal coding schemes to speed up MapReduce jobs run over heterogeneous computing clusters, in which some of the servers may be stragglers. For this setting, we propose a unified coding scheme that organically superimposes the proposed CDC scheme on top of Maximum-Distance-Separable (MDS) codes on individual tasks, allowing a flexible tradeoff between the computation latency limited by the stragglers in the Map phase and the communication load between the surviving nodes in the Shuffle phase.
创建时间:
2024-01-31



