Publications
2024
- ACM MM’24 (Oral). FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model. Ruibin Li, Jingcai Guo, Qihua Zhou, and Song Guo. In Proceedings of the ACM International Conference on Multimedia (ACM-MM, Oral, CCF-A), Melbourne, Australia, Jul 2024.
This paper provides an efficient training-free painterly image harmonization (PIH) method, dubbed FreePIH, that leverages only a pre-trained diffusion model to achieve state-of-the-art harmonization results. Unlike existing methods that require training auxiliary networks, fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. To guarantee the fidelity of the harmonized image, we make use of multi-scale features to enforce the consistency of the content and the stability of the foreground objects in the latent space, while aligning both the foreground and background with the same style. Moreover, to accommodate the generation with more structural and textural details, we further integrate text prompts to attend to the latent features, hence improving the generation quality. Quantitative and qualitative evaluations on the COCO and LAION 5B datasets demonstrate that our method surpasses representative baselines by large margins.
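To make the training-free idea concrete, here is a minimal PyTorch sketch under stated assumptions: `denoise_step(z, t)` is a hypothetical placeholder for a pre-trained diffusion denoiser operating on latents, and the hyperparameters are illustrative, not the authors' interface. It shows the gist only: paste the foreground latent, add Gaussian noise, and run just the last few denoising steps, which is where the paper locates stylistic information.

```python
import torch

def harmonize_composite(bg_latent, fg_latent, mask, denoise_step,
                        num_last_steps=5, noise_scale=0.4):
    """Sketch: paste the foreground latent onto the background latent, add
    Gaussian noise, and run only the last few denoising steps so a pre-trained
    diffusion prior re-styles the foreground without any training."""
    composite = mask * fg_latent + (1.0 - mask) * bg_latent     # latent-space paste
    z = composite + noise_scale * torch.randn_like(composite)   # Gaussian augmentation
    for t in reversed(range(num_last_steps)):                    # style lives in the last steps
        z = denoise_step(z, t)
        z = mask * z + (1.0 - mask) * composite                  # keep background content fixed
    return z

# Toy call with a dummy denoiser, only to show the interface shape.
bg, fg = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
m = torch.zeros(1, 1, 64, 64)
m[..., 16:48, 16:48] = 1.0
print(harmonize_composite(bg, fg, m, denoise_step=lambda z, t: 0.98 * z).shape)
```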
- IJCAI’24. ParsNets: A Parsimonious Orthogonal and Low-Rank Linear Networks for Zero-Shot Learning. Jingcai Guo, Qihua Zhou, Ruibing Li, Xiaocheng Lu, Ziming Liu, Junyang Chen, Xin Xie, and Jie Zhang. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI, CCF-A), Jeju, Apr 2024.
This paper provides a novel parsimonious yet efficient design for zero-shot learning (ZSL), dubbed ParsNets, in which we are interested in learning a composition of on-device-friendly linear networks, each with orthogonality and low-rankness properties, to achieve equivalent or better performance against deep models. Concretely, we first refactor the core module of ZSL, i.e., the visual-semantics mapping function, into several base linear networks that correspond to diverse components of the semantic space, wherein the complex nonlinearity can be collapsed into simple local linearities. Then, to facilitate the generalization of local linearities, we construct a maximal-margin geometry on the learned features by enforcing low-rank constraints on intra-class samples and high-rank constraints on inter-class samples, resulting in orthogonal subspaces for different classes. To enhance the model’s adaptability and counterbalance over-/under-fitting, a set of sample-wise indicators is employed to select a sparse subset of these base linear networks to form a composite semantic predictor for each sample. Notably, the maximal-margin geometry guarantees the diversity of features, while the local linearities guarantee efficiency. Thus, ParsNets generalizes better to unseen classes and can be deployed flexibly on resource-constrained devices.
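A minimal sketch of the two ingredients described above, assuming illustrative dimensions (the class names, `top_k`, and the nuclear-norm surrogate are assumptions, not the released ParsNets code): a bank of base linear maps with a sample-wise sparse selector, and a rank-based regularizer that pushes intra-class features toward low rank and the whole batch toward higher rank.

```python
import torch
import torch.nn as nn

class ParsNetsSketch(nn.Module):
    """Sketch: K base linear maps from visual features to the semantic space,
    with a sample-wise indicator that sparsely selects which maps form the
    composite semantic predictor for each sample."""
    def __init__(self, vis_dim=2048, sem_dim=312, num_bases=8, top_k=3):
        super().__init__()
        self.bases = nn.ModuleList(nn.Linear(vis_dim, sem_dim, bias=False)
                                   for _ in range(num_bases))
        self.indicator = nn.Linear(vis_dim, num_bases)   # sample-wise selector
        self.top_k = top_k

    def forward(self, x):
        scores = self.indicator(x)                                    # (B, K)
        topk = scores.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, topk.indices,
                                                topk.values.softmax(-1))
        preds = torch.stack([w(x) for w in self.bases], dim=1)        # (B, K, sem_dim)
        return (gate.unsqueeze(-1) * preds).sum(dim=1)                # sparse composition

def rank_regularizer(features, labels):
    """Nuclear norm as a rank surrogate: small within each class (low rank),
    large over the whole batch (inter-class features spread across subspaces)."""
    intra = sum(torch.linalg.matrix_norm(features[labels == c], ord="nuc")
                for c in labels.unique())
    inter = torch.linalg.matrix_norm(features, ord="nuc")
    return intra - inter
```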
- IEEE TPAMI’24. PASS: Patch Automatic Skip Scheme for Efficient On-Device Video Perception. Qihua Zhou, Song Guo, Jun Pan, Jiacheng Liang, Jingcai Guo, Zhenda Xu, and Jingren Zhou. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, CCF-A, IF=23.6), Jan 2024.
Real-time video perception tasks are often challenging on resource-constrained edge devices due to the issues of accuracy drop and hardware overhead, where saving computations is the key to performance improvement. Existing methods either rely on domain-specific neural chips or priorly searched models, which require specialized optimization according to different task properties. These limitations motivate us to design a general and task-independent methodology, called Patch Automatic Skip Scheme (PASS), which supports diverse video perception settings by decoupling acceleration and tasks. The gist is to capture inter-frame correlations and skip redundant computations at the patch level, where a patch is a non-overlapping square block of the visual frame. PASS equips each convolution layer with a learnable gate to selectively determine which patches can be safely skipped without degrading model accuracy. Specifically, we are the first to construct a self-supervisory procedure for gate optimization, which learns to extract contrastive representations from frame sequences. The pre-trained gates can serve as plug-and-play modules to implement patch-skippable neural backbones, and automatically generate a proper skip strategy to accelerate different video-based downstream tasks, e.g., outperforming the state-of-the-art MobileHumanPose in 3D pose estimation and FairMOT in multiple object tracking, by up to 9.43× and 12.19× speedups, respectively, on NVIDIA Jetson Nano devices.
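A rough sketch of patch-level skipping for one convolution layer, assuming a simple threshold gate in place of the paper's learned, self-supervised gate (function name, patch size, and threshold are illustrative assumptions): patches whose input barely changed between frames reuse the cached output of the previous frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_skip_forward(conv, x_cur, x_ref, y_ref, patch=16, thresh=0.1):
    """Sketch: recompute only the patches that changed; skipped patches reuse
    the cached output y_ref from the reference frame."""
    diff = (x_cur - x_ref).abs().mean(dim=1, keepdim=True)       # (B,1,H,W) change map
    patch_diff = F.avg_pool2d(diff, kernel_size=patch)            # per-patch change score
    keep = (patch_diff > thresh).float()                          # 1 = recompute this patch
    y_cur = conv(x_cur)                                           # dense here for clarity; a real
                                                                  # kernel computes kept patches only
    keep_up = F.interpolate(keep, size=y_cur.shape[-2:], mode="nearest")
    return keep_up * y_cur + (1.0 - keep_up) * y_ref              # skipped patches reuse y_ref

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x_ref = torch.randn(1, 3, 128, 128)
x_cur = x_ref + 0.5 * torch.randn_like(x_ref) * (torch.rand(1, 1, 128, 128) > 0.9)
print(patch_skip_forward(conv, x_cur, x_ref, conv(x_ref)).shape)  # (1, 8, 128, 128)
```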
- IEEE TMC’24. Chiron: A Robustness-Aware Incentive Scheme for Edge Learning Via Hierarchical Reinforcement Learning. Yi Liu, Song Guo, Yufeng Zhan, Leijie Wu, Zicong Hong, and Qihua Zhou. IEEE Transactions on Mobile Computing (TMC, CCF-A, IF=7.9), Jan 2024.
Over the past few years, edge learning has achieved significant success in mobile edge networks. However, only a few works have designed incentive mechanisms that motivate edge nodes to participate in edge learning, and most of them consider only myopic optimization and assume that all edge nodes are honest, which lacks long-term sustainability and assurance of final performance. In this paper, we propose Chiron, an incentive-driven Byzantine-resistant long-term mechanism based on hierarchical reinforcement learning (HRL). First, our optimization goal includes both learning-algorithm performance criteria (i.e., global accuracy) and systematical criteria (i.e., resource consumption), which aim to improve the edge learning performance under a given resource budget. Second, we propose a three-layer HRL architecture to handle long-term optimization, short-term optimization, and Byzantine resistance, respectively. Finally, we conduct experiments on various edge learning tasks to demonstrate the superiority of the proposed approach. Specifically, our system successfully excludes malicious and lazy nodes from edge learning participation and achieves 14.96% higher accuracy and 12.66% higher total utility than the state-of-the-art methods under the same budget limit.
- AAAI’24. On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks. Qihua Zhou, Jingcai Guo, Song Guo, Ruibin Li, Jie Zhang, Bingjie Wang, and Zhenda Xu. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), Vancouver, Canada, Feb 2024.
The explosive growth of video traffic on today’s Internet promotes the rise of Neural-enhanced Video Streaming (NeVS), which effectively improves the rate-distortion trade-off by employing a cheap neural super-resolution model for quality enhancement on the receiver side. Overlooked by existing work, we reveal that the NeVS pipeline may suffer from a practical threat, where the crucial codec component (i.e., encoder for compression and decoder for restoration) can trigger adversarial attacks in a man-in-the-middle manner to significantly destroy video recovery performance and finally incur the malfunction of downstream video perception tasks. In this paper, we make the first attempt to inspect the vulnerability of NeVS and discover a novel adversarial attack, called codec hijacking, where the injected invisible perturbation conspires with the malicious encoding matrix by reorganizing the spatial-temporal bit allocation within the bitstream size budget. Such a zero-day vulnerability makes our attack hard to defend against because there is no visual distortion on the recovered videos until the attack happens. More seriously, this attack can be extended to diverse enhancement models, thus exposing a wide range of video perception tasks to the threat. Evaluation on a state-of-the-art video codec benchmark illustrates that our attack significantly degrades the recovery performance of NeVS over previous attack methods. The damaged video quality finally leads to obvious malfunction of downstream tasks with over a 75% success rate. We hope to arouse public attention to codec hijacking and its defence.
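As a much-simplified stand-in for the codec-hijacking attack (which conspires with a malicious encoding matrix inside the codec), the sketch below uses plain projected gradient descent against a receiver-side super-resolution model; `sr_model`, the budget `eps`, and the toy network are assumptions for illustration only, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def invisible_perturbation(sr_model, lr_frame, target, steps=10, eps=2 / 255, alpha=0.5 / 255):
    """Sketch: a PGD perturbation bounded by `eps` (so it stays visually
    invisible on the low-resolution frame) that maximizes the reconstruction
    error of the receiver-side super-resolution output."""
    delta = torch.zeros_like(lr_frame, requires_grad=True)
    for _ in range(steps):
        loss = -F.mse_loss(sr_model(lr_frame + delta), target)  # negate to maximize error
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                   # gradient step on the perturbation
            delta.clamp_(-eps, eps)                              # invisibility budget
            delta.grad.zero_()
    return (lr_frame + delta).detach().clamp(0, 1)

# Toy receiver-side enhancer, just to make the sketch runnable.
sr_model = torch.nn.Sequential(torch.nn.Upsample(scale_factor=2),
                               torch.nn.Conv2d(3, 3, 3, padding=1))
lr = torch.rand(1, 3, 32, 32)
adv = invisible_perturbation(sr_model, lr, target=sr_model(lr).detach())
print((adv - lr).abs().max())   # stays within the eps budget
```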
2023
- IEEE TMC’23. Tree Learning: Towards Promoting Coordination in Scalable Multi-Client Training Acceleration. Tao Guo, Song Guo, Feijie Wu, Wenchao Xu, Jiewei Zhang, Qihua Zhou, Quan Chen, and Weihua Zhuang. IEEE Transactions on Mobile Computing (TMC, CCF-A, IF=7.9), Mar 2023.
Iteration-based collaborative learning (CL) paradigms, such as federated learning (FL) and split learning (SL), face challenges in training neural models over the rapidly growing yet resource-constrained edge devices. Such devices have difficulty in accommodating a full-size large model for FL or affording an excessive waiting time for the mandatory synchronization step in SL. To deal with this challenge, we propose a novel CL framework that adopts a tree-aggregation structure with an adaptive partition and ensemble strategy to achieve optimal synchronization and fast convergence at scale. To find the optimal split point for heterogeneous clients, we also design a novel partitioning algorithm that minimizes idleness during communication and achieves optimal synchronization between clients. In addition, a parallelism paradigm is proposed to unleash the potential of optimum synchronization between the clients and the server to boost the distributed training process without losing model accuracy for edge devices. Furthermore, we theoretically prove that our framework achieves a better convergence rate than state-of-the-art CL paradigms. We conduct extensive experiments and show that our framework is 4.6× faster in training speed compared with traditional methods, without compromising training accuracy.
- AAAI’23. PASS: Patch Automatic Skip Scheme for Efficient Real-Time Video Perception on Edge Devices. Qihua Zhou, Song Guo, Jun Pan, Jiacheng Liang, Zhenda Xu, and Jingren Zhou. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), Washington, DC, USA, Feb 2023.
Real-time video perception tasks are often challenging on resource-constrained edge devices due to the concerns of accuracy drop and hardware overhead, where saving computations is the key to performance improvement. Existing methods either rely on domain-specific neural chips or priorly searched models, which require specialized optimization according to different task properties. In this work, we propose a general and task-independent Patch Automatic Skip Scheme (PASS), a novel end-to-end learning pipeline that supports diverse video perception settings by decoupling acceleration and tasks. The gist is to capture the temporal similarity across video frames and skip the redundant computations at the patch level, where a patch is a non-overlapping square block of the visual frame. PASS equips each convolution layer with a learnable gate to selectively determine which patches can be safely skipped without degrading model accuracy. For each layer, a desired gate needs to make flexible skip decisions based on intermediate features without any annotations, which cannot be achieved by the conventional supervised learning paradigm. To address this challenge, we are the first to construct a tough self-supervisory procedure for optimizing these gates, which learns to extract contrastive representations, i.e., distinguishing similarity and difference, from frame sequences. These high-capacity gates can serve as a plug-and-play module for convolutional neural network (CNN) backbones to implement patch-skippable architectures, and automatically generate a proper skip strategy to accelerate different video-based downstream tasks, e.g., outperforming the state-of-the-art MobileHumanPose (MHP) in 3D pose estimation and FairMOT in multiple object tracking, by up to 9.43× and 12.19× speedups, respectively. By directly processing the raw data of frames, PASS generalizes to real-time video streams on commodity edge devices, e.g., NVIDIA Jetson Nano and mobile phones, and achieves efficient performance in realistic deployment.
- AAAI’23. Graph Knows Unknowns: Reformulate Zero-Shot Learning as Sample-Level Graph Recognition. Jingcai Guo, Song Guo, Qihua Zhou, Ziming Liu, Xiaocheng Lu, and Fushuo Huo. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), Washington, DC, USA, Feb 2023.
Zero-shot learning (ZSL) is an extreme case of transfer learning that aims to recognize samples (e.g., images) of unseen classes relying on a train set covering only seen classes and a set of auxiliary knowledge (e.g., semantic descriptors). Existing methods usually resort to constructing a visual-to-semantics mapping based on features extracted from each whole sample. However, since the visual and semantic spaces are inherently independent and may exist in different manifolds, these methods may easily suffer from the domain bias problem due to the knowledge transfer from seen to unseen classes. Unlike existing works, this paper investigates fine-grained ZSL from the novel perspective of the sample-level graph. Specifically, we decompose an input into several fine-grained elements and construct a graph structure per sample to measure and utilize element-granularity relations within each sample. Taking advantage of recently developed graph neural networks (GNNs), we formulate the ZSL problem as a graph-to-semantics mapping task, which can better exploit element-semantics correlation and local sub-structural information in samples. Experimental results on widely used benchmark datasets demonstrate that the proposed method can mitigate the domain bias problem and achieve competitive performance against other representative methods.
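The following is a compact sketch of the sample-level-graph idea under assumptions (the kNN graph, a single message-passing round, and the dimensions are illustrative choices, not the published model): element features of one sample become graph nodes, edges come from feature similarity, and the pooled graph representation is mapped to the semantic space.

```python
import torch
import torch.nn as nn

class SampleGraphZSL(nn.Module):
    """Sketch: per-sample graph over fine-grained elements, one round of
    message passing, then a graph-to-semantics (attribute) mapping."""
    def __init__(self, elem_dim=512, sem_dim=312, k=4):
        super().__init__()
        self.msg = nn.Linear(elem_dim, elem_dim)
        self.to_sem = nn.Linear(elem_dim, sem_dim)
        self.k = k

    def forward(self, elements):                       # (B, N, elem_dim) element features
        dist = torch.cdist(elements, elements)         # pairwise distances between elements
        adj = torch.zeros_like(dist)
        idx = dist.topk(self.k, largest=False).indices # k nearest elements per node
        adj.scatter_(-1, idx, 1.0)
        h = torch.relu(adj @ self.msg(elements) / self.k) + elements   # message passing + residual
        return self.to_sem(h.mean(dim=1))              # pooled graph -> semantic descriptor

elements = torch.randn(2, 16, 512)                     # 2 samples, 16 elements each
print(SampleGraphZSL()(elements).shape)                # torch.Size([2, 312])
```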
2022
- NeurIPS’22. Hierarchical Channel-spatial Encoding for Communication-efficient Collaborative Learning. Qihua Zhou, Song Guo, Yi Liu, Jie Zhang, Jiewei Zhang, Tao Guo, Zhenda Xu, Xun Liu, and Zhihao Qu. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS, CCF-A), New Orleans, USA, Nov 2022.
Collaborative learning (CL) systems often face the performance bottleneck of limited bandwidth, where multiple low-end devices continuously generate data and transmit intermediate features to the cloud for incremental training. To this end, improving the communication efficiency by reducing traffic size is one of the most crucial issues for realistic deployment. Existing systems mostly compress features at the pixel level and ignore the characteristics of feature structure, which could be further exploited for more efficient compression. In this paper, we take new insights into implementing scalable CL systems through a hierarchical compression on features, termed Stripe-wise Group Quantization (SGQ). Different from previous unstructured quantization methods, SGQ captures both channel and spatial similarity in pixels, and simultaneously encodes features at these two levels to gain a much higher compression ratio. In particular, we refactor the feature structure based on inter-channel similarity and bound the gradient deviation caused by quantization, in the forward and backward passes, respectively. Such a double-stage pipeline makes SGQ hold a sublinear convergence order, as in vanilla SGD-based optimization. Extensive experiments show that SGQ achieves a traffic reduction ratio of up to 15.97× and provides a 9.22× image processing speedup over uniform quantized training, while preserving model accuracy comparable to FP32, even using 4-bit quantization. This verifies that SGQ can be applied to a wide spectrum of edge intelligence applications.
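A simplified reading of the grouped-quantization step, not the released SGQ implementation: channels are grouped by a similarity proxy (here their mean activation, an assumption) so that similar channels share one quantization scale, which is the intuition behind exploiting inter-channel structure for higher compression.

```python
import torch

def stripe_wise_group_quantize(feat, num_groups=4, bits=4):
    """Sketch: group channels by similarity and quantize each channel stripe
    with a shared uniform scale; returns the dequantized view plus the scales."""
    order = feat.mean(dim=(0, 2, 3)).argsort()        # sort channels by a similarity proxy
    groups = order.chunk(num_groups)                   # channel stripes
    q_feat = torch.empty_like(feat)
    levels = 2 ** bits - 1
    scales = []
    for g in groups:
        block = feat[:, g]                             # one channel stripe
        lo, hi = block.min(), block.max()
        scale = (hi - lo).clamp(min=1e-8) / levels
        q = ((block - lo) / scale).round()             # shared uniform quantization
        q_feat[:, g] = q * scale + lo                  # dequantized view for the sketch
        scales.append(scale)
    return q_feat, scales

feat = torch.randn(2, 32, 16, 16)
deq, _ = stripe_wise_group_quantize(feat)
print((deq - feat).abs().mean())                       # small quantization error
```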
2021
- IEEE TC Spotlight. A Comprehensive Inspection of the Straggler Problem. Qihua Zhou, Song Guo, Haodong Lu, Li Li, Minyi Guo, Yanfei Sun, and Kun Wang. The Spotlight of IEEE Transactions on Computers, Sep 2021.
The parameter server is a popular distributed processing paradigm for operating distributed deep learning (DL) applications. As a growing number of DL models are trained via shared clusters, machines confront a heterogeneous environment, which incurs an unexpected phenomenon called the straggler, i.e., a task with an abnormally slow processing speed. Addressing stragglers is a crucial issue in distributed DL applications, since stragglers significantly hamper system performance. While many techniques have been deployed to mitigate stragglers, they may not achieve their goals in the presence of heterogeneity, where systems take much longer to reach DL training convergence than in a homogeneous environment, as evidenced by our experimental study. With the methodology of straggler projection and the abstraction of parallelism, a new synchronization mechanism called Elastic Parallelism Synchronous Parallel (EPSP) is proposed, which exploits the iteration-acceleration advantage of stale synchronous parallel and overcomes the barrier waiting time of bulk synchronous parallel. More precisely, EPSP supports both enforced and slack synchronization by adjusting the staleness parameter.
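To illustrate the staleness knob that moves between enforced and slack synchronization, here is a tiny sketch (function names and the example values are hypothetical, not the paper's implementation): a worker may run ahead of the slowest worker by at most `staleness` iterations.

```python
def must_wait(worker_iter, slowest_iter, staleness):
    """staleness=0 gives enforced (BSP-like) lockstep; a larger value gives
    slack (SSP-like) synchronization where fast workers run ahead."""
    return worker_iter - slowest_iter > staleness

def try_advance(worker_iters, worker_id, staleness):
    """Advance one worker's local iteration if the staleness bound allows it."""
    if must_wait(worker_iters[worker_id], min(worker_iters), staleness):
        return False                      # blocked: must wait for the straggler
    worker_iters[worker_id] += 1          # proceed with the next local iteration
    return True

# Example: worker 2 is a straggler; with staleness=2 the others may run
# at most two iterations ahead of it before they block.
iters = [3, 3, 1]
print(try_advance(iters, 0, staleness=2))   # True  -> iters becomes [4, 3, 1]
print(try_advance(iters, 0, staleness=2))   # False -> already 3 iterations ahead
```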
- IEEE IoTJ’21. On-Device Learning Systems for Edge Intelligence: A Software and Hardware Synergy Perspective. Qihua Zhou, Zhihao Qu, Song Guo, Boyuan Luo, Jingcai Guo, Zhenda Xu, and Rajendra Akerkar. IEEE Internet of Things Journal (IoTJ, JCR-Q1, IF=10.6), Aug 2021.
Modern machine learning (ML) applications are often deployed in the cloud environment to exploit the computational power of clusters. However, this in-cloud computing scheme cannot satisfy the demands of emerging edge intelligence scenarios, including providing personalized models, protecting user privacy, adapting to real-time tasks, and saving resource cost. To conquer the limitations of conventional in-cloud computing, on-device learning has arisen, which keeps the end-to-end ML procedure entirely on user devices, without unnecessary involvement of the cloud. Despite the promising advantages of on-device learning, implementing a high-performance on-device learning system still faces many severe challenges, such as insufficient user training data, backward propagation (BP) blocking, and limited peak processing speed. Observing the substantial improvement space in the implementation and acceleration of on-device learning systems, we present a comprehensive analysis of the latest research progress and point out potential optimization directions from the system perspective. This survey presents a software and hardware synergy of on-device learning techniques, covering the scope of model-level neural network design, algorithm-level training optimization, and hardware-level instruction acceleration. We hope this survey can bring fruitful discussions and inspire researchers to further promote the field of edge intelligence.
- USENIX ATC’21. Octo: INT8 Training with Loss-aware Compensation and Backward Quantization for Tiny On-device Learning. Qihua Zhou, Song Guo, Zhihao Qu, Jingcai Guo, Zhenda Xu, Jiewei Zhang, Tao Guo, Boyuan Luo, and Jingren Zhou. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC, CCF-A), Virtual Event, Jul 2021.
On-device learning is an emerging technique to pave the last mile of enabling edge intelligence, eliminating the limitations of conventional in-cloud computing, which demands substantial computational capacity and memory. A high-performance on-device learning system requires breaking the constraints of limited resources and alleviating computational overhead. In this paper, we show that employing 8-bit fixed-point (INT8) quantization in both the forward and backward passes over a deep model is a promising way to enable tiny on-device learning in practice. The key to an efficient quantization-aware training method is to exploit hardware-level enabled acceleration while preserving the training quality in each layer. However, off-the-shelf quantization methods cannot handle the on-device learning paradigm of fixed-point processing. To overcome these challenges, we propose a novel INT8 training method, which optimizes the computation of the forward and backward passes via the delicately designed Loss-aware Compensation (LAC) and Parameterized Range Clipping (PRC), respectively. Specifically, we build a new network component, the compensation layer, to automatically counteract the quantization error of tensor arithmetic. We implement our method in Octo, a lightweight cross-platform system for tiny on-device learning. Evaluation on commercial AI chips shows that Octo holds higher training efficiency over state-of-the-art quantization training methods, while achieving adequate processing speedup and memory reduction over full-precision training.
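A minimal sketch of symmetric INT8 fake-quantization with a parameterized clipping range, as a simplified reading of the range-clipping idea (the function names and the error term standing in for the compensation layer are assumptions, not the Octo code).

```python
import torch

def int8_quantize(x, clip_range):
    """Sketch: clip to [-clip_range, clip_range] and map to signed INT8."""
    scale = clip_range / 127.0
    q = (x.clamp(-clip_range, clip_range) / scale).round().to(torch.int8)
    return q, scale

def int8_dequantize(q, scale):
    return q.float() * scale

def quantization_error(x, clip_range):
    """The residual a compensation layer could learn to counteract, in the
    spirit of loss-aware compensation."""
    q, scale = int8_quantize(x, clip_range)
    return x - int8_dequantize(q, scale)

x = torch.randn(4, 8)
q, scale = int8_quantize(x, clip_range=3.0)
print(q.dtype, quantization_error(x, 3.0).abs().max())   # torch.int8, small residual
```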
- IEEE TPDS’21. Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization. Qihua Zhou, Song Guo, Zhihao Qu, Peng Li, Li Li, Minyi Guo, and Kun Wang. IEEE Transactions on Parallel and Distributed Systems (TPDS, CCF-A, IF=5.3), May 2021.
The parameter server (PS) paradigm has achieved great success in deploying large-scale distributed Deep Learning (DL) systems. However, these systems implicitly assume that the cluster is homogeneous, and this assumption does not hold in many real-world cases. Although previous efforts have been made to address heterogeneity, they mainly prioritize the contribution of fast workers and reduce the involvement of slow workers, resulting in the limitations of workload imbalance and computation inefficiency. We reveal that grouping workers into communities, an abstraction proposed by us, and handling parameter synchronization at the community level can conquer these limitations and accelerate training convergence. The inspiration for the community comes from our exploration of prior knowledge about the similarity between workers, which is often neglected by previous work. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CASP), which uses an Asynchronous Advantage Actor-Critic (A3C)-based algorithm to intelligently determine the community configuration and fully improve the synchronization performance. The whole idea has been implemented in a prototype system called Petrel that achieves a good balance between convergence efficiency and communication overhead. The evaluation under various benchmarks with multiple metrics and baseline comparison demonstrates the effectiveness of Petrel. Specifically, Petrel accelerates the training convergence speed by up to 1.87× and reduces communication traffic by up to 26.85 percent on average over non-community synchronization mechanisms.
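To illustrate the community abstraction, here is a plain similarity grouping standing in for the paper's A3C-driven configuration search (the grouping rule and the example speeds are assumptions): workers with similar iteration speeds are synchronized together so fast communities are not blocked by slow ones.

```python
def group_workers_into_communities(speeds, num_communities=3):
    """Sketch: rank workers by speed and cut them into equally sized
    communities; each community synchronizes internally."""
    ranked = sorted(range(len(speeds)), key=lambda w: speeds[w])
    size = -(-len(ranked) // num_communities)          # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

# Example: 6 workers with heterogeneous speeds (iterations per second).
communities = group_workers_into_communities([1.0, 0.9, 3.1, 2.8, 5.0, 5.2])
print(communities)   # [[1, 0], [3, 2], [4, 5]]
```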
- IEEE TPDS’21. Canary: Decentralized Distributed Deep Learning Via Gradient Sketch and Partition in Multi-Interface Networks. Qihua Zhou, Kun Wang, Haodong Lu, Wenyao Xu, Yanfei Sun, and Song Guo. IEEE Transactions on Parallel and Distributed Systems (TPDS, CCF-A, IF=5.3), Apr 2021.
Multi-interface networks are efficient infrastructures for deploying distributed Deep Learning (DL) tasks, as the model gradients generated by each worker can be exchanged with others via different links in parallel. Although this decentralized parameter synchronization mechanism can reduce the time of gradient exchange, building a high-performance distributed DL architecture still requires balancing communication efficiency and computational utilization, i.e., addressing the issues of traffic burst, data consistency, and programming convenience. To achieve this goal, we asynchronously exchange gradient pieces without central control in multi-interface networks. We propose Piece-level Gradient Exchange and Multi-interface Collective Communication to handle parameter synchronization and traffic transmission, respectively. Specifically, we design a gradient sketch approach based on 8-bit uniform quantization to compress gradient tensors and introduce the colayer abstraction to better handle gradient partition, exchange, and pipelining. Also, we provide general programming interfaces to capture the synchronization semantics and build the Gradient Exchange Index (GEI) data structures to make our approach applicable online. We implement our algorithms in a prototype system called Canary using PyTorch-1.4.0. Experiments conducted on Alibaba Cloud demonstrate that Canary reduces traffic by 56.28 percent on average and completes training up to 1.61×, 2.28×, and 2.84× faster than BML, Ako on PyTorch, and PS on TensorFlow, respectively.
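The sketch below shows the two mechanical pieces named above in simplified form, 8-bit uniform quantization of a gradient tensor plus a piece-level split across interfaces, while omitting the colayer and GEI bookkeeping; the function names and shapes are illustrative assumptions.

```python
import numpy as np

def sketch_and_partition(grad, num_interfaces=2, bits=8):
    """Sketch: uniform 8-bit quantization of the gradient, then split the
    quantized bytes into one piece per network interface."""
    lo, hi = grad.min(), grad.max()
    scale = float(hi - lo) / (2 ** bits - 1)
    scale = scale if scale > 0 else 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)     # 8-bit uniform quantization
    pieces = np.array_split(q.ravel(), num_interfaces)     # one piece per interface/link
    return pieces, (float(lo), scale)

def reassemble(pieces, meta, shape):
    lo, scale = meta
    q = np.concatenate(pieces).reshape(shape)
    return q.astype(np.float32) * scale + lo               # dequantize on the receiver

grad = np.random.randn(4, 8).astype(np.float32)
pieces, meta = sketch_and_partition(grad)
print(np.abs(reassemble(pieces, meta, grad.shape) - grad).max())  # small quantization error
```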
- IEEE TC’21. Falcon: Addressing Stragglers in Heterogeneous Parameter Server Via Multiple Parallelism. Qihua Zhou, Song Guo, Haodong Lu, Li Li, Minyi Guo, Yanfei Sun, and Kun Wang. IEEE Transactions on Computers (TC, CCF-A, IF=3.7), Jan 2021.
The parameter server architecture has shown promising performance advantages when handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retard DL training progress. Previous solutions for solving stragglers may not fully exploit the computation resources of the cluster, as evidenced by our experiments, especially in the heterogeneous environment. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines to solve this problem in two aspects: (1) controlling each worker’s training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation under various benchmarks with baseline comparison demonstrates the superiority of our system. Specifically, Falcon reduces the training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
2020
- IEEE MPCE’20. Learning-based Green Workload Placement for Energy Internet in Smart Cities. Qihua Zhou, Yanfei Sun, Haodong Lu, and Kun Wang. IEEE Journal of Modern Power Systems and Clean Energy (MPCE, JCR-Q1, IF=6.3), Dec 2020.
The Energy Internet is a fundamental infrastructure for deploying green city applications, where energy saving and job acceleration are two critical issues to address. In contrast to existing approaches that focus on static metrics under the assumption of complete prior knowledge of resource information, this paper accounts for both application-level properties and energy-level requirements by jointly considering energy saving and job acceleration during job runtime. Considering the online environment of smart city applications, the main objective is formulated as an optimization problem over model partition and function assignment. To minimize energy cost and job completion time together, a green workload placement approach is proposed using the multi-action deep reinforcement learning method. Evaluations with real-world applications demonstrate the superiority of this method over state-of-the-art methods.
- IEEE ICDCS’20. Petrel: Community-aware Synchronous Parallel for Heterogeneous Parameter Server. Qihua Zhou, Song Guo, Peng Li, Yanfei Sun, Li Li, Minyi Guo, and Kun Wang. In Proceedings of the 40th IEEE International Conference on Distributed Computing Systems (ICDCS, CCF-B), Singapore, Nov 2020.
To address the impact of heterogeneity in distributed Deep Learning (DL) systems, most previous approaches focus on prioritizing the contribution of fast workers and reducing the involvement of slow workers, incurring the limitations of workload imbalance and computation inefficiency. We reveal that grouping workers into communities, an abstraction proposed by us, and handling parameter synchronization at the community level can conquer these limitations and accelerate training convergence. The inspiration for the community comes from our exploration of prior knowledge about the similarity between workers, which is often neglected by previous work. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CSP), which uses Asynchronous Advantage Actor-Critic (A3C), a Reinforcement Learning (RL) based algorithm, to intelligently determine the community configuration and fully improve the synchronization performance. The whole idea has been implemented in a system called Petrel that achieves a good balance between convergence efficiency and communication overhead. The evaluation under different benchmarks demonstrates that our approach can effectively accelerate the training convergence speed and reduce synchronization traffic.
- ACM MM’20. Dual-view Attention Networks for Single Image Super-Resolution. Jingcai Guo, Shiheng Ma, Jie Zhang, Qihua Zhou, and Song Guo. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM, CCF-A), Seattle, USA, Oct 2020.
One non-negligible flaw of convolutional neural network (CNN) based single image super-resolution (SISR) models is that most of them cannot restore high-resolution (HR) images containing sufficient high-frequency information. Worse still, as the depth of CNNs increases, training easily suffers from vanishing gradients. These problems hinder the effectiveness of CNNs in SISR. In this paper, we propose the Dual-view Attention Networks to alleviate these problems for SISR. Specifically, we propose the local aware (LA) and global aware (GA) attentions to deal with LR features in unequal manners, which can highlight the high-frequency components and discriminate each feature from LR images in the local and global views, respectively. Furthermore, the local attentive residual-dense (LARD) block, which combines the LA attention with multiple residual and dense connections, is proposed to fit a deeper yet easy-to-train architecture. The experimental results verify the effectiveness of our model compared with other state-of-the-art methods.
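Below is a generic local + global attention pair for SR features, offered only as an illustration of the two-view re-weighting idea and not as the paper's exact LA/GA design; the module name, channel count, and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class DualViewAttentionSketch(nn.Module):
    """Sketch: a spatial (local-view) attention map from a small conv and a
    channel (global-view) attention vector from pooled statistics; both
    re-weight the LR features in unequal, view-specific manners."""
    def __init__(self, channels=64):
        super().__init__()
        self.local_view = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1), nn.Sigmoid())
        self.global_view = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.local_view(x) * self.global_view(x)

feats = torch.randn(1, 64, 32, 32)
print(DualViewAttentionSketch()(feats).shape)   # torch.Size([1, 64, 32, 32])
```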
2019
- IEEE TC’19. Fast Coflow Scheduling via Traffic Compression and Stage Pipelining in Datacenter Networks. Qihua Zhou, Kun Wang, Peng Li, Deze Zeng, Song Guo, Baoliu Ye, and Minyi Guo. IEEE Transactions on Computers (TC, CCF-A, IF=3.7), Dec 2019.
Big data analytics in datacenters often involves scheduling of data-parallel jobs. Traditional scheduling techniques based on improving network resource utilization are subject to the limited bandwidth of datacenter networks. To alleviate the shortage of bandwidth, some cluster frameworks employ traffic compression to reduce transmission consumption. However, they tackle scheduling in a coarse-grained manner at the task level and do not perform well in terms of flow-level metrics due to high complexity. Fortunately, the abstraction of coflow pioneers a new perspective to facilitate scheduling efficiency. In this paper, we introduce a coflow compression mechanism to minimize the completion time in data-intensive applications. Due to the NP-hardness of the problem, we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF) to solve it. For online applicability, FVDF supports stage pipelining to accelerate scheduling and exploits recurrent neural networks (RNNs) to predict compression speed. Meanwhile, we build Swallow, an efficient scheduling system that implements our proposed algorithms. It minimizes coflow completion time (CCT) while guaranteeing resource conservation and starvation freedom. The results of both trace-driven simulations and real experiments show the superiority of our algorithm over existing ones. Specifically, Swallow speeds up CCT and job completion time (JCT) by up to 1.47× and 1.66× on average, respectively, over SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces data traffic by up to 48.41 percent on average.
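A small sketch of a fastest-disposal-first ordering, an illustrative reading of the heuristic rather than the Swallow implementation (the coflow fields and numbers are made-up assumptions): each coflow is ranked by how quickly its remaining post-compression volume can be disposed, and the one that finishes soonest is scheduled first, which shortens average coflow completion time.

```python
def fvdf_order(coflows):
    """Sketch: order coflows by estimated disposal time, smallest first."""
    def disposal_time(cf):
        remaining = cf["volume"] * (1.0 - cf["compression_ratio"])  # volume left after compression
        return remaining / cf["disposal_rate"]                       # time to dispose it
    return sorted(coflows, key=disposal_time)

# Example: volume in GB, disposal_rate in GB/s.
jobs = [
    {"name": "shuffle-A", "volume": 40.0, "compression_ratio": 0.5, "disposal_rate": 2.0},
    {"name": "shuffle-B", "volume": 10.0, "compression_ratio": 0.1, "disposal_rate": 1.0},
]
print([c["name"] for c in fvdf_order(jobs)])   # shuffle-B first (9 s vs 10 s)
```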
- IEEE ICDCS’19. Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server. Qihua Zhou, Kun Wang, Song Guo, Haodong Lu, Li Li, Minyi Guo, and Yanfei Sun. In Proceedings of the 39th IEEE International Conference on Distributed Computing Systems (ICDCS, CCF-B), Dallas, USA, Jul 2019.
The parameter server paradigm has shown great performance superiority for handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retard DL training progress. Previous solutions for solving stragglers may not fully exploit the computation capacity of a cluster, as evidenced by our experiments. This motivates us to make an attempt at building a new parameter server architecture that mitigates and addresses stragglers in heterogeneous DL from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for resolving this problem: (1) reducing straggler emergence frequency via elastic parallelism control and (2) transferring blocked tasks to pioneer workers to fully exploit cluster computation capacity. Following the guidelines, we propose the abstraction of parallelism as an infrastructure and elaborate the Elastic-Parallelism Synchronous Parallel (EPSP) that supports both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which efficiently accelerates DL training progress in the presence of stragglers. Evaluation under various benchmarks with baseline comparison evidences the superiority of our system. Specifically, Falcon yields shorter convergence time, with up to 61.83%, 55.19%, 38.92%, and 23.68% reduction over FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
2018
- IEEE COMST’18. Cluster Frameworks for Efficient Scheduling and Resource Allocation in Data Center Networks: A Survey. Kun Wang, Qihua Zhou, Song Guo, and Jiangtao Luo. IEEE Communications Surveys & Tutorials (COMST, JCR-Q1, IF=35.6), Jul 2018.
Data centers are widely used for big data analytics, which often involve data-parallel jobs, including query and web service. Meanwhile, cluster frameworks are rapidly developed for data-intensive applications in data center networks (DCNs). To promote the performance of these frameworks, many efforts have been made to improve scheduling strategies and resource allocation algorithms. With the deployment of geo-distributed data centers and data-intensive applications, optimization in DCNs has regained pervasive attention in both industry and academia. Many solutions, such as coflow-aware scheduling and speculative execution, have been proposed to meet various requirements. Therefore, we present a solid starting ground and comprehensive overview of this area to help readers quickly understand state-of-the-art technologies and research progress. We observe that algorithms in cluster frameworks are implemented with different guidelines and can be classified according to scheduling granularity, controller management, and prior-knowledge requirement. In addition, mechanisms for conquering crucial challenges in DCNs are discussed, including providing low latency and minimizing job completion time. Moreover, we analyze the desirable properties of fault tolerance and scalability to illuminate the design principles of distributed systems. We hope that this paper will shed light on this promising land and serve as a guide for further research.
- IEEE IPDPS’18. Swallow: Joint Online Scheduling and Coflow Compression in Datacenter Networks. Qihua Zhou, Peng Li, Kun Wang, Deze Zeng, Song Guo, and Minyi Guo. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS, CCF-B), Vancouver, Canada, May 2018.
Big data analytics in datacenters often involves scheduling of data-parallel jobs, which are bottlenecked by the limited bandwidth of datacenter networks. To alleviate the shortage of bandwidth, some existing work has proposed traffic compression to reduce the amount of data transmitted over the network. However, the proposed traffic compression works in a coarse-grained manner at the job level, leaving a large optimization space unexplored for further performance improvement. In this paper, we propose a flow-level traffic compression and scheduling system, called Swallow, to accelerate data-intensive applications. Specifically, we target coflows, an elegant abstraction of the parallel flows generated by big data jobs. With the objective of minimizing coflow completion time (CCT), we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF) and implement Swallow on top of Spark. The results of both trace-driven simulations and real experiments show the superiority of our system over existing algorithms. Swallow reduces CCT and job completion time (JCT) by up to 1.47× and 1.66× on average, respectively, over SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces data traffic by up to 48.41% on average.