Publications
2024
- ACM MM’24 (Oral). FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model. Ruibin Li, Jingcai Guo, Qihua Zhou, and Song Guo. In Proceedings of the ACM International Conference on Multimedia (ACM-MM, Oral, CCF-A), Melbourne, Australia, Jul 2024.
This paper provides an efficient training-free painterly image harmonization (PIH) method, dubbed FreePIH, that leverages only a pre-trained diffusion model to achieve state-of-the-art harmonization results. Unlike existing methods that require training auxiliary networks, fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. To guarantee the fidelity of the harmonized image, we make use of multi-scale features to enforce the consistency of the content and the stability of the foreground objects in the latent space, while aligning both the foreground and background with the same style. Moreover, to accommodate the generation with more structural and textural details, we further integrate text prompts to attend to the latent features, hence improving the generation quality. Quantitative and qualitative evaluations on the COCO and LAION 5B datasets demonstrate that our method surpasses representative baselines by large margins.
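To make the training-free idea concrete, here is a minimal PyTorch sketch under stated assumptions: `denoise_step(z, t)` is a hypothetical placeholder for a pre-trained diffusion denoiser operating on latents, and the hyperparameters are illustrative, not the authors' interface. It shows the gist only: paste the foreground latent, add Gaussian noise, and run just the last few denoising steps, which is where the paper locates stylistic information.

```python
import torch

def harmonize_composite(bg_latent, fg_latent, mask, denoise_step,
                        num_last_steps=5, noise_scale=0.4):
    """Sketch: paste the foreground latent onto the background latent, add
    Gaussian noise, and run only the last few denoising steps so a pre-trained
    diffusion prior re-styles the foreground without any training."""
    composite = mask * fg_latent + (1.0 - mask) * bg_latent     # latent-space paste
    z = composite + noise_scale * torch.randn_like(composite)   # Gaussian augmentation
    for t in reversed(range(num_last_steps)):                    # style lives in the last steps
        z = denoise_step(z, t)
        z = mask * z + (1.0 - mask) * composite                  # keep background content fixed
    return z

# Toy call with a dummy denoiser, only to show the interface shape.
bg, fg = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
m = torch.zeros(1, 1, 64, 64)
m[..., 16:48, 16:48] = 1.0
print(harmonize_composite(bg, fg, m, denoise_step=lambda z, t: 0.98 * z).shape)
```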
- IJCAI’24. ParsNets: A Parsimonious Orthogonal and Low-Rank Linear Networks for Zero-Shot Learning. Jingcai Guo, Qihua Zhou, Ruibing Li, Xiaocheng Lu, Ziming Liu, Junyang Chen, Xin Xie, and Jie Zhang. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI, CCF-A), Jeju, Apr 2024.
This paper provides a novel parsimonious yet efficient design for zero-shot learning (ZSL), dubbed ParsNets, in which we are interested in learning a composition of on-device-friendly linear networks, each with orthogonality and low-rankness properties, to achieve equivalent or better performance against deep models. Concretely, we first refactor the core module of ZSL, i.e., the visual-semantics mapping function, into several base linear networks that correspond to diverse components of the semantic space, wherein the complex nonlinearity can be collapsed into simple local linearities. Then, to facilitate the generalization of local linearities, we construct a maximal-margin geometry on the learned features by enforcing low-rank constraints on intra-class samples and high-rank constraints on inter-class samples, resulting in orthogonal subspaces for different classes. To enhance the model’s adaptability and counterbalance over-/under-fitting, a set of sample-wise indicators is employed to select a sparse subset of these base linear networks to form a composite semantic predictor for each sample. Notably, the maximal-margin geometry guarantees the diversity of features, while the local linearities guarantee efficiency. Thus, ParsNets generalizes better to unseen classes and can be deployed flexibly on resource-constrained devices.
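A minimal sketch of the two ingredients described above, assuming illustrative dimensions (the class names, `top_k`, and the nuclear-norm surrogate are assumptions, not the released ParsNets code): a bank of base linear maps with a sample-wise sparse selector, and a rank-based regularizer that pushes intra-class features toward low rank and the whole batch toward higher rank.

```python
import torch
import torch.nn as nn

class ParsNetsSketch(nn.Module):
    """Sketch: K base linear maps from visual features to the semantic space,
    with a sample-wise indicator that sparsely selects which maps form the
    composite semantic predictor for each sample."""
    def __init__(self, vis_dim=2048, sem_dim=312, num_bases=8, top_k=3):
        super().__init__()
        self.bases = nn.ModuleList(nn.Linear(vis_dim, sem_dim, bias=False)
                                   for _ in range(num_bases))
        self.indicator = nn.Linear(vis_dim, num_bases)   # sample-wise selector
        self.top_k = top_k

    def forward(self, x):
        scores = self.indicator(x)                                    # (B, K)
        topk = scores.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, topk.indices,
                                                topk.values.softmax(-1))
        preds = torch.stack([w(x) for w in self.bases], dim=1)        # (B, K, sem_dim)
        return (gate.unsqueeze(-1) * preds).sum(dim=1)                # sparse composition

def rank_regularizer(features, labels):
    """Nuclear norm as a rank surrogate: small within each class (low rank),
    large over the whole batch (inter-class features spread across subspaces)."""
    intra = sum(torch.linalg.matrix_norm(features[labels == c], ord="nuc")
                for c in labels.unique())
    inter = torch.linalg.matrix_norm(features, ord="nuc")
    return intra - inter
```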
- IEEE TPAMI’24. PASS: Patch Automatic Skip Scheme for Efficient On-Device Video Perception. Qihua Zhou, Song Guo, Jun Pan, Jiacheng Liang, Jingcai Guo, Zhenda Xu, and Jingren Zhou. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, CCF-A, IF=23.6), Jan 2024.
Real-time video perception tasks are often challenging on resource-constrained edge devices due to the issues of accuracy drop and hardware overhead, where saving computations is the key to performance improvement. Existing methods either rely on domain-specific neural chips or priorly searched models, which require specialized optimization according to different task properties. These limitations motivate us to design a general and task-independent methodology, called Patch Automatic Skip Scheme (PASS), which supports diverse video perception settings by decoupling acceleration and tasks. The gist is to capture inter-frame correlations and skip redundant computations at the patch level, where a patch is a non-overlapping square block of the visual frame. PASS equips each convolution layer with a learnable gate to selectively determine which patches can be safely skipped without degrading model accuracy. Specifically, we are the first to construct a self-supervisory procedure for gate optimization, which learns to extract contrastive representations from frame sequences. The pre-trained gates can serve as plug-and-play modules to implement patch-skippable neural backbones, and automatically generate a proper skip strategy to accelerate different video-based downstream tasks, e.g., outperforming the state-of-the-art MobileHumanPose in 3D pose estimation and FairMOT in multiple object tracking, by up to 9.43× and 12.19× speedups, respectively, on NVIDIA Jetson Nano devices.
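A rough sketch of patch-level skipping for one convolution layer, assuming a simple threshold gate in place of the paper's learned, self-supervised gate (function name, patch size, and threshold are illustrative assumptions): patches whose input barely changed between frames reuse the cached output of the previous frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_skip_forward(conv, x_cur, x_ref, y_ref, patch=16, thresh=0.1):
    """Sketch: recompute only the patches that changed; skipped patches reuse
    the cached output y_ref from the reference frame."""
    diff = (x_cur - x_ref).abs().mean(dim=1, keepdim=True)       # (B,1,H,W) change map
    patch_diff = F.avg_pool2d(diff, kernel_size=patch)            # per-patch change score
    keep = (patch_diff > thresh).float()                          # 1 = recompute this patch
    y_cur = conv(x_cur)                                           # dense here for clarity; a real
                                                                  # kernel computes kept patches only
    keep_up = F.interpolate(keep, size=y_cur.shape[-2:], mode="nearest")
    return keep_up * y_cur + (1.0 - keep_up) * y_ref              # skipped patches reuse y_ref

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x_ref = torch.randn(1, 3, 128, 128)
x_cur = x_ref + 0.5 * torch.randn_like(x_ref) * (torch.rand(1, 1, 128, 128) > 0.9)
print(patch_skip_forward(conv, x_cur, x_ref, conv(x_ref)).shape)  # (1, 8, 128, 128)
```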
- IEEE TMC’24. Chiron: A Robustness-Aware Incentive Scheme for Edge Learning Via Hierarchical Reinforcement Learning. Yi Liu, Song Guo, Yufeng Zhan, Leijie Wu, Zicong Hong, and Qihua Zhou. IEEE Transactions on Mobile Computing (TMC, CCF-A, IF=7.9), Jan 2024.
Over the past few years, edge learning has achieved significant success in mobile edge networks. However, only a few works have designed incentive mechanisms that motivate edge nodes to participate in edge learning, and most of them consider only myopic optimization and assume that all edge nodes are honest, which lacks long-term sustainability and assurance of final performance. In this paper, we propose Chiron, an incentive-driven Byzantine-resistant long-term mechanism based on hierarchical reinforcement learning (HRL). First, our optimization goal includes both learning-algorithm performance criteria (i.e., global accuracy) and systematical criteria (i.e., resource consumption), which aim to improve the edge learning performance under a given resource budget. Second, we propose a three-layer HRL architecture to handle long-term optimization, short-term optimization, and Byzantine resistance, respectively. Finally, we conduct experiments on various edge learning tasks to demonstrate the superiority of the proposed approach. Specifically, our system successfully excludes malicious and lazy nodes from edge learning participation and achieves 14.96% higher accuracy and 12.66% higher total utility than the state-of-the-art methods under the same budget limit.
- AAAI’24. On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks. Qihua Zhou, Jingcai Guo, Song Guo, Ruibin Li, Jie Zhang, Bingjie Wang, and Zhenda Xu. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), Vancouver, Canada, Feb 2024.
The explosive growth of video traffic on today’s Internet promotes the rise of Neural-enhanced Video Streaming (NeVS), which effectively improves the rate-distortion trade-off by employing a cheap neural super-resolution model for quality enhancement on the receiver side. Overlooked by existing work, we reveal that the NeVS pipeline may suffer from a practical threat, where the crucial codec component (i.e., encoder for compression and decoder for restoration) can trigger adversarial attacks in a man-in-the-middle manner to significantly destroy video recovery performance and finally incur the malfunction of downstream video perception tasks. In this paper, we make the first attempt to inspect the vulnerability of NeVS and discover a novel adversarial attack, called codec hijacking, where the injected invisible perturbation conspires with the malicious encoding matrix by reorganizing the spatial-temporal bit allocation within the bitstream size budget. Such a zero-day vulnerability makes our attack hard to defend against because there is no visual distortion on the recovered videos until the attack happens. More seriously, this attack can be extended to diverse enhancement models, thus exposing a wide range of video perception tasks to the threat. Evaluation on a state-of-the-art video codec benchmark illustrates that our attack significantly degrades the recovery performance of NeVS over previous attack methods. The damaged video quality finally leads to obvious malfunction of downstream tasks with over a 75% success rate. We hope to arouse public attention to codec hijacking and its defence.
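As a much-simplified stand-in for the codec-hijacking attack (which conspires with a malicious encoding matrix inside the codec), the sketch below uses plain projected gradient descent against a receiver-side super-resolution model; `sr_model`, the budget `eps`, and the toy network are assumptions for illustration only, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def invisible_perturbation(sr_model, lr_frame, target, steps=10, eps=2 / 255, alpha=0.5 / 255):
    """Sketch: a PGD perturbation bounded by `eps` (so it stays visually
    invisible on the low-resolution frame) that maximizes the reconstruction
    error of the receiver-side super-resolution output."""
    delta = torch.zeros_like(lr_frame, requires_grad=True)
    for _ in range(steps):
        loss = -F.mse_loss(sr_model(lr_frame + delta), target)  # negate to maximize error
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                   # gradient step on the perturbation
            delta.clamp_(-eps, eps)                              # invisibility budget
            delta.grad.zero_()
    return (lr_frame + delta).detach().clamp(0, 1)

# Toy receiver-side enhancer, just to make the sketch runnable.
sr_model = torch.nn.Sequential(torch.nn.Upsample(scale_factor=2),
                               torch.nn.Conv2d(3, 3, 3, padding=1))
lr = torch.rand(1, 3, 32, 32)
adv = invisible_perturbation(sr_model, lr, target=sr_model(lr).detach())
print((adv - lr).abs().max())   # stays within the eps budget
```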
2023
- IEEE TMC’23. Tree Learning: Towards Promoting Coordination in Scalable Multi-Client Training Acceleration. Tao Guo, Song Guo, Feijie Wu, Wenchao Xu, Jiewei Zhang, Qihua Zhou, Quan Chen, and Weihua Zhuang. IEEE Transactions on Mobile Computing (TMC, CCF-A, IF=7.9), Mar 2023.
Iteration-based collaborative learning (CL) paradigms, such as federated learning (FL) and split learning (SL), face challenges in training neural models over the rapidly growing yet resource-constrained edge devices. Such devices have difficulty in accommodating a full-size large model for FL or affording an excessive waiting time for the mandatory synchronization step in SL. To deal with this challenge, we propose a novel CL framework that adopts a tree-aggregation structure with an adaptive partition and ensemble strategy to achieve optimal synchronization and fast convergence at scale. To find the optimal split point for heterogeneous clients, we also design a novel partitioning algorithm that minimizes idleness during communication and achieves optimal synchronization between clients. In addition, a parallelism paradigm is proposed to unleash the potential of optimum synchronization between the clients and the server to boost the distributed training process without losing model accuracy for edge devices. Furthermore, we theoretically prove that our framework achieves a better convergence rate than state-of-the-art CL paradigms. We conduct extensive experiments and show that our framework is 4.6× faster in training speed compared with traditional methods, without compromising training accuracy.
- AAAI’23. PASS: Patch Automatic Skip Scheme for Efficient Real-Time Video Perception on Edge Devices. Qihua Zhou, Song Guo, Jun Pan, Jiacheng Liang, Zhenda Xu, and Jingren Zhou. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), Washington, DC, USA, Feb 2023.
Real-time video perception tasks are often challenging on resource-constrained edge devices due to the concerns of accuracy drop and hardware overhead, where saving computations is the key to performance improvement. Existing methods either rely on domain-specific neural chips or priorly searched models, which require specialized optimization according to different task properties. In this work, we propose a general and task-independent Patch Automatic Skip Scheme (PASS), a novel end-to-end learning pipeline that supports diverse video perception settings by decoupling acceleration and tasks. The gist is to capture the temporal similarity across video frames and skip the redundant computations at the patch level, where a patch is a non-overlapping square block of the visual frame. PASS equips each convolution layer with a learnable gate to selectively determine which patches can be safely skipped without degrading model accuracy. For each layer, a desired gate needs to make flexible skip decisions based on intermediate features without any annotations, which cannot be achieved by the conventional supervised learning paradigm. To address this challenge, we are the first to construct a tough self-supervisory procedure for optimizing these gates, which learns to extract contrastive representations, i.e., distinguishing similarity and difference, from frame sequences. These high-capacity gates can serve as a plug-and-play module for convolutional neural network (CNN) backbones to implement patch-skippable architectures, and automatically generate a proper skip strategy to accelerate different video-based downstream tasks, e.g., outperforming the state-of-the-art MobileHumanPose (MHP) in 3D pose estimation and FairMOT in multiple object tracking, by up to 9.43× and 12.19× speedups, respectively. By directly processing the raw data of frames, PASS generalizes to real-time video streams on commodity edge devices, e.g., NVIDIA Jetson Nano and mobile phones, and achieves efficient performance in realistic deployment.
- AAAI’23. Graph Knows Unknowns: Reformulate Zero-Shot Learning as Sample-Level Graph Recognition. Jingcai Guo, Song Guo, Qihua Zhou, Ziming Liu, Xiaocheng Lu, and Fushuo Huo. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), Washington, DC, USA, Feb 2023.
Zero-shot learning (ZSL) is an extreme case of transfer learning that aims to recognize samples (e.g., images) of unseen classes relying on a train set covering only seen classes and a set of auxiliary knowledge (e.g., semantic descriptors). Existing methods usually resort to constructing a visual-to-semantics mapping based on features extracted from each whole sample. However, since the visual and semantic spaces are inherently independent and may exist in different manifolds, these methods may easily suffer from the domain bias problem due to the knowledge transfer from seen to unseen classes. Unlike existing works, this paper investigates fine-grained ZSL from the novel perspective of the sample-level graph. Specifically, we decompose an input into several fine-grained elements and construct a graph structure per sample to measure and utilize element-granularity relations within each sample. Taking advantage of recently developed graph neural networks (GNNs), we formulate the ZSL problem as a graph-to-semantics mapping task, which can better exploit element-semantics correlation and local sub-structural information in samples. Experimental results on widely used benchmark datasets demonstrate that the proposed method can mitigate the domain bias problem and achieve competitive performance against other representative methods.
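The following is a compact sketch of the sample-level-graph idea under assumptions (the kNN graph, a single message-passing round, and the dimensions are illustrative choices, not the published model): element features of one sample become graph nodes, edges come from feature similarity, and the pooled graph representation is mapped to the semantic space.

```python
import torch
import torch.nn as nn

class SampleGraphZSL(nn.Module):
    """Sketch: per-sample graph over fine-grained elements, one round of
    message passing, then a graph-to-semantics (attribute) mapping."""
    def __init__(self, elem_dim=512, sem_dim=312, k=4):
        super().__init__()
        self.msg = nn.Linear(elem_dim, elem_dim)
        self.to_sem = nn.Linear(elem_dim, sem_dim)
        self.k = k

    def forward(self, elements):                       # (B, N, elem_dim) element features
        dist = torch.cdist(elements, elements)         # pairwise distances between elements
        adj = torch.zeros_like(dist)
        idx = dist.topk(self.k, largest=False).indices # k nearest elements per node
        adj.scatter_(-1, idx, 1.0)
        h = torch.relu(adj @ self.msg(elements) / self.k) + elements   # message passing + residual
        return self.to_sem(h.mean(dim=1))              # pooled graph -> semantic descriptor

elements = torch.randn(2, 16, 512)                     # 2 samples, 16 elements each
print(SampleGraphZSL()(elements).shape)                # torch.Size([2, 312])
```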
2022
- NeurIPS’22. Hierarchical Channel-spatial Encoding for Communication-efficient Collaborative Learning. Qihua Zhou, Song Guo, Yi Liu, Jie Zhang, Jiewei Zhang, Tao Guo, Zhenda Xu, Xun Liu, and Zhihao Qu. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS, CCF-A), New Orleans, USA, Nov 2022.
Collaborative learning (CL) systems often face the performance bottleneck of limited bandwidth, where multiple low-end devices continuously generate data and transmit intermediate features to the cloud for incremental training. To this end, improving the communication efficiency by reducing traffic size is one of the most crucial issues for realistic deployment. Existing systems mostly compress features at the pixel level and ignore the characteristics of feature structure, which could be further exploited for more efficient compression. In this paper, we take new insights into implementing scalable CL systems through a hierarchical compression on features, termed Stripe-wise Group Quantization (SGQ). Different from previous unstructured quantization methods, SGQ captures both channel and spatial similarity in pixels, and simultaneously encodes features at these two levels to gain a much higher compression ratio. In particular, we refactor the feature structure based on inter-channel similarity and bound the gradient deviation caused by quantization, in the forward and backward passes, respectively. Such a double-stage pipeline makes SGQ hold a sublinear convergence order, as in vanilla SGD-based optimization. Extensive experiments show that SGQ achieves a traffic reduction ratio of up to 15.97× and provides a 9.22× image processing speedup over uniform quantized training, while preserving model accuracy comparable to FP32, even using 4-bit quantization. This verifies that SGQ can be applied to a wide spectrum of edge intelligence applications.
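A simplified reading of the grouped-quantization step, not the released SGQ implementation: channels are grouped by a similarity proxy (here their mean activation, an assumption) so that similar channels share one quantization scale, which is the intuition behind exploiting inter-channel structure for higher compression.

```python
import torch

def stripe_wise_group_quantize(feat, num_groups=4, bits=4):
    """Sketch: group channels by similarity and quantize each channel stripe
    with a shared uniform scale; returns the dequantized view plus the scales."""
    order = feat.mean(dim=(0, 2, 3)).argsort()        # sort channels by a similarity proxy
    groups = order.chunk(num_groups)                   # channel stripes
    q_feat = torch.empty_like(feat)
    levels = 2 ** bits - 1
    scales = []
    for g in groups:
        block = feat[:, g]                             # one channel stripe
        lo, hi = block.min(), block.max()
        scale = (hi - lo).clamp(min=1e-8) / levels
        q = ((block - lo) / scale).round()             # shared uniform quantization
        q_feat[:, g] = q * scale + lo                  # dequantized view for the sketch
        scales.append(scale)
    return q_feat, scales

feat = torch.randn(2, 32, 16, 16)
deq, _ = stripe_wise_group_quantize(feat)
print((deq - feat).abs().mean())                       # small quantization error
```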
2021
- IEEE TC Spotlight. A Comprehensive Inspection of the Straggler Problem. Qihua Zhou, Song Guo, Haodong Lu, Li Li, Minyi Guo, Yanfei Sun, and Kun Wang. The Spotlight of IEEE Transactions on Computers, Sep 2021.
The parameter server is a popular distributed processing paradigm for operating distributed deep learning (DL) applications. As a growing number of DL models are trained via shared clusters, machines confront a heterogeneous environment, which incurs an unexpected phenomenon called the straggler, i.e., a task with an abnormally slow processing speed. Addressing stragglers is a crucial issue in distributed DL applications, since stragglers significantly hamper system performance. While many techniques have been deployed to mitigate stragglers, they may not achieve their goals in the presence of heterogeneity, where systems take much longer to reach DL training convergence than in a homogeneous environment, as evidenced by our experimental study. With the methodology of straggler projection and the abstraction of parallelism, a new synchronization mechanism called Elastic Parallelism Synchronous Parallel (EPSP) is proposed, which exploits the iteration-acceleration advantage of stale synchronous parallel and overcomes the barrier waiting time of bulk synchronous parallel. More precisely, EPSP supports both enforced and slack synchronization by adjusting the staleness parameter.
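To illustrate the staleness knob that moves between enforced and slack synchronization, here is a tiny sketch (function names and the example values are hypothetical, not the paper's implementation): a worker may run ahead of the slowest worker by at most `staleness` iterations.

```python
def must_wait(worker_iter, slowest_iter, staleness):
    """staleness=0 gives enforced (BSP-like) lockstep; a larger value gives
    slack (SSP-like) synchronization where fast workers run ahead."""
    return worker_iter - slowest_iter > staleness

def try_advance(worker_iters, worker_id, staleness):
    """Advance one worker's local iteration if the staleness bound allows it."""
    if must_wait(worker_iters[worker_id], min(worker_iters), staleness):
        return False                      # blocked: must wait for the straggler
    worker_iters[worker_id] += 1          # proceed with the next local iteration
    return True

# Example: worker 2 is a straggler; with staleness=2 the others may run
# at most two iterations ahead of it before they block.
iters = [3, 3, 1]
print(try_advance(iters, 0, staleness=2))   # True  -> iters becomes [4, 3, 1]
print(try_advance(iters, 0, staleness=2))   # False -> already 3 iterations ahead
```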
- IEEE IoTJ’21. On-Device Learning Systems for Edge Intelligence: A Software and Hardware Synergy Perspective. Qihua Zhou, Zhihao Qu, Song Guo, Boyuan Luo, Jingcai Guo, Zhenda Xu, and Rajendra Akerkar. IEEE Internet of Things Journal (IoTJ, JCR-Q1, IF=10.6), Aug 2021.
Modern machine learning (ML) applications are often deployed in the cloud environment to exploit the computational power of clusters. However, this in-cloud computing scheme cannot satisfy the demands of emerging edge intelligence scenarios, including providing personalized models, protecting user privacy, adapting to real-time tasks, and saving resource cost. To conquer the limitations of conventional in-cloud computing, on-device learning has arisen, which keeps the end-to-end ML procedure entirely on user devices, without unnecessary involvement of the cloud. Despite the promising advantages of on-device learning, implementing a high-performance on-device learning system still faces many severe challenges, such as insufficient user training data, backward propagation (BP) blocking, and limited peak processing speed. Observing the substantial improvement space in the implementation and acceleration of on-device learning systems, we present a comprehensive analysis of the latest research progress and point out potential optimization directions from the system perspective. This survey presents a software and hardware synergy of on-device learning techniques, covering the scope of model-level neural network design, algorithm-level training optimization, and hardware-level instruction acceleration. We hope this survey can bring fruitful discussions and inspire researchers to further promote the field of edge intelligence.
- USENIX ATC’21. Octo: INT8 Training with Loss-aware Compensation and Backward Quantization for Tiny On-device Learning. Qihua Zhou, Song Guo, Zhihao Qu, Jingcai Guo, Zhenda Xu, Jiewei Zhang, Tao Guo, Boyuan Luo, and Jingren Zhou. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC, CCF-A), Virtual Event, Jul 2021.
On-device learning is an emerging technique to pave the last mile of enabling edge intelligence, eliminating the limitations of conventional in-cloud computing, which demands substantial computational capacity and memory. A high-performance on-device learning system requires breaking the constraints of limited resources and alleviating computational overhead. In this paper, we show that employing 8-bit fixed-point (INT8) quantization in both the forward and backward passes over a deep model is a promising way to enable tiny on-device learning in practice. The key to an efficient quantization-aware training method is to exploit hardware-level enabled acceleration while preserving the training quality in each layer. However, off-the-shelf quantization methods cannot handle the on-device learning paradigm of fixed-point processing. To overcome these challenges, we propose a novel INT8 training method, which optimizes the computation of the forward and backward passes via the delicately designed Loss-aware Compensation (LAC) and Parameterized Range Clipping (PRC), respectively. Specifically, we build a new network component, the compensation layer, to automatically counteract the quantization error of tensor arithmetic. We implement our method in Octo, a lightweight cross-platform system for tiny on-device learning. Evaluation on commercial AI chips shows that Octo holds higher training efficiency over state-of-the-art quantization training methods, while achieving adequate processing speedup and memory reduction over full-precision training.
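A minimal sketch of symmetric INT8 fake-quantization with a parameterized clipping range, as a simplified reading of the range-clipping idea (the function names and the error term standing in for the compensation layer are assumptions, not the Octo code).

```python
import torch

def int8_quantize(x, clip_range):
    """Sketch: clip to [-clip_range, clip_range] and map to signed INT8."""
    scale = clip_range / 127.0
    q = (x.clamp(-clip_range, clip_range) / scale).round().to(torch.int8)
    return q, scale

def int8_dequantize(q, scale):
    return q.float() * scale

def quantization_error(x, clip_range):
    """The residual a compensation layer could learn to counteract, in the
    spirit of loss-aware compensation."""
    q, scale = int8_quantize(x, clip_range)
    return x - int8_dequantize(q, scale)

x = torch.randn(4, 8)
q, scale = int8_quantize(x, clip_range=3.0)
print(q.dtype, quantization_error(x, 3.0).abs().max())   # torch.int8, small residual
```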
- IEEE TPDS’21. Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization. Qihua Zhou, Song Guo, Zhihao Qu, Peng Li, Li Li, Minyi Guo, and Kun Wang. IEEE Transactions on Parallel and Distributed Systems (TPDS, CCF-A, IF=5.3), May 2021.
The parameter server (PS) paradigm has achieved great success in deploying large-scale distributed Deep Learning (DL) systems. However, these systems implicitly assume that the cluster is homogeneous, and this assumption does not hold in many real-world cases. Although previous efforts have been made to address heterogeneity, they mainly prioritize the contribution of fast workers and reduce the involvement of slow workers, resulting in the limitations of workload imbalance and computation inefficiency. We reveal that grouping workers into communities, an abstraction proposed by us, and handling parameter synchronization at the community level can conquer these limitations and accelerate training convergence. The inspiration for the community comes from our exploration of prior knowledge about the similarity between workers, which is often neglected by previous work. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CASP), which uses an Asynchronous Advantage Actor-Critic (A3C)-based algorithm to intelligently determine the community configuration and fully improve the synchronization performance. The whole idea has been implemented in a prototype system called Petrel that achieves a good balance between convergence efficiency and communication overhead. The evaluation under various benchmarks with multiple metrics and baseline comparison demonstrates the effectiveness of Petrel. Specifically, Petrel accelerates the training convergence speed by up to 1.87× and reduces communication traffic by up to 26.85 percent on average over non-community synchronization mechanisms.
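To illustrate the community abstraction, here is a plain similarity grouping standing in for the paper's A3C-driven configuration search (the grouping rule and the example speeds are assumptions): workers with similar iteration speeds are synchronized together so fast communities are not blocked by slow ones.

```python
def group_workers_into_communities(speeds, num_communities=3):
    """Sketch: rank workers by speed and cut them into equally sized
    communities; each community synchronizes internally."""
    ranked = sorted(range(len(speeds)), key=lambda w: speeds[w])
    size = -(-len(ranked) // num_communities)          # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

# Example: 6 workers with heterogeneous speeds (iterations per second).
communities = group_workers_into_communities([1.0, 0.9, 3.1, 2.8, 5.0, 5.2])
print(communities)   # [[1, 0], [3, 2], [4, 5]]
```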
- IEEE TPDS’21. Canary: Decentralized Distributed Deep Learning Via Gradient Sketch and Partition in Multi-Interface Networks. Qihua Zhou, Kun Wang, Haodong Lu, Wenyao Xu, Yanfei Sun, and Song Guo. IEEE Transactions on Parallel and Distributed Systems (TPDS, CCF-A, IF=5.3), Apr 2021.
Multi-interface networks are efficient infrastructures for deploying distributed Deep Learning (DL) tasks, as the model gradients generated by each worker can be exchanged with others via different links in parallel. Although this decentralized parameter synchronization mechanism can reduce the time of gradient exchange, building a high-performance distributed DL architecture still requires balancing communication efficiency and computational utilization, i.e., addressing the issues of traffic burst, data consistency, and programming convenience. To achieve this goal, we asynchronously exchange gradient pieces without central control in multi-interface networks. We propose Piece-level Gradient Exchange and Multi-interface Collective Communication to handle parameter synchronization and traffic transmission, respectively. Specifically, we design a gradient sketch approach based on 8-bit uniform quantization to compress gradient tensors and introduce the colayer abstraction to better handle gradient partition, exchange, and pipelining. Also, we provide general programming interfaces to capture the synchronization semantics and build the Gradient Exchange Index (GEI) data structures to make our approach applicable online. We implement our algorithms in a prototype system called Canary using PyTorch-1.4.0. Experiments conducted on Alibaba Cloud demonstrate that Canary reduces traffic by 56.28 percent on average and completes training up to 1.61×, 2.28×, and 2.84× faster than BML, Ako on PyTorch, and PS on TensorFlow, respectively.
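The sketch below shows the two mechanical pieces named above in simplified form, 8-bit uniform quantization of a gradient tensor plus a piece-level split across interfaces, while omitting the colayer and GEI bookkeeping; the function names and shapes are illustrative assumptions.

```python
import numpy as np

def sketch_and_partition(grad, num_interfaces=2, bits=8):
    """Sketch: uniform 8-bit quantization of the gradient, then split the
    quantized bytes into one piece per network interface."""
    lo, hi = grad.min(), grad.max()
    scale = float(hi - lo) / (2 ** bits - 1)
    scale = scale if scale > 0 else 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)     # 8-bit uniform quantization
    pieces = np.array_split(q.ravel(), num_interfaces)     # one piece per interface/link
    return pieces, (float(lo), scale)

def reassemble(pieces, meta, shape):
    lo, scale = meta
    q = np.concatenate(pieces).reshape(shape)
    return q.astype(np.float32) * scale + lo               # dequantize on the receiver

grad = np.random.randn(4, 8).astype(np.float32)
pieces, meta = sketch_and_partition(grad)
print(np.abs(reassemble(pieces, meta, grad.shape) - grad).max())  # small quantization error
```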
- IEEE TC’21. Falcon: Addressing Stragglers in Heterogeneous Parameter Server Via Multiple Parallelism. Qihua Zhou, Song Guo, Haodong Lu, Li Li, Minyi Guo, Yanfei Sun, and Kun Wang. IEEE Transactions on Computers (TC, CCF-A, IF=3.7), Jan 2021.
The parameter server architecture has shown promising performance advantages when handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retard DL training progress. Previous solutions for solving stragglers may not fully exploit the computation resources of the cluster, as evidenced by our experiments, especially in the heterogeneous environment. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines to solve this problem in two aspects: (1) controlling each worker’s training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation under various benchmarks with baseline comparison demonstrates the superiority of our system. Specifically, Falcon reduces the training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
2020
- IEEE MPCE’20. Learning-based Green Workload Placement for Energy Internet in Smart Cities. Qihua Zhou, Yanfei Sun, Haodong Lu, and Kun Wang. IEEE Journal of Modern Power Systems and Clean Energy (MPCE, JCR-Q1, IF=6.3), Dec 2020.
The Energy Internet is a fundamental infrastructure for deploying green city applications, where energy saving and job acceleration are two critical issues to address. In contrast to existing approaches that focus on static metrics under the assumption of complete prior knowledge of resource information, this paper accounts for both application-level properties and energy-level requirements by jointly considering energy saving and job acceleration during job runtime. Considering the online environment of smart city applications, the main objective is formulated as an optimization problem over model partition and function assignment. To minimize energy cost and job completion time together, a green workload placement approach is proposed using the multi-action deep reinforcement learning method. Evaluations with real-world applications demonstrate the superiority of this method over state-of-the-art methods.
- IEEE ICDCS’20. Petrel: Community-aware Synchronous Parallel for Heterogeneous Parameter Server. Qihua Zhou, Song Guo, Peng Li, Yanfei Sun, Li Li, Minyi Guo, and Kun Wang. In Proceedings of the 40th IEEE International Conference on Distributed Computing Systems (ICDCS, CCF-B), Singapore, Nov 2020.
To address the impact of heterogeneity in distributed Deep Learning (DL) systems, most previous approaches focus on prioritizing the contribution of fast workers and reducing the involvement of slow workers, incurring the limitations of workload imbalance and computation inefficiency. We reveal that grouping workers into communities, an abstraction proposed by us, and handling parameter synchronization at the community level can conquer these limitations and accelerate training convergence. The inspiration for the community comes from our exploration of prior knowledge about the similarity between workers, which is often neglected by previous work. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CSP), which uses Asynchronous Advantage Actor-Critic (A3C), a Reinforcement Learning (RL) based algorithm, to intelligently determine the community configuration and fully improve the synchronization performance. The whole idea has been implemented in a system called Petrel that achieves a good balance between convergence efficiency and communication overhead. The evaluation under different benchmarks demonstrates that our approach can effectively accelerate the training convergence speed and reduce synchronization traffic.
- ACM MM’20. Dual-view Attention Networks for Single Image Super-Resolution. Jingcai Guo, Shiheng Ma, Jie Zhang, Qihua Zhou, and Song Guo. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM, CCF-A), Seattle, USA, Oct 2020.
One non-negligible flaw of convolutional neural network (CNN) based single image super-resolution (SISR) models is that most of them cannot restore high-resolution (HR) images containing sufficient high-frequency information. Worse still, as the depth of CNNs increases, training easily suffers from vanishing gradients. These problems hinder the effectiveness of CNNs in SISR. In this paper, we propose the Dual-view Attention Networks to alleviate these problems for SISR. Specifically, we propose the local aware (LA) and global aware (GA) attentions to deal with LR features in unequal manners, which can highlight the high-frequency components and discriminate each feature from LR images in the local and global views, respectively. Furthermore, the local attentive residual-dense (LARD) block, which combines the LA attention with multiple residual and dense connections, is proposed to fit a deeper yet easy-to-train architecture. The experimental results verify the effectiveness of our model compared with other state-of-the-art methods.
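Below is a generic local + global attention pair for SR features, offered only as an illustration of the two-view re-weighting idea and not as the paper's exact LA/GA design; the module name, channel count, and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class DualViewAttentionSketch(nn.Module):
    """Sketch: a spatial (local-view) attention map from a small conv and a
    channel (global-view) attention vector from pooled statistics; both
    re-weight the LR features in unequal, view-specific manners."""
    def __init__(self, channels=64):
        super().__init__()
        self.local_view = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1), nn.Sigmoid())
        self.global_view = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.local_view(x) * self.global_view(x)

feats = torch.randn(1, 64, 32, 32)
print(DualViewAttentionSketch()(feats).shape)   # torch.Size([1, 64, 32, 32])
```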
2019
- IEEE TC’19. Fast Coflow Scheduling via Traffic Compression and Stage Pipelining in Datacenter Networks. Qihua Zhou, Kun Wang, Peng Li, Deze Zeng, Song Guo, Baoliu Ye, and Minyi Guo. IEEE Transactions on Computers (TC, CCF-A, IF=3.7), Dec 2019.
Big data analytics in datacenters often involves scheduling of data-parallel jobs. Traditional scheduling techniques based on improving network resource utilization are subject to the limited bandwidth of datacenter networks. To alleviate the shortage of bandwidth, some cluster frameworks employ traffic compression to reduce transmission consumption. However, they tackle scheduling in a coarse-grained manner at the task level and do not perform well in terms of flow-level metrics due to high complexity. Fortunately, the abstraction of coflow pioneers a new perspective to facilitate scheduling efficiency. In this paper, we introduce a coflow compression mechanism to minimize the completion time in data-intensive applications. Due to the NP-hardness of the problem, we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF) to solve it. For online applicability, FVDF supports stage pipelining to accelerate scheduling and exploits recurrent neural networks (RNNs) to predict compression speed. Meanwhile, we build Swallow, an efficient scheduling system that implements our proposed algorithms. It minimizes coflow completion time (CCT) while guaranteeing resource conservation and starvation freedom. The results of both trace-driven simulations and real experiments show the superiority of our algorithm over existing ones. Specifically, Swallow speeds up CCT and job completion time (JCT) by up to 1.47× and 1.66× on average, respectively, over SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces data traffic by up to 48.41 percent on average.
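A small sketch of a fastest-disposal-first ordering, an illustrative reading of the heuristic rather than the Swallow implementation (the coflow fields and numbers are made-up assumptions): each coflow is ranked by how quickly its remaining post-compression volume can be disposed, and the one that finishes soonest is scheduled first, which shortens average coflow completion time.

```python
def fvdf_order(coflows):
    """Sketch: order coflows by estimated disposal time, smallest first."""
    def disposal_time(cf):
        remaining = cf["volume"] * (1.0 - cf["compression_ratio"])  # volume left after compression
        return remaining / cf["disposal_rate"]                       # time to dispose it
    return sorted(coflows, key=disposal_time)

# Example: volume in GB, disposal_rate in GB/s.
jobs = [
    {"name": "shuffle-A", "volume": 40.0, "compression_ratio": 0.5, "disposal_rate": 2.0},
    {"name": "shuffle-B", "volume": 10.0, "compression_ratio": 0.1, "disposal_rate": 1.0},
]
print([c["name"] for c in fvdf_order(jobs)])   # shuffle-B first (9 s vs 10 s)
```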
- IEEE ICDCS’19. Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server. Qihua Zhou, Kun Wang, Song Guo, Haodong Lu, Li Li, Minyi Guo, and Yanfei Sun. In Proceedings of the 39th IEEE International Conference on Distributed Computing Systems (ICDCS, CCF-B), Dallas, USA, Jul 2019.
The parameter server paradigm has shown great performance superiority for handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retard DL training progress. Previous solutions for solving stragglers may not fully exploit the computation capacity of a cluster, as evidenced by our experiments. This motivates us to make an attempt at building a new parameter server architecture that mitigates and addresses stragglers in heterogeneous DL from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for resolving this problem: (1) reducing straggler emergence frequency via elastic parallelism control and (2) transferring blocked tasks to pioneer workers to fully exploit cluster computation capacity. Following the guidelines, we propose the abstraction of parallelism as an infrastructure and elaborate the Elastic-Parallelism Synchronous Parallel (EPSP) that supports both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which efficiently accelerates DL training progress in the presence of stragglers. Evaluation under various benchmarks with baseline comparison evidences the superiority of our system. Specifically, Falcon yields shorter convergence time, with up to 61.83%, 55.19%, 38.92%, and 23.68% reduction over FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
2018
- IEEE COMST’18. Cluster Frameworks for Efficient Scheduling and Resource Allocation in Data Center Networks: A Survey. Kun Wang, Qihua Zhou, Song Guo, and Jiangtao Luo. IEEE Communications Surveys & Tutorials (COMST, JCR-Q1, IF=35.6), Jul 2018.
Data centers are widely used for big data analytics, which often involve data-parallel jobs, including query and web service. Meanwhile, cluster frameworks are rapidly developed for data-intensive applications in data center networks (DCNs). To promote the performance of these frameworks, many efforts have been made to improve scheduling strategies and resource allocation algorithms. With the deployment of geo-distributed data centers and data-intensive applications, optimization in DCNs has regained pervasive attention in both industry and academia. Many solutions, such as coflow-aware scheduling and speculative execution, have been proposed to meet various requirements. Therefore, we present a solid starting ground and comprehensive overview of this area to help readers quickly understand state-of-the-art technologies and research progress. We observe that algorithms in cluster frameworks are implemented with different guidelines and can be classified according to scheduling granularity, controller management, and prior-knowledge requirement. In addition, mechanisms for conquering crucial challenges in DCNs are discussed, including providing low latency and minimizing job completion time. Moreover, we analyze the desirable properties of fault tolerance and scalability to illuminate the design principles of distributed systems. We hope that this paper will shed light on this promising land and serve as a guide for further research.
- IEEE IPDPS’18. Swallow: Joint Online Scheduling and Coflow Compression in Datacenter Networks. Qihua Zhou, Peng Li, Kun Wang, Deze Zeng, Song Guo, and Minyi Guo. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS, CCF-B), Vancouver, Canada, May 2018.
Big data analytics in datacenters often involves scheduling of data-parallel jobs, which are bottlenecked by the limited bandwidth of datacenter networks. To alleviate the shortage of bandwidth, some existing work has proposed traffic compression to reduce the amount of data transmitted over the network. However, the proposed traffic compression works in a coarse-grained manner at the job level, leaving a large optimization space unexplored for further performance improvement. In this paper, we propose a flow-level traffic compression and scheduling system, called Swallow, to accelerate data-intensive applications. Specifically, we target coflows, an elegant abstraction of the parallel flows generated by big data jobs. With the objective of minimizing coflow completion time (CCT), we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF) and implement Swallow on top of Spark. The results of both trace-driven simulations and real experiments show the superiority of our system over existing algorithms. Swallow reduces CCT and job completion time (JCT) by up to 1.47× and 1.66× on average, respectively, over SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces data traffic by up to 48.41% on average.