Background: Traditional hand gesture recognition systems primarily rely on
optical cameras or wearable sensors, which often face challenges such as privacy
concerns, sensitivity to lighting conditions, and the need for physical contact.
Wi-Fi-based sensing using Channel State Information (CSI) has emerged as a
promising non-intrusive and privacy-preserving alternative. However, the
performance of these systems is frequently degraded by environmental noise,
multipath effects, and the lack of generalizability across different domains and
locations.
Aim: This research aims to develop a robust and highly accurate deep
learning-based framework for hand gesture recognition that can effectively generalize across
various environments and users. The goal is to minimize the performance drop in
"cross-domain" scenarios without requiring extensive retraining for each new
setting.
Methodology: In this study, we propose a novel architecture named Dual Attention
and Cross-Fusion Network (DACN). The model adopts a dual-stream strategy that
processes two Wi-Fi signal components, phase information and Doppler Frequency
Shift (DFS), in parallel. A ResNet-18 backbone performs base feature extraction
and is combined with a dual attention mechanism (channel and spatial attention
gates) that emphasizes motion-relevant features and suppresses environmental
noise. The model's performance was rigorously evaluated
using 180 experimental scenarios across three benchmark datasets: ARIL, CSIDA,
and Widar3.0, incorporating various data augmentation and preprocessing
techniques.
Conclusions: The experimental results demonstrate that the proposed
DACN model achieves superior stability and accuracy in diverse environments. By
effectively fusing phase and Doppler features through the attention mechanism, the
system shows high resilience to multipath interference and user-specific variations.
The findings indicate that the integration of deep residual learning with dual
attention significantl