[Objective] Images acquired by a single remote sensing sensor are inherently constrained by hardware and physical limitations, making it difficult to achieve high spatial and high spectral resolution simultaneously. Hyperspectral images provide rich spectral information but usually suffer from low spatial resolution, whereas multispectral images offer finer spatial detail at the cost of reduced spectral fidelity. Effectively fusing these two complementary modalities remains challenging, particularly when spectral consistency must be preserved while spatial structures are enhanced. To address this issue, we propose a local–global collaborative multi-scale feature augmentation method for hyperspectral and multispectral image fusion. The core objective is to fully exploit the complementary spatial and spectral characteristics of the heterogeneous data sources within a unified deep learning framework, thereby generating fused images with both high spatial detail and high spectral accuracy.

[Methods] The proposed fusion framework comprises four cooperative modules: feature extraction, feature fusion, feature augmentation, and image reconstruction. First, the feature extraction module independently encodes the hyperspectral and multispectral inputs with dedicated convolutional layers to obtain hierarchical spectral and spatial feature representations, ensuring that modality-specific characteristics are preserved in the early stages of processing. Next, the feature fusion module integrates the extracted features into a shared latent space, enabling cross-modal interaction and alignment between hyperspectral spectral features and multispectral spatial features. The core component of the framework is the feature augmentation module, which enhances feature representations from both local and global perspectives through a local feature augmentation sub-module and a global feature augmentation sub-module. The local sub-module employs multiple convolutional blocks with different receptive fields to strengthen fine-grained spatial details, such as edges, textures, and local structures, which are critical for improving spatial resolution. The global sub-module focuses on modeling long-range dependencies and global contextual information: it integrates spectral–spatial fusion Transformer blocks to capture complex correlations across spectral bands and spatial locations, and combines them with multi-scale convolutional blocks to enhance global feature expressiveness and robustness. By jointly considering local details and global context, the proposed augmentation strategy achieves balanced and comprehensive feature enhancement. Finally, the image reconstruction module maps the augmented fusion features back to the image domain, producing the final high-resolution hyperspectral image.

[Conclusions] Extensive experiments on multiple benchmark hyperspectral and multispectral datasets demonstrate the effectiveness of the proposed method. Both quantitative evaluations and qualitative visual comparisons show that the proposed approach consistently outperforms existing state-of-the-art fusion methods in terms of spatial detail preservation, spectral fidelity, and overall fusion quality.
The results indicate that the local–global collaborative multi-scale feature augmentation strategy can effectively mitigate the spatial–spectral trade-off inherent in single-sensor imaging systems. Consequently, the proposed method provides a robust and versatile solution for hyperspectral and multispectral image fusion, with strong potential for practical applications in remote sensing, environmental monitoring, and related fields.
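To make the described local–global augmentation strategy concrete, the sketch below outlines one possible PyTorch realization of the core augmentation module. The class names, channel width, and the use of standard multi-head self-attention as a stand-in for the spectral–spatial fusion Transformer block are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a local–global feature augmentation module.
# All layer choices and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class LocalAugmentation(nn.Module):
    """Local branch: parallel convolutions with different receptive fields."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Different kernel sizes emphasize edges and textures at different scales.
        y = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        return self.act(self.fuse(y)) + x  # residual keeps the original detail


class GlobalAugmentation(nn.Module):
    """Global branch: self-attention over spatial positions to model
    long-range context (a simplified stand-in for the spectral–spatial
    fusion Transformer block described in the abstract)."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)                  # long-range dependencies
        return x + y.transpose(1, 2).reshape(b, c, h, w)


class LocalGlobalAugmentation(nn.Module):
    """Applies both branches to the fused HSI/MSI features and merges them."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = LocalAugmentation(channels)
        self.glob = GlobalAugmentation(channels)
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        return self.merge(torch.cat([self.local(x), self.glob(x)], dim=1))


if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)             # fused feature map (assumed size)
    out = LocalGlobalAugmentation(64)(feats)
    print(out.shape)                               # torch.Size([1, 64, 32, 32])
```

In this sketch the two branches operate in parallel on the same fused features and are merged by a 1x1 convolution; the actual method may instead stack or interleave the local and global sub-modules.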