Problem description

Green building energy consumption prediction requires understanding the complex mathematical relationships between building systems, environmental conditions, and temporal dynamics. This section establishes the mathematical foundation of the research scenario through a systematic formulation that progressively reveals the inherent challenges.

The fundamental energy consumption prediction problem in green buildings can initially be formulated as a basic regression task. In its simplest form, the relationship between input variables and energy consumption can be expressed as:

$$\begin{aligned} \hat{E}_{t+1} = f(\textbf{X}_t) \end{aligned}$$ (1)

where \(\hat{E}_{t+1}\) represents the predicted energy consumption at time \(t+1\), \(\textbf{X}_t\) denotes the input feature vector at time \(t\), and \(f(\cdot )\) is the prediction function. However, this basic formulation fails to capture the temporal complexities inherent in green building systems.

Green building energy consumption exhibits strong temporal dependencies that extend beyond simple autoregressive patterns. The energy demand at any given time is influenced by consumption patterns from multiple previous time steps, necessitating a more sophisticated temporal formulation:

$$\begin{aligned} \hat{E}_{t+\Delta t} = f(E_{t-\tau :t}, \textbf{C}_{t-\tau :t}) \end{aligned}$$ (2)

where \(E_{t-\tau :t} = \{E_{t-\tau }, E_{t-\tau +1}, \ldots , E_t\}\) represents the historical energy consumption sequence over a time window \(\tau\), \(\textbf{C}_{t-\tau :t}\) denotes the sequence of building control states, and \(\Delta t\) is the prediction horizon. The window size \(\tau\) must be sufficiently large to capture long-term dependencies while remaining computationally tractable.

Climate variability introduces additional complexity as weather conditions directly influence energy demand through heating, cooling, and lighting requirements.
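
The windowed formulation in Eq. (2) can be sketched as a plain sliding-window construction. This is a minimal illustration, not the authors' implementation: the helper name `make_windows`, its arguments, and the sample readings are assumptions, and the control-state sequence \(\textbf{C}_{t-\tau :t}\) is omitted for brevity.

```python
from typing import List, Sequence, Tuple


def make_windows(
    energy: Sequence[float], tau: int, horizon: int
) -> List[Tuple[List[float], float]]:
    """Pair each history E_{t-tau:t} (tau + 1 readings) with the target E_{t+horizon}.

    Hypothetical helper for illustration only.
    """
    if tau < 0 or horizon < 1:
        raise ValueError("tau must be >= 0 and horizon must be >= 1")
    samples = []
    # t is the index of the most recent observation in each window.
    for t in range(tau, len(energy) - horizon):
        window = list(energy[t - tau : t + 1])
        samples.append((window, energy[t + horizon]))
    return samples


if __name__ == "__main__":
    hourly_kwh = [10.0, 12.0, 11.0, 15.0, 14.0]  # made-up readings
    for window, target in make_windows(hourly_kwh, tau=2, horizon=1):
        print(window, "->", target)
```

A larger `tau` gives each sample more history but reduces the number of samples that can be cut from a fixed-length series, which is one concrete form of the tractability trade-off mentioned above.
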
The mathematical relationship between climate variables and energy consumption can be expressed through a climate-coupled model:

$$\begin{aligned} \hat{E}_{t+\Delta t} = f(E_{t-\tau :t}, \textbf{W}_{t-\tau :t}, \textbf{S}_t) \end{aligned}$$ (3)

where \(\textbf{W}_{t-\tau :t} = \{\textbf{W}_{t-\tau }, \textbf{W}_{t-\tau +1}, \ldots , \textbf{W}_t\}\) represents the historical weather sequence with \(\textbf{W}_t = [T_t, H_t, P_t, R_t, I_t]^{\top }\) containing temperature \(T_t\), humidity \(H_t\), precipitation \(P_t\), wind speed \(R_t\), and solar irradiance \(I_t\), while \(\textbf{S}_t\) captures seasonal and cyclical patterns that modulate climate effects.

Real-world green building energy consumption exhibits multi-scale temporal patterns ranging from short-term fluctuations to long-term seasonal cycles. This multi-scale behavior requires considering different temporal frequencies simultaneously:

$$\begin{aligned} \hat{E}_{t+\Delta t} = \sum _{k=1}^{K} \alpha _k \cdot f_k(E_{t-\tau _k:t}, \textbf{W}_{t-\tau _k:t}) + \sum _{j=1}^{J} \beta _j \cdot g_j(\textbf{S}_{t,j}) \end{aligned}$$ (4)

where \(f_k(\cdot )\) represents prediction functions operating at different time scales with corresponding window sizes \(\tau _k\), \(g_j(\cdot )\) captures seasonal components with \(\textbf{S}_{t,j}\) representing seasonal features at different frequencies, \(\alpha _k\) and \(\beta _j\) are weighting coefficients, and \(K\) and \(J\) denote the number of temporal scales and seasonal components, respectively.

Cross-domain generalization challenges arise when applying prediction models across different building types, geographic locations, and climate zones.
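
To make the structure of Eq. (4) concrete, the sketch below combines per-scale predictors and seasonal terms as a weighted sum. The specific choices are assumptions made only for illustration: each \(f_k\) is taken to be the mean of the most recent \(\tau _k + 1\) readings, each \(g_j\) is a sinusoid with a given period, the weather input is left out, and the names `window_mean` and `multi_scale_forecast` are hypothetical.

```python
import math
from typing import Sequence


def window_mean(energy: Sequence[float], tau: int) -> float:
    """Stand-in for f_k: mean of the most recent tau + 1 readings."""
    recent = energy[-(tau + 1):]
    return sum(recent) / len(recent)


def multi_scale_forecast(
    energy: Sequence[float],
    taus: Sequence[int],
    alphas: Sequence[float],
    periods: Sequence[float],
    betas: Sequence[float],
    t: float,
) -> float:
    """Weighted sum of per-scale predictors plus seasonal terms, as in Eq. (4)."""
    if len(taus) != len(alphas) or len(periods) != len(betas):
        raise ValueError("each scale and each seasonal term needs one weight")
    # First sum: K temporal scales, each with its own window size tau_k.
    trend = sum(a * window_mean(energy, tau) for a, tau in zip(alphas, taus))
    # Second sum: J seasonal components, here assumed sinusoidal.
    seasonal = sum(
        b * math.sin(2.0 * math.pi * t / period)
        for b, period in zip(betas, periods)
    )
    return trend + seasonal
```

In a learned model the \(f_k\) and \(g_j\) would be trained networks rather than fixed averages and sinusoids; only the additive, weighted structure is taken from Eq. (4).
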
The domain adaptation problem can be mathematically formulated as learning a function that performs well across multiple data distributions:

$$\begin{aligned} \mathscr {F}^* = \arg \min _{\mathscr {F}} \sum _{d=1}^{D} w_d \cdot \mathbb {E}_{(\textbf{X},E) \sim \mathscr {P}_d} \left[ \ell (E, \mathscr {F}(\textbf{X})) \right] \end{aligned}$$ (5)

where \(\mathscr {P}_d\) represents the data distribution for domain \(d\), \(w_d\) denotes domain-specific weights, \(D\) is the total number of domains, \(\ell (\cdot , \cdot )\) is a loss function, and \(\mathscr {F}\) is the prediction function that must generalize across domains with potentially different statistical properties.

The comprehensive prediction challenge requires balancing accuracy, computational efficiency, and adaptability to changing conditions. This can be expressed through a weighted multi-objective optimization framework:

$$\begin{aligned} \min _{\theta } \left\{ \mathscr {L}_{pred}(\theta ) + \lambda _1 \mathscr {L}_{adapt}(\theta ) + \lambda _2 \mathscr {C}_{comp}(\theta ) \right\} \end{aligned}$$ (6)

where \(\theta\) represents model parameters, \(\mathscr {L}_{pred}(\theta )\) measures prediction error, \(\mathscr {L}_{adapt}(\theta )\) quantifies adaptation capability across different conditions, \(\mathscr {C}_{comp}(\theta )\) penalizes computational cost, and \(\lambda _1, \lambda _2\) are weights balancing these competing objectives.

Problem 1. Given the mathematical formulation of green building energy consumption complexity outlined above, the research problem can be formally stated as developing an integrated prediction framework that addresses temporal dependencies, climate variability, multi-scale patterns, and cross-domain generalization simultaneously.
The objective is to find an optimal prediction function \(\mathscr {F}^*\) that minimizes the expected prediction error while maintaining adaptability across diverse operational conditions:

$$\begin{aligned} \mathscr {F}^* = \arg \min _{\mathscr {F}} \mathbb {E} \left[ \ell \left( E_{t+\Delta t}, \mathscr {F}(E_{t-\tau :t}, \textbf{W}_{t-\tau :t}, \textbf{D}) \right) \right] + \Omega (\mathscr {F}) \end{aligned}$$ (7)

where \(\textbf{D}\) represents domain-specific characteristics, \(\Omega (\mathscr {F})\) is a regularization term promoting generalization, and the expectation is taken over all building types, climate conditions, and temporal patterns encountered in green building applications.

Motivation for the Sequence-to-Sequence (Seq2Seq) reinforcement learning (RL) model

Traditional methods for predicting building energy consumption focus on static data analysis and ignore the time-series characteristics of dynamic changes in energy consumption [18]. In addition, these methods perform poorly in predicting energy consumption under adverse weather conditions, especially in situations that require adaptation to environmental changes and corresponding adjustment strategies [19, 20].

To overcome these limitations, this study proposes a composite prediction model that integrates Seq2Seq and RL. The Seq2Seq model captures the long-term dependencies of time-series data in order to predict future energy consumption, while the RL component allows the model to learn and adjust its prediction strategy automatically as environmental conditions change.

The integrated framework combines these components for climate-adaptive energy prediction. Figure 2 illustrates how the Seq2Seq and reinforcement learning components work together to process temporal data and adapt to changing environmental conditions.

Fig. 2 Architecture of the composite prediction model integrating Seq2Seq and reinforcement learning.

Mathematical formulation of Seq2Seq-RL framework

The Seq2Seq-RL framework addresses the temporal dependency and environmental adaptation challenges by integrating sequence-to-sequence learning with reinforcement learning policies. This approach establishes the foundation for adaptive energy consumption prediction in green buildings under dynamic climate conditions.

The Seq2Seq encoder-decoder architecture processes historical energy and climate data through bidirectional LSTM networks to capture complex temporal dependencies. The encoder transforms the input sequence into a comprehensive context representation:

$$\begin{aligned} \textbf{h}_t^{enc} = \text {BiLSTM}_{enc}\left( \left[ \begin{array}{c} E_t \\ \textbf{W}_t \\ \textbf{C}_t \end{array}\right] , \textbf{h}_{t-1}^{enc}\right) , \quad \textbf{z}_{context} = \frac{1}{\tau }\sum _{t=1}^{\tau } \tanh \left( \textbf{W}_{ctx}\textbf{h}_t^{enc} + \textbf{b}_{ctx}\right) \end{aligned}$$ (8)

where \(\textbf{h}_t^{enc} \in \mathbb {R}^{d_{enc}}\) represents the encoder hidden state, \(E_t\) is the energy consumption, \(\textbf{W}_t\) denotes weather conditions, \(\textbf{C}_t\) represents building control states, \(\textbf{W}_{ctx} \in \mathbb {R}^{d_{ctx} \times d_{enc}}\) and \(\textbf{b}_{ctx} \in \mathbb {R}^{d_{ctx}}\) are learned parameters, and \(\textbf{z}_{context}\) provides the compressed context representation for decoding. Here and in the following equations, matrices with a named subscript such as \(\textbf{W}_{ctx}\) are learned weights, whereas \(\textbf{W}_t\) with a time index is the weather vector.

The reinforcement learning component defines the state space, action space, and reward structure specifically tailored for green building energy prediction.
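
The context pooling on the right-hand side of Eq. (8) can be sketched as follows. This is an illustrative sketch under stated assumptions: the encoder hidden states are taken as already computed (the BiLSTM itself is not implemented here), and the function name `context_vector` and the list-of-lists matrix layout are hypothetical choices.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def context_vector(
    hidden_states: Sequence[Vector], w_ctx: Matrix, b_ctx: Vector
) -> Vector:
    """Compute z_context = (1/tau) * sum_t tanh(W_ctx h_t + b_ctx).

    hidden_states holds tau precomputed encoder states (assumed given).
    w_ctx has one row per context dimension; b_ctx has matching length.
    """
    if not hidden_states:
        raise ValueError("at least one encoder state is required")
    tau = len(hidden_states)
    z = [0.0] * len(b_ctx)
    for h in hidden_states:
        for row_idx, (row, bias) in enumerate(zip(w_ctx, b_ctx)):
            pre_activation = sum(w * x for w, x in zip(row, h)) + bias
            # Average the squashed projections over the window.
            z[row_idx] += math.tanh(pre_activation) / tau
    return z
```

Because every time step contributes with equal weight \(1/\tau\), this pooling alone cannot emphasize particular steps; that role is left to the attention mechanism introduced later in the formulation.
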
The state representation integrates temporal patterns, climate conditions, and prediction uncertainty:

$$\begin{aligned} \textbf{s}_t = \left[ \begin{array}{c} \textbf{z}_{context} \\ \sigma (\textbf{W}_{pred}\textbf{h}_t^{dec} + \textbf{b}_{pred}) \\ \text {softmax}(\textbf{W}_{climate}\textbf{W}_t + \textbf{b}_{climate}) \\ \log \left( 1 + \Vert \hat{E}_t - E_t\Vert _2^2\right) \end{array}\right] , \quad \textbf{a}_t = \tanh \left( \textbf{W}_{action}\textbf{s}_t + \textbf{b}_{action}\right) \odot \varvec{\delta }_{bound} \end{aligned}$$ (9)

where \(\textbf{h}_t^{dec}\) is the decoder hidden state, \(\sigma (\cdot )\) represents the sigmoid function, the climate sensitivity term (third block) captures weather impact, and the prediction error term (fourth block) quantifies current performance. The action \(\textbf{a}_t\) is a continuous prediction adjustment: the \(\tanh\) output lies in \([-1,1]^{d_{action}}\) and is scaled elementwise by \(\varvec{\delta }_{bound}\), which enforces action constraints to ensure physically reasonable adjustments.

The reward function encourages accurate predictions while promoting stable learning and climate adaptability.
The comprehensive reward structure incorporates multiple objectives essential for green building energy management:

$$\begin{aligned} r_t = -\alpha _{acc} \cdot \exp \left( \frac{\Vert \hat{E}_t - E_t\Vert _2^2}{\sigma _{energy}^2}\right) + \alpha _{stab} \cdot \exp \left( -\frac{\Vert \textbf{a}_t - \textbf{a}_{t-1}\Vert _2^2}{\sigma _{action}^2}\right) + \alpha _{climate} \cdot \sum _{k=1}^{K} w_k \cos \left( \frac{2\pi t}{T_k}\right) \mathbb {I}_{climate}(k,t) \end{aligned}$$ (10)

where \(\alpha _{acc}, \alpha _{stab}, \alpha _{climate}\) are weighting coefficients, \(\sigma _{energy}^2\) and \(\sigma _{action}^2\) are normalization factors, \(w_k\) represents climate-specific weights, \(T_k\) denotes seasonal periods, and \(\mathbb {I}_{climate}(k,t)\) is an indicator function that activates during specific climate conditions to encourage appropriate responses to weather variations.

The policy network and value function operate in conjunction to optimize prediction strategies through actor-critic learning.
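
The three terms of the reward in Eq. (10) can be sketched directly. This is a minimal illustration for a scalar energy value, so the squared norm reduces to a squared difference. The function name `reward`, the default coefficient values, and the way the indicator \(\mathbb {I}_{climate}(k,t)\) is passed in as a sequence of booleans are all assumptions, not values from the study.

```python
import math
from typing import Sequence


def reward(
    predicted: float,
    actual: float,
    action: Sequence[float],
    prev_action: Sequence[float],
    t: float,
    *,
    alpha_acc: float = 1.0,       # assumed weights, not from the paper
    alpha_stab: float = 0.5,
    alpha_climate: float = 0.1,
    sigma_energy: float = 1.0,
    sigma_action: float = 1.0,
    climate_weights: Sequence[float] = (1.0,),
    periods: Sequence[float] = (24.0,),
    active: Sequence[bool] = (True,),  # indicator I_climate(k, t) per component
) -> float:
    """Evaluate the reward of Eq. (10) for one time step."""
    # Accuracy term: a penalty that grows with the squared error.
    error_sq = (predicted - actual) ** 2
    accuracy = -alpha_acc * math.exp(error_sq / sigma_energy ** 2)
    # Stability term: largest when consecutive actions are identical.
    change_sq = sum((a - b) ** 2 for a, b in zip(action, prev_action))
    stability = alpha_stab * math.exp(-change_sq / sigma_action ** 2)
    # Climate term: seasonal cosines, counted only where the indicator is on.
    climate = alpha_climate * sum(
        w * math.cos(2.0 * math.pi * t / period)
        for w, period, on in zip(climate_weights, periods, active)
        if on
    )
    return accuracy + stability + climate
```

Since the accuracy term uses a positive exponent, it grows very quickly with the error; in practice \(\sigma _{energy}\) has to be chosen on the scale of typical errors, otherwise `math.exp` can overflow for large arguments.
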
The policy network generates probability distributions over continuous action spaces while the value function estimates expected future rewards:

$$\begin{aligned} \pi _{\theta }(\textbf{a}_t|\textbf{s}_t) = \mathscr {N}\left( \varvec{\mu }_{\theta }(\textbf{s}_t), \varvec{\Sigma }_{\theta }(\textbf{s}_t)\right) , \quad V_{\phi }(\textbf{s}_t) = \textbf{w}_v^T \tanh \left( \textbf{W}_v^{(2)} \text {ReLU}\left( \textbf{W}_v^{(1)} \textbf{s}_t + \textbf{b}_v^{(1)}\right) + \textbf{b}_v^{(2)}\right) \end{aligned}$$ (11)

where \(\varvec{\mu }_{\theta }(\textbf{s}_t) = \textbf{W}_{\mu } \text {ReLU}(\textbf{W}_s \textbf{s}_t + \textbf{b}_s) + \textbf{b}_{\mu }\) and \(\varvec{\Sigma }_{\theta }(\textbf{s}_t) = \text {softplus}(\textbf{W}_{\sigma } \text {ReLU}(\textbf{W}_s \textbf{s}_t + \textbf{b}_s) + \textbf{b}_{\sigma })\) parameterize the mean and covariance of the action distribution, \(\theta\) and \(\phi\) represent policy and value network parameters, respectively, and the multi-layer architecture enables complex strategy learning.

Attention mechanisms enhance the framework by dynamically weighting temporal features based on their relevance to current prediction tasks and climate conditions.
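
The policy head of Eq. (11) can be sketched with a shared hidden layer feeding a mean branch and a softplus variance branch. This sketch assumes that \(\varvec{\Sigma }_{\theta }\) is a diagonal covariance, so the softplus output is read as a vector of per-dimension variances; that reading, the function names, and the list-of-lists weight layout are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def affine(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Return W x + b for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def policy_head(
    state: Sequence[float],
    w_s: Matrix, b_s: Vector,
    w_mu: Matrix, b_mu: Vector,
    w_sigma: Matrix, b_sigma: Vector,
) -> Tuple[Vector, Vector]:
    """Mean and per-dimension variance of the Gaussian policy."""
    hidden = [max(0.0, v) for v in affine(w_s, b_s, state)]  # shared ReLU layer
    mean = affine(w_mu, b_mu, hidden)
    variance = [softplus(v) for v in affine(w_sigma, b_sigma, hidden)]
    return mean, variance


def log_prob(action: Sequence[float], mean: Vector, variance: Vector) -> float:
    """Log-density of a diagonal Gaussian evaluated at the given action."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(action, mean, variance)
    )
```

The softplus keeps every variance strictly positive without clipping, which is what makes the log-density well defined for any network output.
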
The climate-aware attention computation integrates weather sensitivity with temporal importance:

$$\begin{aligned} \alpha _{t,i} = \frac{\exp \left( \textbf{v}_{att}^T \tanh \left( \textbf{W}_{att}^{(h)} \textbf{h}_i^{enc} + \textbf{W}_{att}^{(s)} \textbf{s}_t + \textbf{W}_{att}^{(w)} \textbf{W}_i\right) \right) }{\sum _{j=1}^{\tau } \exp \left( \textbf{v}_{att}^T \tanh \left( \textbf{W}_{att}^{(h)} \textbf{h}_j^{enc} + \textbf{W}_{att}^{(s)} \textbf{s}_t + \textbf{W}_{att}^{(w)} \textbf{W}_j\right) \right) } \cdot \left( 1 + \beta _{climate} \Vert \textbf{W}_i - \textbf{W}_t\Vert _2^{-1}\right) \end{aligned}$$ (12)

where \(\alpha _{t,i}\) represents the attention weight for time step \(i\) at current time \(t\), \(\textbf{v}_{att}\), \(\textbf{W}_{att}^{(h)}\), \(\textbf{W}_{att}^{(s)}\), and \(\textbf{W}_{att}^{(w)}\) are learned attention parameters, \(\beta _{climate}\) controls climate sensitivity, and the climate proximity term \(\Vert \textbf{W}_i - \textbf{W}_t\Vert _2^{-1}\) enhances attention to historical periods with similar weather conditions.

The integrated loss function combines sequence prediction accuracy, policy optimization objectives, and regularization terms to ensure stable learning and robust performance across diverse operational conditions:

$$\begin{aligned} & \mathscr {L}_{total} = \mathbb {E}_{t=1}^{T} \left[ \Vert \hat{E}_t - E_t\Vert _2^2 - \lambda _{entropy} \mathscr {H}(\pi _{\theta }(\cdot |\textbf{s}_t))\right] \\ & - \mathbb {E}_{\tau } \left[ \sum _{t=1}^{T} \gamma ^{t-1} r_t \log \pi _{\theta }(\textbf{a}_t|\textbf{s}_t)\right] + \lambda _{value} \mathbb {E}_{t=1}^{T} \left[ \left( V_{\phi }(\textbf{s}_t) - R_t\right) ^2\right] + \lambda _{reg} \left( \Vert \theta \Vert _2^2 + \Vert \phi \Vert _2^2\right) \end{aligned}$$ (13)

where \(\mathscr {H}(\cdot )\) denotes the policy entropy, which enters with a negative sign so that minimizing the loss encourages exploration, \(\gamma\) is the discount factor, \(R_t = \sum _{k=t}^{T} \gamma ^{k-t} r_k\) represents discounted future rewards, \(\lambda _{entropy}\), \(\lambda _{value}\), and \(\lambda _{reg}\) are regularization coefficients, and the expectation over trajectories \(\tau\) (here a sampled trajectory, not the window length used earlier) captures the stochastic nature of the learning process.

The parameter update mechanism employs a KL-penalized form of proximal policy optimization with adaptive learning rates to ensure stable convergence while maintaining prediction quality. The update rules incorporate momentum and gradient clipping for robust optimization:

$$\begin{aligned} \begin{aligned} \textbf{g}_{\theta ,t}&= \nabla _{\theta } \mathscr {L}_{policy}(\theta ) + \lambda _{kl} \nabla _{\theta } D_{KL}(\pi _{\theta _{old}}(\cdot |\textbf{s}_t) \Vert \pi _{\theta }(\cdot |\textbf{s}_t)), \\ \textbf{m}_{\theta ,t}&= \beta _1 \textbf{m}_{\theta ,t-1} + (1-\beta _1) \textbf{g}_{\theta ,t}, \quad \textbf{v}_{\theta ,t} = \beta _2 \textbf{v}_{\theta ,t-1} + (1-\beta _2) \textbf{g}_{\theta ,t}^2, \\ \theta _{t+1}&= \theta _t - \eta _t \frac{\textbf{m}_{\theta ,t}}{\sqrt{\textbf{v}_{\theta ,t}} + \epsilon } \cdot \min \left( 1, \frac{C_{grad}}{\Vert \textbf{g}_{\theta ,t}\Vert _2}\right) \end{aligned} \end{aligned}$$ (14)

where \(D_{KL}\) represents the Kullback-Leibler divergence used as a policy constraint, \(\beta _1, \beta _2\) are the decay rates of the first- and second-moment estimates \(\textbf{m}_{\theta ,t}\) and \(\textbf{v}_{\theta ,t}\), \(\eta _t\) is the adaptive learning rate, \(C_{grad}\) is the gradient clipping threshold, and similar updates apply to the value function parameters \(\phi\) with the appropriate loss gradients.

Theorem 1. Under typical green building operational conditions where energy systems exhibit regular behavioral patterns and environmental monitoring provides reliable data, the Seq2Seq-RL framework demonstrates convergent learning behavior that progressively improves prediction accuracy while maintaining adaptation capability across varying climate conditions.
The practical convergence characteristics can be observed through the stabilization of prediction errors and policy performance over training iterations, as validated in real-world deployment scenarios:

$$\begin{aligned} \lim _{t \rightarrow \infty } \mathbb {E}\left[ |\hat{E}_t - E_t|^2\right] \le \epsilon _{practical} \end{aligned}$$ (15)

where \(\epsilon _{practical}\) represents the achievable prediction accuracy under normal operational conditions, determined by sensor precision, environmental variability, and building system complexity rather than theoretical mathematical constraints.

Corollary 1. For practical deployment in green building energy management systems, the Seq2Seq-RL framework achieves prediction performance that scales favorably with data quality and temporal coverage. The performance characteristics observed in real-world applications can be described by:

$$\begin{aligned} \mathbb {E}\left[ |\hat{E}_{t+1} - E_{t+1}|^2\right] \le C_{base} + \frac{C_{data}}{\text {DataQuality}} + C_{adapt} \exp (-\alpha \cdot \text {TrainingTime}) \end{aligned}$$ (16)

where \(C_{base}\) represents the baseline prediction error achievable with the framework, \(C_{data}\) captures the impact of data availability and sensor accuracy, \(C_{adapt}\) quantifies the error contribution of the initial adaptation period, and \(\alpha > 0\) represents the empirically observed convergence rate across different building environments.

Motivation for combining LSTM, attention mechanism, and transfer learning

Although traditional building energy consumption prediction models can handle a certain range of prediction tasks, their performance is often unsatisfactory when faced with long-term time-series dependencies and complex dynamic environmental changes [21, 22].
In particular, such models struggle to distinguish and focus on the key time points and factors that affect energy consumption, which limits the accuracy and reliability of their predictions [23].

To address these challenges, this study designs a composite model that integrates LSTM, an attention mechanism, and transfer learning. The LSTM captures the time-series characteristics and long-term dependencies of energy consumption, and the attention mechanism automatically highlights the data points that have a significant impact on the prediction, improving prediction accuracy and efficiency. The innovation lies in transfer learning: the model borrows knowledge from other tasks, which enhances its adaptability and generalization to new environments and different building types.

Figure 3 illustrates this composite model, showing the LSTM that models temporal dependencies, the attention mechanism that highlights influential data points, and the transfer learning component that carries knowledge over to new environments and building types.

Fig. 3 Architecture of the composite prediction model integrating LSTM, attention mechanism, and transfer learning.

Mathematical formulation of LSTM-attention-transfer framework

The LSTM-Attention-Transfer framework builds upon the foundation established by the Seq2Seq-RL component to address cross-domain generalization and enhanced feature extraction challenges.
This framework leverages the initial predictions from the first stage while incorporating sophisticated attention mechanisms and transfer learning strategies to optimize prediction performance across diverse building types and climate zones.

The enhanced LSTM architecture incorporates climate-aware gating mechanisms that dynamically adjust information flow based on environmental conditions and building characteristics. The gate computations integrate multiple information sources to capture complex temporal dependencies:

$$\begin{aligned} \begin{aligned} \textbf{f}_t&= \sigma \left( \textbf{W}_f \left[ \begin{array}{c} \textbf{h}_{t-1} \\ \textbf{x}_t \\ \hat{E}_{t}^{stage1} \end{array}\right] + \textbf{U}_f \textbf{W}_t + \textbf{V}_f \tanh (\textbf{W}_{climate} \textbf{W}_t) + \textbf{b}_f\right) , \\ \textbf{i}_t&= \sigma \left( \textbf{W}_i \left[ \begin{array}{c} \textbf{h}_{t-1} \\ \textbf{x}_t \\ \hat{E}_{t}^{stage1} \end{array}\right] + \textbf{U}_i \textbf{W}_t + \textbf{V}_i \text {softmax}(\textbf{W}_{domain} \textbf{d}_t) + \textbf{b}_i\right) , \\ \tilde{\textbf{c}}_t&= \tanh \left( \textbf{W}_c \left[ \begin{array}{c} \textbf{h}_{t-1} \\ \textbf{x}_t \\ \hat{E}_{t}^{stage1} \end{array}\right] + \textbf{U}_c \textbf{W}_t + \textbf{V}_c \textbf{r}_t^{transfer} + \textbf{b}_c\right) , \\ \textbf{c}_t&= \textbf{f}_t \odot \textbf{c}_{t-1} + \textbf{i}_t \odot \tilde{\textbf{c}}_t + \alpha _{residual} \textbf{W}_{res} \hat{E}_{t}^{stage1} \end{aligned} \end{aligned}$$ (17)

where \(\hat{E}_{t}^{stage1}\) represents predictions from the first stage, \(\textbf{W}_t\) denotes weather conditions, \(\textbf{d}_t\) represents domain-specific features, \(\textbf{r}_t^{transfer}\) contains transferred knowledge representations, and \(\alpha _{residual}\) controls the residual connection strength to integrate first-stage predictions.

Multi-head attention mechanisms capture complex relationships between temporal features, climate patterns, and domain characteristics while maintaining interpretability for practical deployment. The attention computation incorporates domain-aware scaling and climate sensitivity:

$$\begin{aligned} \begin{aligned} \text {MultiHead}(\textbf{Q}, \textbf{K}, \textbf{V})&= \text {Concat}(\text {head}_1, \ldots , \text {head}_h) \textbf{W}^O, \\ \text {head}_i&= \text {Attention}(\textbf{Q}\textbf{W}_i^Q, \textbf{K}\textbf{W}_i^K, \textbf{V}\textbf{W}_i^V), \\ \text {Attention}(\textbf{Q}, \textbf{K}, \textbf{V})&= \text {softmax}\left( \frac{\textbf{Q}\textbf{K}^T}{\sqrt{d_k}} \odot \textbf{M}_{climate} + \textbf{B}_{domain}\right) \textbf{V}, \\ \textbf{M}_{climate}[i,j]&= \exp \left( -\gamma _{climate} \Vert \textbf{W}_i - \textbf{W}_j\Vert _2^2\right) , \\ \textbf{B}_{domain}[i,j]&= \beta _{domain} \cdot \text {sim}(\textbf{d}_i, \textbf{d}_j) \cdot \log (1 + |t_i - t_j|) \end{aligned} \end{aligned}$$ (18)

where \(h\) is the number of attention heads, \(\textbf{M}_{climate}\) provides climate-based attention modulation, \(\textbf{B}_{domain}\) incorporates domain similarity and temporal distance, \(\gamma _{climate}\) and \(\beta _{domain}\) are scaling parameters, and \(\text {sim}(\cdot , \cdot )\) computes domain similarity using learned embeddings.

Transfer learning mechanisms enable knowledge sharing across different building types and geographic locations through domain adaptation and feature alignment.
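
The single-head attention of Eq. (18), with its multiplicative climate mask and additive domain bias, can be sketched as below. This is an illustrative sketch for self-attention over one sequence, so queries, keys, values, and weather vectors are assumed to have the same length. The function names are hypothetical, and the domain bias \(\textbf{B}_{domain}\) is passed in as an optional precomputed matrix rather than derived from learned embeddings.

```python
import math
from typing import List, Optional, Sequence

Matrix = List[List[float]]


def climate_mask(weather: Sequence[Sequence[float]], gamma: float) -> Matrix:
    """M[i][j] = exp(-gamma * ||W_i - W_j||_2^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(w_i, w_j)))
         for w_j in weather]
        for w_i in weather
    ]


def softmax(row: Sequence[float]) -> List[float]:
    """Softmax with the usual max-shift for numerical stability."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def climate_attention(
    q: Matrix, k: Matrix, v: Matrix,
    weather: Sequence[Sequence[float]],
    gamma: float,
    bias: Optional[Matrix] = None,  # stands in for B_domain; zero if omitted
) -> Matrix:
    """softmax((Q K^T / sqrt(d_k)) * M_climate + B_domain) V for one head."""
    d_k = len(k[0])
    mask = climate_mask(weather, gamma)
    output = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            scaled_dot = sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
            b_ij = bias[i][j] if bias is not None else 0.0
            scores.append(scaled_dot * mask[i][j] + b_ij)
        weights = softmax(scores)
        output.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]
        )
    return output
```

Because the mask multiplies the scores before the softmax, pairs of time steps with very different weather have their scores pulled toward zero rather than toward minus infinity, so they are de-emphasized but never fully excluded.
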
The framework employs adversarial domain adaptation with gradient reversal to learn domain-invariant representations:

$$\begin{aligned} \begin{aligned} \textbf{z}_{shared}&= \text {Encoder}_{shared}(\textbf{h}_t^{LSTM}, \textbf{att}_t), \\ \textbf{z}_{source}&= \text {Encoder}_{source}(\textbf{z}_{shared}) + \lambda _{residual} \textbf{W}_{src} \hat{E}_{t}^{stage1}, \\ \textbf{z}_{target}&= \text {Encoder}_{target}(\textbf{z}_{shared}) + \lambda _{residual} \textbf{W}_{tgt} \hat{E}_{t}^{stage1}, \\ \mathscr {L}_{domain}&= -\sum _{i=1}^{N_{src}} \log D(\textbf{z}_{shared}^{(i)}) - \sum _{j=1}^{N_{tgt}} \log (1 - D(\textbf{z}_{shared}^{(j)})) + \lambda _{gradient} \Vert \nabla _{\textbf{z}_{shared}} D(\textbf{z}_{shared})\Vert _2^2, \\ \mathscr {L}_{alignment}&= \text {MMD}(p(\textbf{z}_{source}), p(\textbf{z}_{target})) + \alpha _{wasserstein} \cdot W_2(p(\textbf{z}_{source}), p(\textbf{z}_{target})) \end{aligned} \end{aligned}$$ (19)

where \(D(\cdot )\) is the domain discriminator, \(\text {MMD}(\cdot , \cdot )\) represents Maximum Mean Discrepancy, \(W_2(\cdot , \cdot )\) denotes the 2-Wasserstein distance, \(N_{src}\) and \(N_{tgt}\) are source and target domain sample sizes, and \(\lambda _{gradient}\), \(\alpha _{wasserstein}\) control regularization strength.

Climate-adaptive feature extraction incorporates seasonal patterns, weather variability, and long-term climate trends to enhance prediction robustness.
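
The MMD term in the alignment loss of Eq. (19) can be estimated from samples. The sketch below uses the standard biased estimator of the squared MMD with a Gaussian (RBF) kernel. The kernel choice, the `bandwidth` parameter, and the function names are assumptions, since the kernel used in the study is not specified here.

```python
import math
from typing import Sequence

Sample = Sequence[float]


def rbf_kernel(x: Sample, y: Sample, bandwidth: float) -> float:
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(
    source: Sequence[Sample], target: Sequence[Sample], bandwidth: float = 1.0
) -> float:
    """Biased sample estimate of the squared MMD between two feature sets."""
    if not source or not target:
        raise ValueError("both sample sets must be non-empty")

    def mean_kernel(xs: Sequence[Sample], ys: Sequence[Sample]) -> float:
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    # Within-source similarity + within-target similarity - 2 * cross similarity.
    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )
```

The estimate is zero when the two sample sets coincide and approaches 2 for an RBF kernel when they are far apart relative to the bandwidth, which makes it usable as a differentiable alignment penalty between \(\textbf{z}_{source}\) and \(\textbf{z}_{target}\).
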
The feature extraction mechanism combines multiple temporal scales and climate-specific representations:

$$\begin{aligned} \begin{aligned} \textbf{f}_{climate}^{(s)}&= \sum _{k=1}^{K} \alpha _k^{(s)} \cdot \text {Conv1D}_k\left( \textbf{W}_{1:T}, \text {kernel}_k\right) \odot \text {sigmoid}(\textbf{W}_{season} \textbf{s}_t), \\ \textbf{f}_{climate}^{(l)}&= \text {GRU}\left( \textbf{f}_{climate}^{(s)}, \textbf{h}_{climate,t-1}\right) + \sum _{j=1}^{J} \beta _j \cdot \text {DilatedConv}_j(\textbf{W}_{1:T}, \text {dilation}_j), \\ \textbf{f}_{integrated}&= \text {LayerNorm}\left( \textbf{f}_{climate}^{(l)} + \textbf{W}_{proj} \left[ \begin{array}{c} \textbf{z}_{shared} \\ \hat{E}_{t}^{stage1} \\ \text {FFT}(\textbf{W}_{1:T}) \end{array}\right] \right) , \\ \textbf{f}_{final}&= \text {MLP}\left( \textbf{f}_{integrated}\right) + \gamma _{skip} \cdot \textbf{W}_{skip} \hat{E}_{t}^{stage1} \end{aligned} \end{aligned}$$ (20)

where \(\alpha _k^{(s)}\) and \(\beta _j\) are learnable coefficients, \(\textbf{s}_t\) represents seasonal indicators, \(\text {Conv1D}_k\) and \(\text {DilatedConv}_j\) capture different temporal patterns, \(\text {FFT}(\cdot )\) provides frequency domain features, and \(\gamma _{skip}\) controls skip connection strength.

Cross-domain knowledge distillation facilitates effective transfer of prediction strategies while preserving domain-specific characteristics.
The distillation process incorporates attention-guided knowledge transfer and uncertainty-aware weighting:

$$\begin{aligned} \begin{aligned} \mathscr {L}_{distill}&= \sum _{t=1}^{T} \sum _{d=1}^{D} w_d^{(t)} \cdot \text {KL}\left( \text {softmax}\left( \frac{\textbf{z}_{teacher}^{(d)}}{T_{temp}}\right) , \text {softmax}\left( \frac{\textbf{z}_{student}^{(d)}}{T_{temp}}\right) \right) , \\ w_d^{(t)}&= \frac{\exp \left( -\mathscr {U}_d^{(t)} / \tau _{uncertainty}\right) }{\sum _{d'=1}^{D} \exp \left( -\mathscr {U}_{d'}^{(t)} / \tau _{uncertainty}\right) } \cdot \left( 1 + \text {sim}(\textbf{d}_{target}, \textbf{d}_d)\right) , \\ \mathscr {U}_d^{(t)}&= \frac{1}{M} \sum _{m=1}^{M} \left( \hat{E}_{t,m}^{(d)} - \bar{E}_t^{(d)}\right) ^2, \quad \bar{E}_t^{(d)} = \frac{1}{M} \sum _{m=1}^{M} \hat{E}_{t,m}^{(d)}, \\ \mathscr {L}_{consistency}&= \sum _{t=1}^{T} \left\| \text {Attention}(\textbf{f}_{final}^{teacher}) - \text {Attention}(\textbf{f}_{final}^{student})\right\| _F^2 \end{aligned} \end{aligned}$$ (21)

where \(T_{temp}\) is the temperature parameter for softmax, \(\mathscr {U}_d^{(t)}\) represents prediction uncertainty, \(M\) is the number of Monte Carlo samples, \(\tau _{uncertainty}\) controls uncertainty sensitivity, and \(\Vert \cdot \Vert _F\) denotes the Frobenius norm.

The comprehensive loss function integrates prediction accuracy, domain adaptation, attention consistency, and regularization objectives to ensure robust performance across diverse operational scenarios:

$$\begin{aligned} \begin{aligned} \mathscr {L}_{total}&= \mathscr {L}_{prediction} + \lambda _{domain} \mathscr {L}_{domain} + \lambda _{distill} \mathscr {L}_{distill} + \lambda _{consistency} \mathscr {L}_{consistency} + \lambda _{reg} \mathscr {L}_{regularization}, \\ \mathscr {L}_{prediction}&= \sum _{t=1}^{T} \left[ \Vert \hat{E}_t^{final} - E_t\Vert _2^2 + \alpha _{stage1} \Vert \hat{E}_t^{final} - \hat{E}_t^{stage1}\Vert _2^2\right] \cdot \exp \left( -\frac{\Vert \textbf{W}_t - \overline{\textbf{W}}\Vert _2^2}{2\sigma _{climate}^2}\right) , \\ \mathscr {L}_{regularization}&= \sum _{i} \Vert \textbf{W}_i\Vert _2^2 + \beta _{entropy} \sum _{t} \mathscr {H}(\text {Attention}(\textbf{f}_{final})) + \beta _{smooth} \sum _{t} \Vert \hat{E}_{t+1}^{final} - \hat{E}_t^{final}\Vert _2^2 \end{aligned} \end{aligned}$$ (22)

where \(\hat{E}_t^{final}\) represents the final prediction, \(\overline{\textbf{W}}\) is the mean weather condition, \(\sigma _{climate}^2\) controls climate sensitivity, \(\mathscr {H}(\cdot )\) denotes entropy for attention diversity, and \(\beta _{entropy}\), \(\beta _{smooth}\) are regularization coefficients.

Adaptive optimization strategies incorporate momentum-based updates with climate-aware learning rate scheduling and gradient stabilization to ensure convergence across different domains and seasons:

$$\begin{aligned} \begin{aligned} \textbf{g}_t&= \nabla _{\theta } \mathscr {L}_{total} + \lambda _{climate} \nabla _{\theta } \mathscr {R}_{climate}(\theta , \textbf{W}_t), \\ \textbf{m}_t&= \beta _1 \textbf{m}_{t-1} + (1-\beta _1) \textbf{g}_t, \quad \textbf{v}_t = \beta _2 \textbf{v}_{t-1} + (1-\beta _2) \textbf{g}_t^2, \\ \eta _t^{adaptive}&= \eta _0 \cdot \sqrt{\frac{1 + \cos (\pi \cdot \text {epoch} / \text {max\_epochs})}{2}} \cdot \left( 1 + \alpha _{climate} \Vert \textbf{W}_t - \textbf{W}_{t-1}\Vert _2\right) ^{-1}, \\ \theta _{t+1}&= \theta _t - \eta _t^{adaptive} \frac{\textbf{m}_t}{\sqrt{\textbf{v}_t} + \epsilon } \cdot \text {clip}\left( \frac{\textbf{g}_t}{\Vert \textbf{g}_t\Vert _2}, -C_{clip}, C_{clip}\right) , \\ \mathscr {R}_{climate}(\theta , \textbf{W}_t)&= \sum _{k} \exp \left( -\gamma _k \Vert \textbf{W}_t - \textbf{W}_{ref,k}\Vert _2^2\right) \cdot \Vert \theta _k - \theta _{ref,k}\Vert _2^2 \end{aligned} \end{aligned}$$ (23)

where \(\mathscr {R}_{climate}\) provides climate-specific regularization, \(\eta _0\) is the base learning rate, \(\alpha _{climate}\) controls climate adaptation, \(C_{clip}\) is the gradient clipping threshold, and \(\theta _{ref,k}\), \(\textbf{W}_{ref,k}\) represent reference parameters and weather conditions for different climate zones.

Theorem 2. Under standard green building operational scenarios with adequate source domain data, the LSTM-Attention-Transfer framework achieves effective domain adaptation with prediction performance that benefits from cross-domain knowledge transfer. The adaptation effectiveness can be characterized by consistent performance improvements across different building types and climate zones, as demonstrated through comprehensive empirical validation:

$$\begin{aligned} \mathbb {E}_{target}\left[ |\hat{E}_t^{final} - E_t|^2\right] \le \min _{d \in \{1,\ldots ,D\}} \mathbb {E}_{source,d}\left[ |\hat{E}_t^{stage1} - E_t|^2\right] + \Delta _{transfer} + \epsilon _{domain} \end{aligned}$$ (24)

where \(\Delta _{transfer}\) represents the adaptation overhead that diminishes with sufficient target domain exposure, \(\epsilon _{domain}\) captures the inherent prediction challenge due to domain differences, and the inequality reflects the practical performance bounds observed in real-world transfer scenarios.

Corollary 2. For practical deployment scenarios with limited target domain data, the framework maintains robust prediction accuracy through effective knowledge transfer, with performance characteristics given by:

$$\begin{aligned} \mathbb {E}_{target}\left[ |\hat{E}_t^{final} - E_t|^2\right] \le \mathbb {E}_{target}\left[ |\hat{E}_t^{stage1} - E_t|^2\right] - \Delta _{improvement} + O\left( \frac{1}{\sqrt{N_{target}}}\right) \end{aligned}$$ (25)

where \(\Delta _{improvement} > 0\) quantifies the empirically observed enhancement from the second-stage processing, and the \(O(1/\sqrt{N_{target}})\) term reflects the standard statistical learning behavior with respect to target domain sample size, consistent with practical machine learning deployment patterns.
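
The climate-aware learning rate \(\eta _t^{adaptive}\) in Eq. (23) can be sketched as a small standalone function. This is an illustration only: the function name, argument names, and any numeric values used with it are assumptions, and the rest of the update rule (moment estimates and clipping) is not reproduced.

```python
import math
from typing import Sequence


def adaptive_learning_rate(
    eta0: float,
    epoch: int,
    max_epochs: int,
    weather_now: Sequence[float],
    weather_prev: Sequence[float],
    alpha_climate: float,
) -> float:
    """Cosine-annealed base rate, damped when the weather changes abruptly."""
    if max_epochs <= 0:
        raise ValueError("max_epochs must be positive")
    # sqrt((1 + cos(pi * epoch / max_epochs)) / 2): 1 at epoch 0, 0 at the end.
    half_cosine = (1.0 + math.cos(math.pi * epoch / max_epochs)) / 2.0
    annealing = math.sqrt(max(0.0, half_cosine))
    # ||W_t - W_{t-1}||_2: size of the step-to-step weather change.
    weather_change = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(weather_now, weather_prev))
    )
    return eta0 * annealing / (1.0 + alpha_climate * weather_change)
```

With unchanged weather the schedule reduces to the annealing factor alone, and a sudden weather shift shrinks the step size, which matches the stated aim of stabilizing updates across seasons.
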