This repository contains an analysis and forecast of global temperature anomalies based on the HadCRUT5 dataset. The project examines global warming by developing two hybrid time series models that describe past trends and predict future temperature levels, including forecasts for 2030, 2050, and beyond.
The entire workflow, from data exploration to model evaluation and visualization, is implemented in Python as a step-by-step analytical pipeline.
- Investigate Global Warming: To analyze historical temperature data and determine if it supports the narrative of global warming.
- Advanced Hybrid Modeling: To build and compare two distinct hybrid models for time series forecasting:
  - Model 1: A SARIMA + XGBoost hybrid model.
  - Model 2: An STL Decomposition + LSTM hybrid model.
- Future Temperature Prediction: To use both models to forecast global average temperatures for 2030 and 2050, and to estimate when the global average temperature might reach a critical threshold of 20.00°C.
- Comprehensive Visualization: To generate high-quality (600 DPI), intuitive visualizations for every stage of the process, from EDA to final model comparison and long-term forecasts.
- Model Evaluation: To rigorously assess and compare the performance of both models to determine which is more accurate and reliable for this task.
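The model comparison described in the last objective can be sketched as follows. This is a minimal illustration, assuming RMSE and MAE as the error metrics (the README does not name the metrics used); the helper `compare_models`, the model names, and the numbers are hypothetical demo values, not project results.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def compare_models(y_true, forecasts):
    """Score every forecast in `forecasts` (name -> array) against y_true."""
    return {
        name: {"rmse": rmse(y_true, pred), "mae": mae(y_true, pred)}
        for name, pred in forecasts.items()
    }


# Synthetic test-set values and forecasts, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
forecasts = {
    "model_a": np.array([1.0, 2.0, 3.0, 5.0]),
    "model_b": np.array([2.0, 3.0, 4.0, 5.0]),
}
scores = compare_models(y_true, forecasts)

# Pick the model with the lowest RMSE on the held-out data.
best = min(scores, key=lambda name: scores[name]["rmse"])
print(scores, best)
```

Both models should be scored on the same held-out test period so that the comparison is like for like.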
The project is structured as a sequential pipeline, with each script building upon the results of the previous one.
This initial stage focuses on understanding and preparing the data for modeling.
- Data Loading & Validation: Loads the CSV, parses time-series data, and performs initial integrity checks.
- Missing Value Imputation: Detects and imputes missing values using linear interpolation, with visualizations to compare the before-and-after series.
- STL Decomposition: Decomposes the time series into its Trend, Seasonal, and Residual components using Seasonal-Trend-Loess (STL) decomposition for deeper insight.
- Stationarity Analysis: Conducts rigorous stationarity tests (ADF, KPSS, Phillips-Perron) on the original series, its components, and the differenced series.
- ACF/PACF Analysis: Plots Autocorrelation and Partial Autocorrelation functions to help identify potential parameters for time series models.
- Data Splitting & Scaling: Splits the data into training and testing sets and applies appropriate scaling (MinMaxScaler, RobustScaler) to different components in preparation for modeling.
- Saving Artifacts: All processed data, scalers, and metadata are saved using joblib for use in the modeling scripts.
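The imputation, splitting, scaling, and saving steps above might look roughly like the sketch below. It is not the project's actual code: the function `preprocess`, the 20% test fraction, the file name `scaler_demo.joblib`, and the toy series are all assumptions made for illustration. The point it shows is that the split is chronological and the scaler is fitted on the training portion only, so no information from the test period leaks into preprocessing.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(series: pd.Series, test_fraction: float = 0.2):
    """Interpolate gaps, split chronologically, scale with training data only."""
    # Linear interpolation fills interior gaps; limit_direction="both"
    # also fills gaps at either end of the series.
    filled = series.interpolate(method="linear", limit_direction="both")

    # Chronological split: the most recent observations form the test set.
    split = len(filled) - int(round(len(filled) * test_fraction))
    train, test = filled.iloc[:split], filled.iloc[split:]

    # Fit the scaler on the training data, then reuse it for the test data.
    scaler = MinMaxScaler()
    train_scaled = scaler.fit_transform(train.to_numpy().reshape(-1, 1)).ravel()
    test_scaled = scaler.transform(test.to_numpy().reshape(-1, 1)).ravel()
    return filled, train_scaled, test_scaled, scaler


# Toy monthly series with two missing values (not HadCRUT5 data).
idx = pd.date_range("2000-01-01", periods=10, freq="MS")
demo = pd.Series([0.0, 0.1, np.nan, 0.3, 0.4, np.nan, 0.6, 0.7, 0.8, 0.9], index=idx)
filled, train_scaled, test_scaled, scaler = preprocess(demo)

# Persist the fitted scaler so a later script can invert the scaling.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scaler_demo.joblib"  # hypothetical file name
    joblib.dump(scaler, path)
    restored = joblib.load(path)
```

Because the scaler only sees the training range, scaled test values can fall outside [0, 1] when the series keeps rising. That is expected for a trending series.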
This model combines a classic statistical model with a powerful gradient boosting machine.
- SARIMA Modeling: pmdarima.auto_arima is used to find the optimal parameters for a SARIMA model, which is then fitted on the training data to capture the primary linear trends and seasonality.
- Residual Modeling with XGBoost: An XGBoost model is trained on the residuals of the SARIMA model. This allows it to capture the complex, non-linear patterns that SARIMA missed.
- Advanced Feature Engineering: A rich set of features is created for XGBoost, including lags, rolling statistics, calendar features, and Fourier terms.
- Hyperparameter Tuning: Optuna is used for Bayesian optimization to find the best hyperparameters for the XGBoost model through time-series cross-validation.
- Model Explainability: SHAP (SHapley Additive exPlanations) is used to analyze feature importance and understand the XGBoost model's predictions.
- Uncertainty Quantification: A Block Bootstrap method is implemented to generate robust 95% prediction intervals for the hybrid forecast.
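The block bootstrap step can be illustrated with the sketch below. The README does not say which variant is used, so this assumes a moving block bootstrap of in-sample residuals; the function name `block_bootstrap_intervals`, the block size of 12, and the synthetic inputs are hypothetical. Contiguous residual blocks are resampled so that short-range autocorrelation in the errors is preserved, then added to the point forecast to form simulated paths.

```python
import numpy as np


def block_bootstrap_intervals(point_forecast, residuals, block_size=12,
                              n_boot=1000, alpha=0.05, seed=0):
    """Prediction intervals from a moving block bootstrap of residuals."""
    rng = np.random.default_rng(seed)
    point_forecast = np.asarray(point_forecast, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    horizon = len(point_forecast)
    if not 1 <= block_size <= len(residuals):
        raise ValueError("block_size must be between 1 and len(residuals)")

    n_blocks = -(-horizon // block_size)      # ceiling division
    max_start = len(residuals) - block_size   # last valid block start
    paths = np.empty((n_boot, horizon))
    for b in range(n_boot):
        # Draw block start positions with replacement, stitch the blocks
        # together, and trim the result to the forecast horizon.
        starts = rng.integers(0, max_start + 1, size=n_blocks)
        noise = np.concatenate([residuals[s:s + block_size] for s in starts])
        paths[b] = point_forecast + noise[:horizon]

    # Pointwise quantiles across the simulated paths give the interval.
    lower = np.quantile(paths, alpha / 2, axis=0)
    upper = np.quantile(paths, 1 - alpha / 2, axis=0)
    return lower, upper


# Synthetic inputs for illustration only.
demo_rng = np.random.default_rng(42)
residuals = demo_rng.normal(0.0, 0.1, size=120)
point_forecast = np.linspace(1.0, 1.5, 24)
lower, upper = block_bootstrap_intervals(point_forecast, residuals)
```

This simple version reflects residual variability only. It does not account for parameter uncertainty or for errors that grow with the forecast horizon.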
This model leverages deep learning for forecasting the decomposed components.
- Decomposition-First Approach: The model works on the Trend and Seasonal components generated by the STL decomposition in the first script.
- LSTM for Trend Forecasting: A Long Short-Term Memory (LSTM) neural network is trained specifically to model the long-term trend component of the time series.
- Hyperparameter Tuning: Optuna is used with TimeSeriesSplit to find the optimal architecture and training parameters for the LSTM model.
- Naive Seasonal Forecasting: The seasonal component is forecast by repeating the last observed seasonal cycle.
- Hybrid Assembly: The final forecast is a combination of the LSTM trend forecast and the naive seasonal forecast.
- Uncertainty Quantification: A Bootstrap method is applied to the training residuals of the final hybrid model to estimate prediction intervals.
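The naive seasonal forecast and the hybrid assembly can be sketched as below. The function name `naive_seasonal_forecast` and the toy numbers are hypothetical, and the trend forecast here is a plain placeholder array standing in for the LSTM output, which is not reproduced. The sketch assumes the forecast starts immediately after the last observation, so tiling the final cycle keeps the seasonal phase aligned.

```python
import numpy as np


def naive_seasonal_forecast(seasonal, period, horizon):
    """Repeat the last observed seasonal cycle over the forecast horizon."""
    seasonal = np.asarray(seasonal, dtype=float)
    if len(seasonal) < period:
        raise ValueError("need at least one full seasonal cycle")
    last_cycle = seasonal[-period:]
    reps = -(-horizon // period)  # ceiling division
    return np.tile(last_cycle, reps)[:horizon]


# Toy seasonal component with a period of 4 (illustration only).
seasonal_history = np.tile([0.25, -0.5, 0.5, -0.25], 3)
seasonal_forecast = naive_seasonal_forecast(seasonal_history, period=4, horizon=6)

# Placeholder for the trend forecast; in the project this would come
# from the trained LSTM rather than a fixed array.
trend_forecast = 1.0 + 0.5 * np.arange(6)

# Hybrid assembly: additive combination of trend and seasonal forecasts.
hybrid_forecast = trend_forecast + seasonal_forecast
print(hybrid_forecast)
```

For monthly data the period would be 12. The additive combination mirrors the additive structure of an STL decomposition, with the residual component left out of the forecast.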
The project produces a rich set of high-quality plots to document the entire analysis, including:
- Global Temperature Anomaly Time Series with Confidence Intervals
- Missing Value Matrix and Imputation Comparison
- STL Decomposition Plots (Trend, Seasonal, Residual)
- Differenced Time Series Plots
- ACF & PACF Plots for Stationarity Analysis
- Train/Test Split Visualization
- SARIMA & XGBoost Model Fit on Training Data
- Comprehensive Residual Analysis Plots (Time Series, ACF/PACF, Histogram, Q-Q Plot)
- XGBoost Feature Importance & SHAP Summary Plots
- Hybrid Model Forecasts vs. Actuals with Prediction Intervals
- Final Long-term Forecast Plot (up to 2055) Showing the 20°C Threshold Crossing
- Model Performance Comparison Bar Chart
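The threshold crossing marked on the long-term forecast plot can be located with a few lines of pandas. This is a sketch under stated assumptions: the helper `first_crossing` is hypothetical, the forecast series is synthetic, and the values are assumed to already be on the same scale as the threshold (absolute temperature in °C, not anomalies).

```python
import numpy as np
import pandas as pd


def first_crossing(forecast: pd.Series, threshold: float):
    """Return the first index label where the forecast reaches the threshold."""
    above = forecast[forecast >= threshold]
    return above.index[0] if not above.empty else None


# Synthetic yearly forecast rising by 0.0625 per year (illustration only).
years = pd.date_range("2025-01-01", periods=31, freq="YS")
demo_forecast = pd.Series(19.0 + 0.0625 * np.arange(31), index=years)

crossing = first_crossing(demo_forecast, threshold=20.0)
never = first_crossing(demo_forecast, threshold=25.0)
print(crossing, never)
```

Returning `None` when the threshold is never reached keeps a missing crossing distinct from a crossing on the first forecast date.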
- Data Handling & Numerics: Pandas, NumPy
- Time Series Analysis: Statsmodels, pmdarima, arch
- Machine Learning: Scikit-learn, XGBoost
- Deep Learning: TensorFlow 2.x (Keras)
- Hyperparameter Optimization: Optuna
- Model Explainability: SHAP
- Visualization: Matplotlib, Seaborn
- Serialization: Joblib
The project is designed to be run as a sequence of scripts.
- Python 3.10 or higher.
- A virtual environment is highly recommended.
- Clone the repository:

      git clone https://github.com/DokiforU/Temperature_Predict.git
      cd Temperature_Predict

- Create and activate a virtual environment:

      # On macOS/Linux
      python3 -m venv venv
      source venv/bin/activate

      # On Windows
      python -m venv venv
      .\venv\Scripts\activate

- Generate and install dependencies: The first command records the packages of an environment in which the scripts have already run once locally; the second installs them from that file.

      pip freeze > requirements.txt
      pip install -r requirements.txt

You must run the scripts in the following order, as they depend on each other's output.
- Run the EDA and preprocessing script first:

      python eda_preprocessing.py

  This will generate the processed_data/ directory with the necessary files for the models.

- Run the first model:

      python model_1_sarima_xgboost.py

  This will train the SARIMA-XGBoost model and save all its outputs and plots to the model_1_output/ directory.

- Run the second model:

      python model_2_stl_lstm.py

  This will train the STL-LSTM model and save all its outputs and plots to the model_2_output/ directory.

All generated artifacts, including visualizations (as PNG files), trained models, scalers, and final prediction data (as .joblib files), will be saved into their respective output directories (model_1_output/, model_2_output/). The terminal logs will provide detailed, real-time information about each step of the process.
This project is distributed under the MIT License.