DokiforU/Temperature_Predict

Global Temperature Prediction with Hybrid Time Series Models

This repository contains a comprehensive analysis and forecasting of global temperature anomalies using the HadCRUT5 dataset. The project addresses the critical question of global warming by developing two sophisticated, integrated time series models to describe past trends and predict future temperature levels, including forecasts for 2030, 2050, and beyond.

The entire workflow, from data exploration to model evaluation and visualization, is meticulously implemented in Python, emphasizing advanced techniques and a rigorous, step-by-step analytical process.


🎯 Project Goals

  1. Investigate Global Warming: To analyze historical temperature data and determine if it supports the narrative of global warming.
  2. Advanced Hybrid Modeling: To build and compare two distinct, state-of-the-art hybrid models for time series forecasting:
    • Model 1: A SARIMA + XGBoost hybrid model.
    • Model 2: An STL Decomposition + LSTM hybrid model.
  3. Future Temperature Prediction: To use both models to forecast global average temperatures for 2030 and 2050, and to estimate when the global average temperature might reach a critical threshold of 20.00°C.
  4. Comprehensive Visualization: To generate high-quality (600 DPI), intuitive visualizations for every stage of the process, from EDA to final model comparison and long-term forecasts.
  5. Model Evaluation: To rigorously assess and compare the performance of both models to determine which is more accurate and reliable for this task.
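Goals 3 and 5 can be illustrated with a minimal sketch. The function names (`rmse`, `mae`, `first_crossing_year`) and the toy numbers are illustrative assumptions, not taken from the project scripts, and the README does not state which error metrics the scripts actually report.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def first_crossing_year(years, forecast, threshold=20.0):
    """Return the first year whose forecast reaches the threshold, or None."""
    forecast = np.asarray(forecast, dtype=float)
    hits = np.flatnonzero(forecast >= threshold)
    if hits.size == 0:
        return None
    return int(np.asarray(years)[hits[0]])


# Toy numbers for illustration only (not project results).
years = np.arange(2025, 2031)
forecast = np.array([19.2, 19.5, 19.8, 20.1, 19.9, 20.3])
crossing = first_crossing_year(years, forecast)
```

Returning the first crossing rather than the last keeps the answer well defined when a noisy forecast dips back below the threshold after reaching it.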

🔬 Methodology & Pipeline

The project is structured as a sequential pipeline, with each script building upon the results of the previous one.

1. Exploratory Data Analysis & Preprocessing (eda_preprocessing.py)

This initial stage focuses on understanding and preparing the data for modeling.

  • Data Loading & Validation: Loads the CSV, parses time-series data, and performs initial integrity checks.
  • Missing Value Imputation: Detects and imputes missing values using linear interpolation, with visualizations to compare the before-and-after series.
  • STL Decomposition: Decomposes the time series into Trend, Seasonal, and Residual components using STL (Seasonal-Trend decomposition using Loess) for deeper insight.
  • Stationarity Analysis: Conducts rigorous stationarity tests (ADF, KPSS, Phillips-Perron) on the original series, its components, and the differenced series.
  • ACF/PACF Analysis: Plots Autocorrelation and Partial Autocorrelation functions to help identify potential parameters for time series models.
  • Data Splitting & Scaling: Splits the data into training and testing sets and applies appropriate scaling (MinMaxScaler, RobustScaler) to different components in preparation for modeling.
  • Saving Artifacts: All processed data, scalers, and metadata are saved using joblib for use in the modeling scripts.
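The imputation, differencing, splitting, and scaling steps above can be sketched roughly as follows. This is a simplified illustration on a synthetic monthly series, not the contents of eda_preprocessing.py; the 80/20 split ratio, the variable names, and the data values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic monthly series standing in for the real data (values are made up).
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(120)
series = pd.Series(0.01 * t + 0.2 * np.sin(2 * np.pi * t / 12), index=idx, name="anomaly")
series.iloc[[10, 11, 50]] = np.nan  # inject a few gaps

# 1) Impute missing values by linear interpolation between neighbours.
filled = series.interpolate(method="linear")

# 2) First difference, as used when checking stationarity.
diffed = filled.diff().dropna()

# 3) Chronological split: no shuffling, the test set is strictly later in time.
split = int(len(filled) * 0.8)  # assumed ratio
train, test = filled.iloc[:split], filled.iloc[split:]

# 4) Fit the scaler on the training data only, then reuse it on the test data,
#    so no information from the test period leaks into preprocessing.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train.to_frame())
test_scaled = scaler.transform(test.to_frame())
```

Because the scaler only sees the training range, scaled test values can fall outside [0, 1] when the series keeps trending upward; that is expected and not a bug.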

2. Model 1: Hybrid SARIMA-XGBoost (model_1_sarima_xgboost.py)

This model combines a classic statistical model with a powerful gradient boosting machine.

  • SARIMA Modeling: pmdarima.auto_arima is used to select the SARIMA orders automatically; the selected model is then fitted on the training data to capture the primary linear trends and seasonality.
  • Residual Modeling with XGBoost: An XGBoost model is trained on the residuals of the SARIMA model. This allows it to capture the complex, non-linear patterns that SARIMA missed.
  • Advanced Feature Engineering: A rich set of features is created for XGBoost, including lags, rolling statistics, calendar features, and Fourier terms.
  • Hyperparameter Tuning: Optuna is used for Bayesian optimization to find the best hyperparameters for the XGBoost model through time-series cross-validation.
  • Model Explainability: SHAP (SHapley Additive exPlanations) is used to analyze feature importance and understand the XGBoost model's predictions.
  • Uncertainty Quantification: A Block Bootstrap method is implemented to generate robust 95% prediction intervals for the hybrid forecast.
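As a rough illustration of the feature engineering and uncertainty steps above, the sketch below builds lag, rolling, calendar, and Fourier features for a residual series and computes a simple moving-block-bootstrap interval. The function names, the chosen lags, the block size, and the random residuals are all assumptions made for this example; the script's actual feature set and bootstrap variant may differ, and no SARIMA or XGBoost model is fitted here.

```python
import numpy as np
import pandas as pd


def make_features(resid, lags=(1, 2, 12), window=3, period=12, n_harmonics=2):
    """Lag, rolling, calendar and Fourier features for a monthly residual series."""
    feats = pd.DataFrame(index=resid.index)
    for lag in lags:
        feats[f"lag_{lag}"] = resid.shift(lag)
    # shift(1) first so the rolling mean uses past values only (no target leakage).
    feats[f"roll_mean_{window}"] = resid.shift(1).rolling(window).mean()
    feats["month"] = resid.index.month.to_numpy()
    t = np.arange(len(resid))
    for k in range(1, n_harmonics + 1):
        feats[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        feats[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)
    return feats.dropna()


def block_bootstrap_interval(point_forecast, residuals, block_size=12,
                             n_boot=500, alpha=0.05, seed=0):
    """Prediction interval from contiguous residual blocks added to a point forecast."""
    point = np.asarray(point_forecast, dtype=float)
    res = np.asarray(residuals, dtype=float)
    if len(res) < block_size:
        raise ValueError("need at least block_size residuals")
    rng = np.random.default_rng(seed)
    horizon = len(point)
    n_blocks = int(np.ceil(horizon / block_size))
    paths = np.empty((n_boot, horizon))
    for b in range(n_boot):
        # Sample block start positions; each block keeps its internal autocorrelation.
        starts = rng.integers(0, len(res) - block_size + 1, size=n_blocks)
        noise = np.concatenate([res[s:s + block_size] for s in starts])[:horizon]
        paths[b] = point + noise
    lower = np.quantile(paths, alpha / 2, axis=0)
    upper = np.quantile(paths, 1 - alpha / 2, axis=0)
    return lower, upper


# Made-up residuals standing in for the SARIMA residuals.
idx = pd.date_range("2000-01-01", periods=60, freq="MS")
resid = pd.Series(np.random.default_rng(42).normal(0, 0.1, 60), index=idx)
X = make_features(resid)
lower, upper = block_bootstrap_interval(np.zeros(24), resid.to_numpy())
```

Resampling whole blocks instead of single residuals preserves short-range dependence in the errors, which is the usual motivation for a block bootstrap on time series.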

3. Model 2: Hybrid STL-LSTM (model_2_stl_lstm.py)

This model leverages deep learning for forecasting the decomposed components.

  • Decomposition-First Approach: The model works on the Trend and Seasonal components generated by the STL decomposition in the first script.
  • LSTM for Trend Forecasting: A Long Short-Term Memory (LSTM) neural network is trained specifically to model the long-term trend component of the time series.
  • Hyperparameter Tuning: Optuna is used with TimeSeriesSplit to find the optimal architecture and training parameters for the LSTM model.
  • Naive Seasonal Forecasting: The seasonal component is forecasted by repeating the last observed seasonal cycle.
  • Hybrid Assembly: The final forecast is a combination of the LSTM trend forecast and the naive seasonal forecast.
  • Uncertainty Quantification: A Bootstrap method is applied to the training residuals of the final hybrid model to estimate prediction intervals.
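The seasonal forecast, hybrid assembly, and bootstrap steps above can be sketched in plain NumPy. The function names are illustrative, the additive combination of trend and seasonal forecasts is an assumption (the README only says they are combined), and the trend values below are a placeholder for what the LSTM would produce.

```python
import numpy as np


def naive_seasonal_forecast(seasonal, period, horizon):
    """Repeat the last observed seasonal cycle for `horizon` future steps."""
    last_cycle = np.asarray(seasonal, dtype=float)[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]


def assemble_forecast(trend_forecast, seasonal_forecast):
    """Additive recombination of the two component forecasts (assumed)."""
    return np.asarray(trend_forecast, dtype=float) + np.asarray(seasonal_forecast, dtype=float)


def residual_bootstrap_interval(point_forecast, train_residuals,
                                n_boot=1000, alpha=0.05, seed=0):
    """Prediction interval from resampling training residuals with replacement."""
    rng = np.random.default_rng(seed)
    point = np.asarray(point_forecast, dtype=float)
    res = np.asarray(train_residuals, dtype=float)
    draws = rng.choice(res, size=(n_boot, len(point)), replace=True)
    paths = point + draws
    lower = np.quantile(paths, alpha / 2, axis=0)
    upper = np.quantile(paths, 1 - alpha / 2, axis=0)
    return lower, upper


# Toy inputs for illustration only.
seasonal = np.tile([1.0, 2.0, 3.0, 4.0], 3)        # three cycles of period 4
seasonal_fc = naive_seasonal_forecast(seasonal, period=4, horizon=6)
trend_fc = np.linspace(10.0, 10.5, 6)              # placeholder for LSTM output
hybrid_fc = assemble_forecast(trend_fc, seasonal_fc)
train_resid = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
lo_band, hi_band = residual_bootstrap_interval(hybrid_fc, train_resid)
```

Taking the last `period` values keeps the phase correct regardless of where the training series ends, since the first forecast step always follows the value observed one full cycle earlier.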

📈 Key Visualizations Generated

The project produces a rich set of high-quality plots to document the entire analysis, including:

  • Global Temperature Anomaly Time Series with Confidence Intervals
  • Missing Value Matrix and Imputation Comparison
  • STL Decomposition Plots (Trend, Seasonal, Residual)
  • Differenced Time Series Plots
  • ACF & PACF Plots for Stationarity Analysis
  • Train/Test Split Visualization
  • SARIMA & XGBoost Model Fit on Training Data
  • Comprehensive Residual Analysis Plots (Time Series, ACF/PACF, Histogram, Q-Q Plot)
  • XGBoost Feature Importance & SHAP Summary Plots
  • Hybrid Model Forecasts vs. Actuals with Prediction Intervals
  • Final Long-term Forecast Plot (up to 2055) Showing the 20°C Threshold Crossing
  • Model Performance Comparison Bar Chart

⚙️ Technology Stack

  • Data Handling & Numerics: Pandas, NumPy
  • Time Series Analysis: Statsmodels, pmdarima, arch
  • Machine Learning: Scikit-learn, XGBoost
  • Deep Learning: TensorFlow 2.x (Keras)
  • Hyperparameter Optimization: Optuna
  • Model Explainability: SHAP
  • Visualization: Matplotlib, Seaborn
  • Serialization: Joblib

🚀 How to Run the Project

The project is designed to be run as a sequence of scripts.

1. Prerequisites

  • Python 3.10 or higher.
  • A virtual environment is highly recommended.

2. Installation

  1. Clone the repository:

    git clone https://github.com/DokiforU/Temperature_Predict.git
    cd Temperature_Predict
  2. Create and activate a virtual environment:

    # On macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # On Windows
    python -m venv venv
    .\venv\Scripts\activate
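  3. Install dependencies: Install the packages listed in requirements.txt. The pip freeze step is only needed to (re)generate that file, and only makes sense in an environment where the scripts already run with all packages installed.

    # Optional: (re)generate the file from a working environment
    pip freeze > requirements.txt
    # Install the listed packages into the active virtual environment
    pip install -r requirements.txt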

3. Execution Order

You must run the scripts in the following order, as they depend on each other's output.

  1. Run the EDA and Preprocessing script first:

    python eda_preprocessing.py

    This will generate the processed_data/ directory with the necessary files for the models.

  2. Run the first model:

    python model_1_sarima_xgboost.py

    This will train the SARIMA-XGBoost model and save all its outputs and plots to the model_1_output/ directory.

  3. Run the second model:

    python model_2_stl_lstm.py

    This will train the STL-LSTM model and save all its outputs and plots to the model_2_output/ directory.

4. Outputs

All generated artifacts, including visualizations (as PNG files), trained models, scalers, and final prediction data (as .joblib files), will be saved into their respective output directories (model_1_output/, model_2_output/). The terminal logs will provide detailed, real-time information about each step of the process.

📄 License

This project is distributed under the MIT License.

About

Personal use code for Southwest University 2025 Mathematical Modeling Competition.
