Global Temperature Prediction with Hybrid Time Series Models

This repository analyzes and forecasts global temperature anomalies using the HadCRUT5 dataset. It builds two hybrid time series models to describe past trends and predict future temperature levels, including forecasts for 2030, 2050, and beyond.

The entire workflow, from data exploration to model evaluation and visualization, is implemented in Python as a step-by-step pipeline.

🎯 Project Goals

Investigate Global Warming: To analyze historical temperature data and determine whether it shows a sustained warming trend.
Advanced Hybrid Modeling: To build and compare two distinct, state-of-the-art hybrid models for time series forecasting:
- Model 1: A SARIMA + XGBoost hybrid model.
- Model 2: An STL Decomposition + LSTM hybrid model.
Future Temperature Prediction: To use both models to forecast global average temperatures for 2030 and 2050, and to estimate when the global average temperature might reach a critical threshold of 20.00°C.
Comprehensive Visualization: To generate high-quality (600 DPI), intuitive visualizations for every stage of the process, from EDA to final model comparison and long-term forecasts.
Model Evaluation: To rigorously assess and compare the performance of both models to determine which is more accurate and reliable for this task.
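
The comparison in the last goal comes down to scoring both forecasts against the same held-out observations. The README does not state which metrics the scripts report, so the sketch below assumes RMSE and MAE. The function names, model labels, and numbers are illustrative only and are not results from this project.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily than MAE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error; the average size of a miss in the data's own units."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def compare_models(actual, forecasts):
    """Return {model_name: {"rmse": ..., "mae": ...}} for every forecast."""
    return {
        name: {"rmse": rmse(actual, pred), "mae": mae(actual, pred)}
        for name, pred in forecasts.items()
    }

# Made-up test-set values and forecasts, for illustration only.
actual = [0.90, 1.00, 1.10, 1.20]
forecasts = {
    "model_a": [0.95, 1.00, 1.05, 1.25],
    "model_b": [0.80, 1.10, 1.30, 1.00],
}
scores = compare_models(actual, forecasts)

# The model with the lowest RMSE on the shared test set is preferred.
best = min(scores, key=lambda name: scores[name]["rmse"])
```

Both models must be scored on the same test period for the comparison to be meaningful.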

🔬 Methodology & Pipeline

The project is structured as a sequential pipeline, with each script building upon the results of the previous one.

1. Exploratory Data Analysis & Preprocessing (`eda_preprocessing.py`)

This initial stage focuses on understanding and preparing the data for modeling.

Data Loading & Validation: Loads the CSV, parses time-series data, and performs initial integrity checks.
Missing Value Imputation: Detects and imputes missing values using linear interpolation, with visualizations to compare the before-and-after series.
STL Decomposition: Applies STL (Seasonal and Trend decomposition using Loess) to split the time series into Trend, Seasonal, and Residual components for deeper insight.
Stationarity Analysis: Conducts rigorous stationarity tests (ADF, KPSS, Phillips-Perron) on the original series, its components, and the differenced series.
ACF/PACF Analysis: Plots Autocorrelation and Partial Autocorrelation functions to help identify potential parameters for time series models.
Data Splitting & Scaling: Splits the data into training and testing sets and applies appropriate scaling (MinMaxScaler, RobustScaler) to different components in preparation for modeling.
Saving Artifacts: All processed data, scalers, and metadata are saved using joblib for use in the modeling scripts.
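
The imputation, splitting, and scaling steps above can be sketched as follows. This is a minimal illustration on a synthetic monthly series, not the code in `eda_preprocessing.py`. The 80/20 split ratio and the variable names are assumptions. The scaling is written out by hand with the same formula that scikit-learn's MinMaxScaler applies for its default (0, 1) range.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with two gaps; a stand-in for the real anomaly column.
index = pd.date_range("2000-01-01", periods=24, freq="MS")
series = pd.Series(np.linspace(0.0, 2.3, 24), index=index, name="anomaly")
series.iloc[[5, 12]] = np.nan

# 1. Linear interpolation fills each gap from its two neighbours.
filled = series.interpolate(method="linear")

# 2. Chronological split: the test set is strictly later than the training set.
#    Shuffling would leak future information into training.
split = int(len(filled) * 0.8)  # assumed ratio
train, test = filled.iloc[:split], filled.iloc[split:]

# 3. Min-max scaling with statistics taken from the training set only.
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)  # may fall outside [0, 1] for a trending series
```

Fitting the scaler on the training portion alone keeps test-set information out of the model inputs. For a series with an upward trend, scaled test values above 1 are expected.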

2. Model 1: Hybrid SARIMA-XGBoost (`model_1_sarima_xgboost.py`)

This model combines a classic statistical model with a powerful gradient boosting machine.

SARIMA Modeling: pmdarima.auto_arima is used to find the optimal parameters for a SARIMA model, which is then fitted on the training data to capture the primary linear trends and seasonality.
Residual Modeling with XGBoost: An XGBoost model is trained on the residuals of the SARIMA model. This allows it to capture the complex, non-linear patterns that SARIMA missed.
Advanced Feature Engineering: A rich set of features is created for XGBoost, including lags, rolling statistics, calendar features, and Fourier terms.
Hyperparameter Tuning: Optuna is used for Bayesian optimization to find the best hyperparameters for the XGBoost model through time-series cross-validation.
Model Explainability: SHAP (SHapley Additive exPlanations) is used to analyze feature importance and understand the XGBoost model's predictions.
Uncertainty Quantification: A Block Bootstrap method is implemented to generate robust 95% prediction intervals for the hybrid forecast.
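
The feature families named above (lags, rolling statistics, calendar features, Fourier terms) can be built with pandas roughly as shown below. This is a sketch under assumptions: the function name `build_features`, the lag set, window length, and number of harmonics are illustrative choices, not values taken from `model_1_sarima_xgboost.py`.

```python
import numpy as np
import pandas as pd

def build_features(residuals, lags=(1, 2, 12), window=3, period=12, harmonics=2):
    """Build a feature table for predicting residuals[t] from information before t."""
    feats = pd.DataFrame(index=residuals.index)

    # Lag features: the value observed `lag` steps earlier.
    for lag in lags:
        feats[f"lag_{lag}"] = residuals.shift(lag)

    # Rolling statistics on the series shifted by one step, so the window
    # ends at t-1 and never includes the target itself.
    past = residuals.shift(1)
    feats[f"roll_mean_{window}"] = past.rolling(window).mean()
    feats[f"roll_std_{window}"] = past.rolling(window).std()

    # Calendar feature taken from the DatetimeIndex.
    feats["month"] = residuals.index.month

    # Fourier terms: smooth encodings of the position within the seasonal cycle.
    t = np.arange(len(residuals))
    for k in range(1, harmonics + 1):
        feats[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        feats[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)

    feats["target"] = residuals
    # Rows at the start lack enough history for the longest lag; drop them.
    return feats.dropna()

# Toy residual series (a plain ramp) so the feature values are easy to check.
index = pd.date_range("2000-01-01", periods=36, freq="MS")
residuals = pd.Series(np.arange(36, dtype=float), index=index)
table = build_features(residuals)
```

Every column except `target` uses only values available before time t, which is what allows the table to be used with time-series cross-validation without leakage.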

3. Model 2: Hybrid STL-LSTM (`model_2_stl_lstm.py`)

This model leverages deep learning for forecasting the decomposed components.
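
A recurrent network such as an LSTM is trained on fixed-length input windows rather than on the raw series, so a component has to be reshaped into (samples, timesteps, features) before training. The sketch below shows one common way to do that with NumPy. The helper name `make_windows` and the lookback length are assumptions for illustration, not details from `model_2_stl_lstm.py`.

```python
import numpy as np

def make_windows(series, lookback):
    """Turn a 1-D series into (X, y) pairs: `lookback` past values -> next value."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - lookback
    if n_samples <= 0:
        raise ValueError("series must be longer than lookback")
    # Each row of X is a window of consecutive values; y is the value that follows it.
    X = np.stack([series[i : i + lookback] for i in range(n_samples)])
    y = series[lookback:]
    # Keras recurrent layers expect input shaped (samples, timesteps, features).
    return X[..., np.newaxis], y

# Toy ramp standing in for a scaled trend component.
trend = np.arange(10, dtype=float)
X, y = make_windows(trend, lookback=4)
```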

Decomposition-First Approach: The model works on the Trend and Seasonal components generated by the STL decomposition in the first script.
LSTM for Trend Forecasting: A Long Short-Term Memory (LSTM) neural network is trained specifically to model the long-term trend component of the time series.
Hyperparameter Tuning: Optuna is used with TimeSeriesSplit to find the optimal architecture and training parameters for the LSTM model.
Naive Seasonal Forecasting: The seasonal component is forecasted by repeating the last observed seasonal cycle.
Hybrid Assembly: The final forecast is a combination of the LSTM trend forecast and the naive seasonal forecast.
Uncertainty Quantification: A Bootstrap method is applied to the training residuals of the final hybrid model to estimate prediction intervals.
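
The last three steps above (naive seasonal forecast, hybrid assembly, bootstrap intervals) can be sketched in NumPy as follows. Several things here are assumptions: the components are combined by addition, which matches an additive STL decomposition but is not confirmed by this README; the trend forecast is a placeholder array rather than real LSTM output; and the function names, number of bootstrap draws, and residuals are illustrative.

```python
import numpy as np

def naive_seasonal_forecast(seasonal, horizon, period=12):
    """Repeat the last observed seasonal cycle for `horizon` steps."""
    last_cycle = np.asarray(seasonal, dtype=float)[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

def bootstrap_interval(point_forecast, residuals, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval from resampled training residuals added to the forecast."""
    rng = np.random.default_rng(seed)
    point_forecast = np.asarray(point_forecast, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Draw one residual per horizon step for each simulated path (i.i.d. resampling).
    draws = rng.choice(residuals, size=(n_boot, len(point_forecast)), replace=True)
    sims = point_forecast + draws
    lower = np.quantile(sims, alpha / 2, axis=0)
    upper = np.quantile(sims, 1 - alpha / 2, axis=0)
    return lower, upper

# Synthetic inputs, for illustration only.
seasonal_train = np.tile(np.sin(2 * np.pi * np.arange(12) / 12), 3)
trend_forecast = np.linspace(1.0, 1.2, 18)  # placeholder for the LSTM trend output
train_residuals = np.random.default_rng(1).normal(0.0, 0.1, size=120)

seasonal_forecast = naive_seasonal_forecast(seasonal_train, horizon=18)
hybrid_forecast = trend_forecast + seasonal_forecast  # assumed additive assembly
lower, upper = bootstrap_interval(hybrid_forecast, train_residuals)
```

This version resamples residuals independently, which ignores any autocorrelation left in them. A block bootstrap, as named for Model 1, resamples contiguous runs of residuals to preserve that dependence.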

📈 Key Visualizations Generated

The project produces a rich set of high-quality plots to document the entire analysis, including:

Global Temperature Anomaly Time Series with Confidence Intervals
Missing Value Matrix and Imputation Comparison
STL Decomposition Plots (Trend, Seasonal, Residual)
Differenced Time Series Plots
ACF & PACF Plots for Stationarity Analysis
Train/Test Split Visualization
SARIMA & XGBoost Model Fit on Training Data
Comprehensive Residual Analysis Plots (Time Series, ACF/PACF, Histogram, Q-Q Plot)
XGBoost Feature Importance & SHAP Summary Plots
Hybrid Model Forecasts vs. Actuals with Prediction Intervals
Final Long-term Forecast Plot (up to 2055) showing the 20°C threshold crossing
Model Performance Comparison Bar Chart
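
The threshold crossing marked on the long-term forecast plot amounts to finding the first forecast date at or above the threshold. A minimal sketch, assuming the forecast is available as parallel arrays of years and values; the function name and the forecast path below are hypothetical and are not output from either model.

```python
import numpy as np

def first_crossing(years, values, threshold):
    """Return the first year whose forecast is >= threshold, or None if never reached."""
    years = np.asarray(years)
    values = np.asarray(values, dtype=float)
    hits = np.flatnonzero(values >= threshold)
    return None if hits.size == 0 else years[hits[0]].item()

# Hypothetical linear forecast path, for illustration only.
years = np.arange(2025, 2056)
forecast = 19.1 + 0.04 * (years - 2025)

crossing_year = first_crossing(years, forecast, threshold=20.0)
never_reached = first_crossing(years, forecast, threshold=25.0)  # None within the horizon
```

Returning None when the threshold is not reached within the forecast horizon avoids reporting a crossing date the model never actually predicted.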

⚙️ Technology Stack

Data Handling & Numerics: Pandas, NumPy
Time Series Analysis: Statsmodels, pmdarima, arch
Machine Learning: Scikit-learn, XGBoost
Deep Learning: TensorFlow 2.x (Keras)
Hyperparameter Optimization: Optuna
Model Explainability: SHAP
Visualization: Matplotlib, Seaborn
Serialization: Joblib

🚀 How to Run the Project

The project is designed to be run as a sequence of scripts.

1. Prerequisites

Python 3.10 or higher.
A virtual environment is highly recommended.

2. Installation

Clone the repository:

git clone https://github.com/DokiforU/Temperature_Predict.git
cd Temperature_Predict

Create and activate a virtual environment:

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
.\venv\Scripts\activate

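Install dependencies: The repository does not appear to ship a requirements.txt, so install the packages listed in the Technology Stack section directly:
```
pip install pandas numpy statsmodels pmdarima arch scikit-learn xgboost tensorflow optuna shap matplotlib seaborn joblib
```
Once the scripts run successfully, you can pin the working versions with `pip freeze > requirements.txt` and reinstall them in another environment with `pip install -r requirements.txt`.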

3. Execution Order

You must run the scripts in the following order, as they depend on each other's output.

Run the EDA and Preprocessing script first:
```
python eda_preprocessing.py
```
This will generate the processed_data/ directory with the necessary files for the models.
Run the first model:
```
python model_1_sarima_xgboost.py
```
This will train the SARIMA-XGBoost model and save all its outputs and plots to the model_1_output/ directory.
Run the second model:
```
python model_2_stl_lstm.py
```
This will train the STL-LSTM model and save all its outputs and plots to the model_2_output/ directory.
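
If you prefer a single entry point, the three steps above can be wrapped in a small runner. This is a convenience sketch and not a file in the repository; the function name `run_pipeline` and its `runner` parameter are assumptions introduced here.

```python
import subprocess
import sys

# Script names as listed in this README, in the required order.
SCRIPTS = [
    "eda_preprocessing.py",
    "model_1_sarima_xgboost.py",
    "model_2_stl_lstm.py",
]

def run_pipeline(scripts=SCRIPTS, runner=subprocess.run):
    """Run each script in order with the current interpreter; stop at the first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later script never runs on missing or partial outputs.
        runner([sys.executable, script], check=True)
```

Call `run_pipeline()` from the repository root with the virtual environment active. Using `sys.executable` ensures the scripts run under the same interpreter that has the dependencies installed.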

4. Outputs

All generated artifacts, including visualizations (as PNG files), trained models, scalers, and final prediction data (as .joblib files), will be saved into their respective output directories (model_1_output/, model_2_output/). The terminal logs will provide detailed, real-time information about each step of the process.

📄 License

This project is distributed under the MIT License.
