Time Series Analysis using Box and Jenkins Methods

We will consider the analysis of a time series data set on electricity production available on Kaggle. The data is provided in Excel format.

import pandas as pd
data= pd.read_excel('Electric_Production.xlsx',header=0)
data.head(3)
##        DATE  IPG2211A2N
## 0  1/1/1985     72.5052
## 1  2/1/1985     70.6720
## 2  3/1/1985     62.4502

Now we shall convert the first column to datetime format.

data['DATE']=pd.to_datetime(data['DATE'], format='%m/%d/%Y')
data
##           DATE  IPG2211A2N
## 0   1985-01-01     72.5052
## 1   1985-02-01     70.6720
## 2   1985-03-01     62.4502
## 3   1985-04-01     57.4714
## 4   1985-05-01     55.3151
## ..         ...         ...
## 392 2017-09-01     98.6154
## 393 2017-10-01     93.6137
## 394 2017-11-01     97.3359
## 395 2017-12-01    114.7212
## 396 2018-01-01    129.4048
## 
## [397 rows x 2 columns]

Now we shall set the date column as the index of the time series.

data=data.set_index('DATE')
data.head(3)
##             IPG2211A2N
## DATE                  
## 1985-01-01     72.5052
## 1985-02-01     70.6720
## 1985-03-01     62.4502

As a first step, we plot the time series data to check for stationarity.

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(data.index, data['IPG2211A2N'], marker='o', linestyle='-', color='b')
plt.title("Time Series Plot of Electricity Production")
plt.xlabel("Date")
plt.ylabel("Electricity Production")
plt.grid(True)
plt.show()

Clearly the time series has a trend, and there is also seasonality; the variance appears to increase somewhat with the level of the series. Let us confirm this by decomposing the time series into its components. Since we have monthly data, we use a period of 12. From the time series plot the seasonal component appears roughly constant in amplitude, so we assume an additive model for the data.

import statsmodels.api as sm
decompose=sm.tsa.seasonal_decompose(data.IPG2211A2N, model='additive',period=12)
decompose.plot()
plt.show()

Clearly the data has seasonality.
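
If we want a closer look at the estimated components, the object returned by seasonal_decompose exposes them directly. A minimal sketch using the decompose object created above (output not shown here):

# Inspect one full year of the estimated seasonal component;
# with an additive model this pattern repeats every 12 months.
print(decompose.seasonal.head(12))

# The trend and irregular components are available as well.
print(decompose.trend.dropna().head(3))
print(decompose.resid.dropna().head(3))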

As the data is not stationary, we will use differencing to find the value of \(d\) that makes the series stationary.

We begin by applying first-order differencing.
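
In symbols, the first difference is \(\nabla x_t = x_t - x_{t-1}\), which is what the diff(1) call below computes.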

data['Difference1']=data['IPG2211A2N'].diff(1)
plt.figure(figsize=(12, 6))

# Original time series
plt.subplot(2, 1, 1)
plt.plot(data.index, data['IPG2211A2N'], marker='o', linestyle='-', color='b')
plt.title("Original Time Series")
plt.xlabel("Date")
plt.ylabel("Electricity Production")

# Plotting differenced time series
plt.subplot(2, 1, 2)
plt.plot(data.index, data['Difference1'], marker='o', linestyle='-', color='r')
plt.title("Differenced Time Series (1st Order)")
plt.xlabel("Date")
plt.ylabel("Differenced Value")

plt.tight_layout()
plt.show()

We may suspect some variance instability in the differenced data. Nevertheless, we will proceed to fit an ARMA-type model to this differenced data and then perform diagnostic checks on the residuals; a sketch of that fitting step is given at the end of this section.

We will check the stationarity of the differenced data now.

from statsmodels.tsa.stattools import adfuller
adf_result=adfuller(data.IPG2211A2N)
print("ADF Statistic:", adf_result[0])
## ADF Statistic: -2.256990350047235
print("p-value:", adf_result[1])
## p-value: 0.1862146911658712

For the ADF test, the null hypothesis is that the series has a unit root, i.e. that it is not stationary. Since the p-value (0.186) is greater than 0.05, we fail to reject the null hypothesis here: the original series is not stationary.
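
Since the ADF test will be applied to several versions of the series below, it can be convenient to wrap it in a small helper. This is only a convenience sketch; the function name adf_report is ours, not part of statsmodels.

from statsmodels.tsa.stattools import adfuller

def adf_report(series, alpha=0.05):
    # Run the ADF test on the series (dropping NaNs created by differencing)
    # and print the statistic, p-value and critical values.
    result = adfuller(series.dropna())
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    print("Critical Values:", result[4])
    print("Stationary at the", alpha, "level:", result[1] < alpha)

# Example usage: adf_report(data['Difference1'])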

result1=adfuller(data['Difference1'].iloc[1:])
print("ADF Statistic:", result1[0])
## ADF Statistic: -7.104890882267311
print("p-value:", result1[1])
## p-value: 4.0777865655394095e-10

Since the p-value is well below 0.05, we reject the null hypothesis that the differenced series is not stationary. Hence the first-differenced process is stationary, which suggests \(d = 1\).

Now we shall examine the ACF and PACF to identify the orders of the AR and MA components: for a pure AR(\(p\)) process the PACF cuts off after lag \(p\), while for a pure MA(\(q\)) process the ACF cuts off after lag \(q\). First we compute the sample ACF numerically.

from statsmodels.tsa.stattools import acf, pacf
acf(data.Difference1.iloc[1:], nlags=10)
## array([ 1.        ,  0.37453474, -0.42845915, -0.78734755, -0.40308672,
##         0.38518364,  0.75615662,  0.39082151, -0.4002376 , -0.78855506,
##        -0.40331459])
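
For completeness, the sample PACF can be computed numerically in the same way (values not shown here):

pacf(data.Difference1.iloc[1:], nlags=10, method='ywm')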

Now we shall plot the ACF and PACF.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot ACF and PACF
fig, ax = plt.subplots(2, 1, figsize=(10, 8))

# ACF plot
plot_acf(data.Difference1.iloc[1:], lags=20, ax=ax[0])
ax[0].set_title("Autocorrelation Function (ACF)")

# PACF plot
plot_pacf(data.Difference1.iloc[1:], lags=20, ax=ax[1], method='ywm', alpha=0.05)
ax[1].set_title("Partial Autocorrelation Function (PACF)")

plt.tight_layout()
plt.show()

The ACF plot shows a clear seasonal pattern. Hence we also apply seasonal differencing at lag 12 to the already differenced series.
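
In backshift-operator notation, the doubly differenced series analysed below is \(w_t = (1 - B)(1 - B^{12})\,x_t\), i.e. a first difference followed by a seasonal difference at lag 12.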

data['diff1s1'] = data['Difference1'].diff(12)
data = data.dropna(subset=['diff1s1'])

Now we shall apply the ADF test to the seasonally differenced series.

result1=adfuller(data.diff1s1)
print("ADF Statistic:", result1[0])
## ADF Statistic: -8.022039209985028
print("p-value:", result1[1])
## p-value: 2.0639252090769555e-12
print("Critical Values:", result1[4])
## Critical Values: {'1%': -3.4482453822848496, '5%': -2.8694261442901396, '10%': -2.5709711770439507}

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the series is stationary. Now we consider the ACF and PACF plots of this series.

plt.figure(figsize=(12, 6))

# ACF plot
plt.subplot(1, 2, 1)
plot_acf(data.diff1s1, ax=plt.gca(), lags=25)
plt.title("ACF Plot")

# PACF plot
plt.subplot(1, 2, 2)
plot_pacf(data['diff1s1'], ax=plt.gca(), lags=25, method='ywm')
plt.title("PACF Plot")

plt.tight_layout()
plt.show()
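
To preview the estimation and diagnostic-checking stage referred to earlier, here is a minimal sketch using the SARIMAX class from statsmodels. The orders (1, 1, 1)(1, 1, 1, 12) are illustrative placeholders only and should be replaced by the orders suggested by the ACF/PACF plots above (for example after comparing AIC values across candidate models).

from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox

# Illustrative seasonal ARIMA specification: d=1 and D=1 match the
# differencing applied above, while the AR/MA orders are placeholders.
# Note that data was trimmed by the dropna() step above.
model = SARIMAX(data['IPG2211A2N'], order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)
print(fit.summary())

# Diagnostic checks on the residuals: standardized residuals, histogram,
# Q-Q plot and residual ACF ...
fit.plot_diagnostics(figsize=(10, 8))
plt.show()

# ... plus a Ljung-Box test for remaining autocorrelation.
print(acorr_ljungbox(fit.resid, lags=[12, 24]))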