Walk through TIME — Part-1 (Basics of Time Series Analysis in Python)

Hitesh Tripathi
14 min read · Dec 26, 2020

1. Introduction

Time series data is a series of data points indexed (or listed or graphed) in time order, typically taken at successive, equally spaced points in time. It is thus a sequence of discrete-time data that depends on time.

Time series analysis comprises modelling techniques for analysing data points in time order, in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values, giving a look into the near future.

Analysis of time series can further be divided into descriptive and predictive (forecasting) analysis, based on the area of research and its related questions. Data can further be divided into three types:

Time series data: A set of observations on the values that a variable takes at different times.

Cross-sectional data: Data of one or more variables, collected at the same point in time.

Pooled data: A combination of time series data and cross-sectional data.

Time series analysis can be useful to see how a given asset, security or economic variable changes over time. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average (DJI). Time series analysis is spread over many domains, from chromatographic/spectral analysis to share prices in stock markets.


Forecasting with time series analysis involves extrapolation, which is extremely complex. But a forecasted value, together with an estimate of the uncertainty associated with it, can make the result extremely valuable.


Here we take a deep dive into time series analysis (TSA), as a novice-friendly approach, using Python code to explore data insights. Below are some of the most common Python libraries used in time series analysis (but the field is not restricted to these):

  • NumPy: matrix data structures and mathematical operations on arrays, such as trigonometric, statistical, and algebraic routines; pandas objects rely heavily on NumPy objects.
  • Pandas: working with structured (tabular, multidimensional, potentially heterogeneous) and time series data.
  • Datetime: consists of two modules, datetime and calendar, with functionality for reading, formatting and manipulating time.
  • SciPy: uses NumPy arrays as its basic data structure, and comes with modules for various commonly used tasks in scientific programming, including linear algebra, integration (calculus), ordinary differential equation solving, and signal processing.
  • Seaborn: a data visualisation library based on matplotlib.
  • Scikit-learn: used for statistical modelling and machine learning, as it contains various customisable regression, classification and clustering models; a machine learning module built on top of SciPy.
  • Statsmodels: statistical data exploration, statistical modelling and computations.
  • Matplotlib: data visualisation in various formats.
  • Prophet: a forecasting library for time series data, open-sourced by Facebook.

2. Exploratory Data Analysis (EDA)

2.1 Loading Data from CSV file

Loading a CSV (comma-separated values) file using pandas:

  • The Pandas library in Python provides excellent, built-in support for time series data.
  • Data preprocessing is a crucial step in data mining; depending on the problem, it is used to transform the raw data into a useful and efficient format.
  • Indexing time series data: pandas represents time series datasets as a Series (a one-dimensional array with a time label for each row) or a DataFrame (a collection of Series as columns; simply, a table with rows and columns).
# importing libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
# Download the csv file from the resources and place it in the working directory;
# header=[0] uses the first row as column labels, parse_dates=[0] parses the first column as dates
dataframe = pd.read_csv('daily-total-female-births-CA.csv', header=[0], parse_dates=[0])
dataframe.head()
(Output: the top 5 rows of the dataframe)

2.2 Loading Data as a series

A pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A pandas Series is essentially a single column in a spreadsheet.

# squeeze=True reduces the dimension where possible: with the date column taken as the
# index (index_col=[0]), only the births column remains, so the dataframe is reduced to a Series.
series = pd.read_csv('daily-total-female-births-CA.csv', header=0, parse_dates=[0], index_col=[0], squeeze=True)
series.head()
(Output: the same data as a Series)

The dataset used here is the daily total female births in California dataset.

2.3 Indexing Time Series Data

# df2 is the dataframe loaded in section 2.1, keeping the date as a regular column
df2 = dataframe.copy()
# The Series has a single dimension: dates as the index, births as the values
series.shape
# The dataframe has two columns: date and births
df2.shape

Querying by time

# Print only the data for January 1959
print(series['1959-01'])
# Select rows between two dates
df2[(df2['date'] > '1959-01-01') & (df2['date'] <= '1959-01-21')]

3. Data Visualisation

Plotting time series data is an important step, as it provides insight into the data and helps us understand what the data is telling us and what we can infer from it. Below are a few key patterns to inspect in visualisations; this also gives a brief preview of decomposition.

By visualising a time series, the following patterns can be identified (a small synthetic example follows the figure captions below):

  1. Trend (an increasing or decreasing pattern observed over a period of time)
  2. Seasonality (short-term movements that occur due to seasonal factors, e.g., the quarter of the year, the month, or the day of the week)
  3. Cyclic component (a pattern that exists when the data exhibit rises and falls that are not of a fixed period, e.g., a business cycle such as a recession)
  4. Noise (no pattern, just random variation)
(Figures: patterns in time series data (1. noise, zoomed out to a year of data; 2. trend + seasonality; 3. trend + deseasonalised) and decomposition of a time series)
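As a minimal synthetic sketch (not from the article's dataset; all names and parameters here are illustrative), the snippet below generates and plots a series that combines the patterns listed above:

# A synthetic series combining trend, seasonality and noise (illustrative only)
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
t = np.arange(365)
trend = 0.05 * t                                # steadily increasing level
seasonality = 5 * np.sin(2 * np.pi * t / 30.5)  # roughly monthly oscillation
noise = rng.normal(0, 1, len(t))                # random variation with no pattern

synthetic = pd.Series(trend + seasonality + noise,
                      index=pd.date_range('1959-01-01', periods=len(t)))
synthetic.plot(title='Trend + seasonality + noise')
plt.show()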
Regression plot: checking fit and outliers.
#importing plotting libraries
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# Regression plot of births against the row index; order=1 (the default) fits a straight line
sns.regplot(x=df2.index.values, y=df2['births'])
sns.regplot(x=df2.index.values, y=df2['births'], order=1)
(Fig. left: time series plot (seasonality + noise); Fig. right: regression plot)

4. Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. A feature is an attribute or property shared by all of the independent units on which analysis or prediction is to be done. Any attribute can be a feature, as long as it is useful to the model or to solving the problem. Features can be used to improve the performance of machine learning algorithms, and they differ from dataset to dataset and problem to problem.

4.1 Date time features

For manipulating date- and time-related information, and converting it into formats accessible for analysis, we use the datetime functionality available through pandas:

df2.head(5)
features = df2.copy()
#Adding Year in our dataset
features['year'] = df2['date'].dt.year
#Adding Month in our dataset
features['month'] = df2['date'].dt.month
#Adding day in our dataset
features['day'] = df2['date'].dt.day
features.head(5)

4.2 Lag feature

A lag is essentially a delay, created with the shift() function. Just as correlation shows how similar two time series are, autocorrelation describes how similar a time series is to itself. A lag plot is a special type of scatter plot in which the two variables (X, Y) are “lagged”. Past values are known as lags: for a discrete sequence of values, t-1 is lag 1, t-2 is lag 2, and so on; for lag 1, you compare the time series with the same series shifted by one step.

To create lag features:

# Lag feature of t-1 (first-order lag, i.e., lag 1)
features['lag1'] = df2['births'].shift(1)
# A 365-step lag: the value from one year earlier
features['lag2'] = df2['births'].shift(365)
features.head()

Why are lag plots used?

  • Model suitability and model selection.
  • Outliers (easily distinguishable on a lag plot).
  • Randomness (the pattern of the data points).
  • Serial correlation/autocorrelation (error terms in a time series carrying over from one period to another).
  • Seasonality (plot observations for a greater number of periods or lags; data with seasonality will repeat itself periodically in a sine- or cosine-like wave).
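Pandas also ships a ready-made lag plot. As a minimal sketch, the snippet below applies it to the births series loaded earlier; lag=1 plots each value against its predecessor.

# Lag plot of the births series (value at t against value at t-1)
from pandas.plotting import lag_plot
from matplotlib import pyplot as plt

lag_plot(series, lag=1)
plt.show()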

(Figure: lag plots of three different datasets, from Kaggle)

Creating Lag plots

# Load the US airline monthly aircraft miles flown dataset (also used in section 7)
miles_df = pd.read_csv('us-airlines-monthly-aircraft-miles-flown.csv', header=0, parse_dates=[0])
miles_df['lag1'] = miles_df['MilesMM'].shift(1)
sns.regplot(x=miles_df['lag1'], y=miles_df['MilesMM'])

4.3 Window feature (Rolling Window Feature)

The forecast point defines an arbitrary point in time at which a prediction is being made. The feature derivation window (FDW) defines a rolling window, relative to the forecast point, which can be used to derive descriptive features. Finally, the forecast window (FW) defines the range of future values we wish to predict, called forecast distances (FDs).

A rolling window also helps assess the stability of a model over time: a common time-series modelling assumption is that the coefficients are constant with respect to time, and checking for instability amounts to examining whether the coefficients are time-invariant. It can likewise be used to assess the forecast accuracy of the model.

# The rolling mean shifts a window along the series and calculates the mean within that window
features['Roll_mean'] = df2['births'].rolling(window=2).mean()
features.head(5)
# Rolling maximum over a window of 3 observations
features['Roll_max'] = df2['births'].rolling(window=3).max()
features.head(5)

4.4 Expanding window features

This is an extended version of the rolling window technique. In a rolling window, the size of the window is constant and it slides forward in time, so only the most recent values are considered and older values are ignored. In an expanding window, by contrast, the window grows with each new observation, so all past values up to the current point are included.

features['Expand_max'] = df2['births'].expanding().max()
features.head(10)

5. Stationarity

Stationarity is an important concept in time series analysis: it means that the statistical properties of the process generating a time series do not change over time.

Why is stationarity important? If a time series is not stationary, the statistical properties of the system change over time and forecasting cannot give reliable results. For prediction, the overall behaviour of the data should remain constant; only then do statistical tests and models built on it give meaningful insights.

Conditions for stationarity:

  1. Constant mean over time.
  2. Constant variance over time (homoscedasticity).
  3. Autocovariance that depends only on the lag, not on time.

(Figure: constancy in mean and variance)

In time series data, trend or seasonality contribute to non-stationarity. Thus, time series with trends, or with seasonality, are not stationary: the trend and seasonality will affect the value of the time series at different times.

Seasonality is the presence of variations that occur at specific regular intervals less than a year, such as weekly, monthly, or quarterly. Seasonality may be caused by various factors, such as weather, vacation or any event and consists of periodic, repetitive, and generally regular and predictable patterns in the levels of a time series.

Seasonal fluctuations in a time series can be contrasted with cyclical patterns. The latter occur when the data exhibit rises and falls that are not of a fixed period. Such non-seasonal fluctuations are usually due to economic conditions and are often related to the “business cycle”; their period usually extends beyond a single year, and the fluctuations usually span at least two years.

Methods to check or detect stationarity (the most common tests):

  1. Visual inspection (as a general idea): there are some basic properties of non-stationary data (trend or seasonality) that we can look for.

2. Autocorrelation plot (ACF): the most widely used tool in time series analysis, as it can check both stationarity and seasonality. For a stationary time series, the ACF drops to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Also, for non-stationary data, the value of r1 (the lag-1 autocorrelation) is often large and positive.

(Figures: the ACF of the Google stock price, a non-stationary series whose ACF decays slowly, and of the daily changes in the Google stock price, a stationary series whose ACF drops to zero quickly)

Autocorrelation plots are a commonly-used tool for checking randomness in a data set. This randomness is ascertained by computing autocorrelations for data values at varying time lags. If random, such autocorrelations should be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations will be significantly non-zero.

from pandas.plotting import autocorrelation_plot
autocorrelation_plot(miles_df['MilesMM'])

3. The Augmented Dickey-Fuller (ADF) test (a unit root test): tests the null hypothesis (H0) that a unit root is present in the time series sample. The alternative hypothesis (Ha) differs depending on which version of the test is used, but is usually stationarity or trend-stationarity. This test is valid for large samples. The result depends on the p-value obtained at the end: a p-value greater than 0.05 (or a test statistic above the critical value) indicates a non-stationary time series, and vice versa.

#augmented dickey fuller test
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    # Determine rolling statistics
    movingAverage = timeseries.rolling(window=12).mean()
    movingSTD = timeseries.rolling(window=12).std()

    # Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(movingAverage, color='red', label='Rolling mean')
    std = plt.plot(movingSTD, color='black', label='Rolling std')
    plt.legend(loc='best')
    plt.title('Rolling mean and standard deviation')
    plt.show(block=False)

    # Perform the Dickey-Fuller test
    print("Results of Dickey Fuller Test:")
    dftest = adfuller(timeseries['#Passengers'], autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags used', 'No. of Observations used'])
    for key, value in dftest[4].items():
        dfoutput['Critical value (%s)' % key] = value
    print(dfoutput)

# indexed_data: the Airline Passengers dataframe, indexed by month, with a '#Passengers' column
test_stationarity(indexed_data)
(Figure: stationarity comparison of the time plot, rolling mean and rolling standard deviation on the Airline Passengers dataset)

4. The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test (a unit root test): used for testing the null hypothesis that an observable time series is stationary around a deterministic trend (i.e. trend-stationary) against the alternative of a unit root. Note that the null hypothesis is the reverse of the ADF test's.
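A minimal sketch of the KPSS test with statsmodels, applied here to the miles series loaded earlier; because the null hypothesis is stationarity, a p-value below 0.05 now suggests non-stationarity.

#KPSS test
from statsmodels.tsa.stattools import kpss

kpss_stat, p_value, n_lags, critical_values = kpss(miles_df['MilesMM'], regression='c')
print('KPSS statistic: %.4f, p-value: %.4f' % (kpss_stat, p_value))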

Methods for making a time series stationary (by detrending and deseasonalizing):

Removing Seasonality:

The seasonal component can be modelled and removed from the time series. This process is called seasonal adjustment, or deseasonalizing. A time series from which the seasonal component has been removed is called seasonal stationary. A simple sketch follows below.
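One simple way to deseasonalize, sketched below under the assumption of monthly data with yearly seasonality, is seasonal differencing: subtract from each value the value from one season (here 12 months) earlier, using the miles series from earlier.

# Seasonal differencing: subtract the value from the same month one year earlier
miles_df['MilesMM_seasonal_diff'] = miles_df['MilesMM'] - miles_df['MilesMM'].shift(12)
miles_df['MilesMM_seasonal_diff'].plot(title='Seasonally differenced series')
plt.show()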

6. Transforms for Time Series Data

Transforming a non-stationary time series into a stationary one.

Any transform operations applied to the series also require a similar inverse transform to be applied to the predictions. This is required so that the resulting performance measures are on the same scale as the output variable and can be compared to classical forecasting methods.
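As a minimal sketch of what such an inverse looks like, the snippet below applies a log transform followed by first-order differencing to the births series, then undoes both steps; the variable names are illustrative.

# Forward transform: log, then first-order difference
log_series = np.log(series)
diffed = log_series.diff()

# Inverse transform: cumulative sum undoes the differencing, exp undoes the log
recovered_log = diffed.fillna(0).cumsum() + log_series.iloc[0]
recovered = np.exp(recovered_log)  # back on the original scale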

  • Power Transform (lambda parameter)
    A power transform reshapes a data distribution (changing the scale of the data) to make it more normal (Gaussian). It is a family of monotonic transformations of the data built from power functions. This is a useful data transformation technique to stabilise variance, especially in time series, make the data more normal-distribution-like, and improve the validity of measures of association such as the Pearson correlation between variables. Popular examples are the log transform (positive values only) and generalised versions such as the Box-Cox transform (positive values) or the Yeo-Johnson transform (positive and negative values). A short sketch follows the figure below.
(Figures: the original plot before the log transform, and the log-transformed series)
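A minimal sketch of two common power transforms on the births series: the plain log transform and the Box-Cox transform from SciPy, which fits the lambda parameter itself (both require positive values).

# Log transform and Box-Cox transform of the births series
from scipy import stats

log_births = np.log(series)                       # log transform
boxcox_births, lam = stats.boxcox(series.values)  # Box-Cox; lam is the fitted lambda
print('Fitted Box-Cox lambda: %.3f' % lam)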
  • Difference Transform
    A difference transform is a simple way of removing systematic structure from a time series (applied later in the coding session). For example, a trend can be removed by subtracting the previous value from each value in the series; this is called first-order differencing. The process can be repeated (e.g. differencing the differenced series) to remove second-order trends, and so on.
# Create the first-order difference, then plot the original and differenced series
miles_df['MilesMM_diff_1'] = miles_df['MilesMM'].diff(1)
miles_df['MilesMM'].plot()
miles_df['MilesMM_diff_1'].plot()
(Fig. a, left: without differencing; Fig. b, right: with differencing)

Below, the log transform is used in combination with differencing, blending their effects for a better result on stationarity.

# indexed_data_logScale: the log-transformed Airline Passengers series (e.g. np.log of the indexed data)
logdata_diffshift = indexed_data_logScale - indexed_data_logScale.shift()
(Figure: the log-transformed series, and its difference with the previous value)
  • Standardisation (mean and standard deviation statistics)
    Standardisation is a transform for data with a Gaussian distribution; the mean and standard deviation are used for scaling. It subtracts the mean and divides the result by the standard deviation of the data sample. This has the effect of transforming the data to have a mean of zero (centred) and a standard deviation of 1. The resulting distribution is called a standard Gaussian distribution, or standard normal, hence the name of the transform.
    X_new = (X - mean) / std
  • Normalisation (min and max values)
    Normalisation is a rescaling of data from the original range to a new range, usually [0, 1] (or sometimes [-1, 1]). This transformation squeezes the n-dimensional data into an n-dimensional unit hypercube. Normalisation is useful when there are no outliers, as it cannot cope with them.
    X_new = (X - X_min) / (X_max - X_min)
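Both rescalings are one-liners in pandas; here is a minimal sketch on the births series, implementing the two formulas above directly:

# Standardisation: mean 0, standard deviation 1
standardised = (series - series.mean()) / series.std()
# Normalisation: rescale into the range [0, 1]
normalised = (series - series.min()) / (series.max() - series.min())
print(standardised.describe())
print(normalised.describe())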

7. Decomposing Time Series

The decomposition of time series is a statistical process that deconstructs a time series into several components, each representing one of the underlying categories of patterns. This is an important technique for all types of time series analysis, especially for seasonal adjustment. It seeks to construct, from an observed time series, a number of component series (that could be used to reconstruct the original by additions or multiplications) where each of these has a certain characteristic or type of behavior. For example, time series are usually decomposed into:

  • Trend: The trend component at time t, which reflects the long-term progression of the series (secular variation). A trend exists when there is a persistent increasing or decreasing direction in the data. The trend component does not have to be linear.
  • Seasonal: The seasonal component at time t, reflecting seasonality (seasonal variation). A seasonal pattern exists when a time series is influenced by seasonal factors. Seasonality occurs over a fixed and known period (e.g., the quarter of the year, the month, or day of the week).
  • Cyclic: The cyclical component at time t, which reflects repeated but non-periodic fluctuations. The duration of these fluctuations depends on the nature of the time series.
  • Irregular (Noise): The irregular component (or “noise”) at time t, which describes random, irregular influences. It represents the residuals or remainder of the time series after the other components have been removed.

Decomposition Models- The interactions between trend and seasonality are typically classified as either additive or multiplicative. The additive model is useful when the seasonal variation is relatively constant over time. The multiplicative model is useful when the seasonal variation increases over time.

Additive Model- The components add together to make the time series. If you have an increasing trend, you still see roughly the same size peaks and troughs throughout the time series. This is often seen in indexed time series where the absolute value is growing but changes stay relative.

y(t) = Trend + Seasonality + Cyclic + Noise

Multiplicative Model- In a multiplicative time series, the components multiply together to make the time series. If you have an increasing trend, the amplitude of seasonal activity increases. Everything becomes more exaggerated.

y(t) = Trend * Seasonality * Cyclic * Noise

# Decompose the monthly miles series into trend, seasonal and residual components
from statsmodels.tsa.seasonal import seasonal_decompose
miles_decomp_df = pd.read_csv('us-airlines-monthly-aircraft-miles-flown.csv', header=0, parse_dates=[0])
miles_decomp_df.index = miles_decomp_df['Month']  # use the month as the index
result = seasonal_decompose(miles_decomp_df['MilesMM'], model='additive')
result.plot()
plt.show()
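The returned DecomposeResult exposes each component (result.trend, result.seasonal, result.resid), so, as a small sketch, the seasonal part can be subtracted out to obtain a seasonally adjusted series; pass model='multiplicative' instead when the seasonal swing grows with the level of the series.

# Subtract the seasonal component to get a seasonally adjusted series (additive model)
deseasonalised = miles_decomp_df['MilesMM'] - result.seasonal
deseasonalised.plot(title='Seasonally adjusted miles')
plt.show()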

Time series modelling and forecasting techniques will be covered in the next post.

If you found this article interesting and it contributed to your learning, give it a few claps or, better still, share it with your friends or colleagues.

You can find me on LinkedIn: https://www.linkedin.com/in/hiteshtripathi/

