Tuesday, January 8, 2019

Multiple regression demo

I've been doing some data science work recently, and I've found myself in a position where I need to understand the effects of multiple independent variables on a single dependent variable. One of the most commonly used techniques for examining these relationships is multiple regression, but I've found surprisingly few good tutorials online that explain how it works.

For this demonstration, I will focus on the relationship between height, weight, and blood pressure. I created a completely artificial data set for illustrative purposes. The demonstration uses Statsmodels and Python; your mileage may vary with other packages. First, let's import the relevant packages:

import pandas as pd
import numpy as np 
import statsmodels.api as sm
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
from pylab import rcParams
plt.ion()

The first thing I want to do is create the dataset. I am intentionally forcing height and weight to be correlated in order to demonstrate the performance of multiple regression on correlated data.

N = 100
xName1 = 'Height cm'
xName2 = 'Weight kg'
yName = 'Blood Pressure (sys)'
df = pd.DataFrame(columns=[xName1, xName2, yName])
# Height: random integers from 140 to 199 cm
df[xName1] = [np.random.randint(140, 200) for i in range(N)]
# Weight: half the height plus Gaussian noise, so the two predictors are correlated
df[xName2] = df[xName1]/2 + 8 * (np.random.randn(N) - 0.5)
# Blood pressure: a linear combination of height and weight plus noise
df[yName] = -1*df[xName1] + 1.5*df[xName2] + 5*(np.random.randn(N) - 0.5) + 160
X1 = df[xName1]
X2 = df[xName2]
XBoth = df[[xName1, xName2]]
y = df[yName]

Note that I initialize the first column, X1 (height), as a set of 100 random integers between 140 and 199. The second column, X2 (weight), is the height divided by 2 plus some Gaussian noise. The final column, y (blood pressure), is created by applying the formula:
-1 * height + 1.5 * weight + noise + 160
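Before visualizing anything, a quick numeric sanity check (a minimal sketch using the df created above) confirms that the two predictors really are correlated:

# Pearson correlation between the two predictors; expect a value well above 0
print(df[[xName1, xName2]].corr())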
To show visually that height and weight are correlated, I wrote a custom plotting function, plot2D, which we'll also reuse later.

def plot2D(model, dataX, dataY, color):
    """Plot a fitted single-predictor OLS model: regression line, scatter, and fit stats in the title."""
    xName = dataX.name
    yName = dataY.name
    params = [float(x) for x in model.params]
    intercept = params[0]
    slope = params[1]
    prediction = slope*dataX + intercept
    plt.plot(dataX, prediction, color)
    plt.scatter(dataX, dataY, edgecolors='k', facecolors='none')
    plt.xlabel(xName)
    plt.ylabel(yName)
    # Build a title showing each coefficient, its p-value, R^2, and the F-test p-value
    pvaldict = model.pvalues
    paramdict = model.params
    l1 = ['p_' + str(x) + ' = ' + '{:0.0e}'.format(pvaldict[x]) + ' ' for x in pvaldict.keys()]
    l2 = ['Coef_' + str(x) + ' = ' + str(round(paramdict[x], 2)) + ' ' for x in paramdict.keys()]
    l3 = 'R^2 = ' + str(round(model.rsquared, 3)) + ' f_p = ' + '{:0.0e}'.format(model.f_pvalue)
    plt.title(' '.join(l1) + '\n' + ' '.join(l2) + '\n' + l3)

Then I fit a simple regression model in statsmodels to predict weight from height, and plot the regression line and data points using the plot2D function above:

# Regress Weight on Height to show the built-in correlation between the two predictors
modelcorr = sm.OLS(X2, sm.add_constant(X1)).fit()
plt.figure('Corr')
plot2D(modelcorr, X1, X2, 'gray')
plt.tight_layout()
plt.show()

Note that I have to add a constant to X1 in order to get a regression with an intercept. I obtain the following plot:


Note the high R^2 value (0.554) in the title. I created the 'weight' variable by dividing 'height' by a constant and adding a bit of noise, so this is what we should expect. The p-value of the model (f_p) is also quite low, indicating that we can reject the null hypothesis that Height and Weight are uncorrelated. The coefficient obtained from this regression (Coef_Height cm) is 0.53, which is very close to the 0.5 we used when creating the weight from the height, as it should be. (For the moment, ignore the other variables in the title; we'll get back to those.)
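If you prefer a tabular view of the same statistics, the fitted results object can print them directly; this shows the same coefficient, p-values, and R^2 that plot2D squeezes into the plot title:

# Full regression table for the height-vs-weight fit
print(modelcorr.summary())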

What we want to do now is see how Height and Weight affect blood pressure. First, we'll perform separate single-variable regressions for Height vs. Blood Pressure (model1) and Weight vs. Blood Pressure (model2). Then we'll perform a multiple regression, predicting blood pressure from both height and weight together. To create the models:

model1 = sm.OLS(y, sm.add_constant(X1)).fit()     # Blood Pressure ~ Height
model2 = sm.OLS(y, sm.add_constant(X2)).fit()     # Blood Pressure ~ Weight
model3 = sm.OLS(y, sm.add_constant(XBoth)).fit()  # Blood Pressure ~ Height + Weight

Now we want to visualize these models and look at their stats. We'll start with the single-variable models, model1 and model2.

color1 = 'r'
color2 = 'b'

fig = plt.figure('Demo', figsize=(24, 15))

# Top-left panel: Height vs. Blood Pressure
ax = fig.add_subplot(2, 3, 1)
plot2D(model1, X1, y, color1)

# Bottom-left panel: Weight vs. Blood Pressure
ax = fig.add_subplot(2, 3, 4)
plot2D(model2, X2, y, color2)

We obtain these plots:

Note that while we do observe some single-variable correlation between Height and Blood Pressure and between Weight and Blood Pressure, the R^2 values here are relatively low (0.091 and 0.14, respectively).
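If you'd rather read these numbers off the fitted results objects than off the plot titles, a small sketch using model1 and model2 from above:

# R^2 for each single-variable fit
print('Height only: R^2 =', round(model1.rsquared, 3))
print('Weight only: R^2 =', round(model2.rsquared, 3))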

Because our data is, in fact, 3-dimensional, I'll show the same data and regression lines on a 3-D plot. In three dimensions, each regression line becomes a plane that has a slope in the direction of only a single variable. To plot the 3-D data, we'll need a new function.

def plot3D(ax, model, dataX1, dataX2, dataY, coeffs, color):
    """Plot a fitted model as a plane in 3-D (coeffs = [intercept, slope_x1, slope_x2]), plus the data points."""
    xName1 = dataX1.name
    xName2 = dataX2.name
    yName = dataY.name
    # Build a grid covering the observed range of both predictors
    meshx = np.linspace(min(dataX1), max(dataX1), 55)
    meshy = np.linspace(min(dataX2), max(dataX2), 55)
    X, Y = np.meshgrid(meshx, meshy)
    # Evaluate the plane on the grid
    Z = coeffs[0] + coeffs[1]*X + coeffs[2]*Y
    ax.plot_surface(X, Y, Z, color=color, alpha=0.75)
    ax.scatter3D(dataX1, dataX2, dataY, color='black')
    ax.set_xlabel(xName1)
    ax.set_ylabel(xName2)
    ax.set_zlabel(yName)
    # Same title format as plot2D: p-values, coefficients, R^2, and F-test p-value
    pvaldict = model.pvalues
    paramdict = model.params
    l1 = ['p_' + str(x) + ' = ' + '{:0.0e}'.format(pvaldict[x]) + ' ' for x in pvaldict.keys()]
    l2 = ['Coef_' + str(x) + ' = ' + str(round(paramdict[x], 2)) + ' ' for x in paramdict.keys()]
    l3 = 'R^2 = ' + str(round(model.rsquared, 3)) + ' f_p = ' + '{:0.0e}'.format(model.f_pvalue)
    plt.title(' '.join(l1) + '\n' + ' '.join(l2) + '\n' + l3, y=1.08)

In addition to taking the model itself as a parameter, plot3D needs the coefficients of the plane in 3-D space, because the single-variable models (model1 and model2) only give us an intercept and one slope; we need to insert a 0 for the slope in the missing dimension. So we have:

# Plane coefficients as [intercept, slope_Height, slope_Weight]; 0 fills the missing dimension
coeffs1 = [model1.params['const'], model1.params[xName1], 0]
coeffs2 = [model2.params['const'], 0, model2.params[xName2]]
coeffs3 = [model3.params['const'], model3.params[xName1], model3.params[xName2]]

Finally, we plot model1 and model2 in 3D space:

ax = fig.add_subplot(2, 3, 2, projection='3d')
plot3D(ax, model1, X1, X2, y, coeffs1, color1)

ax = fig.add_subplot(2, 3, 5, projection='3d')
plot3D(ax, model2, X1, X2, y, coeffs2, color2)

And we obtain:

Note that the 3-D plots here show the same fits as the 2-D plots above; we're simply drawing each regression line as a plane in 3-D. On the top, for example, we see that blood pressure is predicted to decrease with height, but this prediction doesn't care about weight at all: for any given value of height, traversing along the data in the 'Weight' direction won't change the prediction. The same is true, in reverse, for the graph on the bottom: moving along the 'Weight' direction changes the prediction, but moving along the 'Height' direction does not.
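To make the "flat in one direction" point concrete, here is a small sketch using coeffs1 from above (the fixed height and the weights are arbitrary illustrative values) showing that model1's plane returns the same prediction for any weight once the height is fixed:

# model1's plane: intercept + slope*height + 0*weight
fixed_height = 170   # hypothetical height, cm
for w in (50, 70, 90):   # hypothetical weights, kg
    pred = coeffs1[0] + coeffs1[1]*fixed_height + coeffs1[2]*w
    print('weight =', w, '-> predicted BP =', round(pred, 1))  # identical every time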

Finally, the multivariate regression:

# Top-right panel: the multiple regression fit by itself
ax = fig.add_subplot(2, 3, 3, projection='3d')
plot3D(ax, model3, X1, X2, y, coeffs3, 'green')

# Bottom-right panel: all three fits overlaid
ax = fig.add_subplot(2, 3, 6, projection='3d')
plot3D(ax, model1, X1, X2, y, coeffs1, color1)
plot3D(ax, model2, X1, X2, y, coeffs2, color2)
plot3D(ax, model3, X1, X2, y, coeffs3, 'green')
plt.tight_layout()
plt.show()

And we obtain:



The top plot shows the prediction of the multivariate regression by itself; the bottom plot shows the multivariate prediction overlaid with the two single-variable predictions from the graphs above. Note, first of all, that the multivariate prediction is *very* different from each of the single-variable predictions. Because the multivariate model can have a slope in both the 'Weight' and 'Height' directions, the model that minimizes the mean squared error (MSE) with two variables gives us a drastically different fit than a model that uses only a single variable.

Also note the R^2 here: 0.89, very close to a perfect fit. And, incredibly, the coefficients obtained for the prediction are almost the same as the coefficients we used when producing our data! When we originally made our data, we created it with an intercept of 160, a coefficient for height of -1, and a coefficient for weight of +1.5. The multiple regression model recovers these almost on the nose (intercept: 161.73, height: -0.99, weight: 1.42) despite the fact that the two independent variables were initially correlated with an R^2 of 0.55! Of course, when you have correlated independent variables the results won't always be this good, because multicollinearity is still a concern for multiple regression, but this model did a very good job despite the high correlation between height and weight.
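If you want to quantify how worried to be about multicollinearity in a case like this, one common diagnostic is the variance inflation factor (VIF), which statsmodels provides. Here's a minimal sketch using the XBoth matrix from above; the usual rule of thumb (which varies by source) is that VIF values above roughly 5-10 signal problematic collinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor, computed against the full design matrix (constant included)
exog = sm.add_constant(XBoth)
for i, col in enumerate(exog.columns):
    if col == 'const':
        continue  # the intercept's VIF isn't meaningful
    print(col, 'VIF =', round(variance_inflation_factor(exog.values, i), 2))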
