Tuesday, January 8, 2019

Multiple regression demo

I've been doing some data science work recently, and I've found myself in a position where I need to understand the effects of multiple independent variables on a single dependent variable. One of the most commonly used techniques for examining these relationships is multiple regression, but I've found surprisingly few good tutorials online that explain how it works.

For this demonstration, I will focus on the relationship between height, weight, and blood pressure. I created a completely artificial data set for illustrative purposes. The demonstration uses Python and Statsmodels; your mileage may vary with other packages. First, let's import the relevant packages:

import pandas as pd
import numpy as np 
import statsmodels.api as sm
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
from pylab import rcParams
plt.ion()

The first thing I want to do is create the dataset. I am intentionally forcing height and weight to be correlated in order to demonstrate the performance of multiple regression on correlated data.

N = 100
xName1 = 'Height cm'
xName2 = 'Weight kg'
yName = 'Blood Pressure (sys)'
df = pd.DataFrame(columns = [xName1, xName2, yName])
df[xName1] = [np.random.randint(140, 200) for i in range(N)]              # random heights, 140-199 cm
df[xName2] = df[xName1]/2 + 8 * (np.random.randn(N)-0.5)                  # weight = height/2 plus Gaussian noise
df[yName] = -1*df[xName1] + 1.5*df[xName2] + 5*(np.random.randn(N)-0.5) + 160  # blood pressure from height and weight, plus noise
X1 = df[xName1]
X2 = df[xName2]
XBoth = df[[xName1, xName2]]
y = df[yName]

Note that I initialize the first column, X1 (height), as a set of 100 random integers between 140 and 199 (np.random.randint excludes the upper bound). The second column, X2 (weight), is the height divided by 2 plus some Gaussian noise. The final column, y (blood pressure), is created by applying the formula:
-1 * height + 1.5 * weight + noise + 160
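
For a quick numeric sanity check of the induced correlation before any plotting, np.corrcoef (just one convenient way to do this) reports it directly; the exact value will vary from run to run:

print(np.corrcoef(df[xName1], df[xName2])[0, 1])  # Pearson correlation of height and weight
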
To visualize the correlation between height and weight, I wrote a custom function, plot2D, which we'll reuse throughout:

def plot2D(model, dataX, dataY, color):
    # Scatter the data and draw the fitted regression line, with the
    # coefficients, p-values, and R^2 displayed in the title.
    xName = dataX.name
    yName = dataY.name
    params = [float(x) for x in model.params]
    intercept = params[0]
    slope = params[1]
    prediction = slope*dataX + intercept
    plt.plot(dataX, prediction, color)
    plt.scatter(dataX, dataY, edgecolors='k', facecolors='none')
    plt.xlabel(xName)
    plt.ylabel(yName)
    pvaldict = model.pvalues
    paramdict = model.params
    l1 = ['p_' + str(x) + ' = ' + str('{:0.0e}'.format(pvaldict[x])) + ' ' for x in pvaldict.keys() ]
    l2 = ['Coef_' + str(x) + ' = ' + str(round(paramdict[x],2)) + ' ' for x in paramdict.keys()]
    l3 = 'R^2 = ' + str(round(model.rsquared,3)) + ' f_p = ' + str('{:0.0e}'.format(model.f_pvalue))
    plt.title(' '.join(l1) + '\n' + ' '.join(l2) + '\n' + l3)

Then I created a single-variable regression model in statsmodels to predict weight from height, and plotted the regression line and data points using the plot2D function above:

modelcorr = sm.OLS(X2, sm.add_constant(X1)).fit()
plt.figure('Corr')
plot2D(modelcorr, X1, X2, 'gray')
plt.tight_layout()
plt.show()

Note that I have to add a constant to X1 in order to get a regression with an intercept; sm.OLS does not include one automatically.
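
The sm.add_constant call just prepends a column of ones, labeled 'const', to the data, and the coefficient fitted for that column becomes the intercept (this is also why a p_const and Coef_const show up in the plot titles). A quick way to see it:

print(sm.add_constant(X1).head())  # a 'const' column of 1.0s next to 'Height cm'

Running the regression and plotting it as above, I obtain the following plot: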


Note the high R^2 value (0.554) in the title. I created the 'weight' variable by dividing 'height' by a constant and adding a bit of noise, so this is what we should expect. The p-value of the model (f_p) is also quite low, indicating that we can reject the null hypothesis that Height and Weight are uncorrelated. The coefficient of the regression obtained here (Coef_Height cm) is 0.53, very close to the 0.5 we used when creating the weight from the height - as it should be. (For the moment, ignore the other values in the title; we'll get back to those.)
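
All of these numbers come straight from the fitted statsmodels results object, so if you'd rather read them from a table than from a plot title, you can print them directly; summary() is fairly verbose, and the individual attributes below are the same ones plot2D uses:

print(modelcorr.summary())   # full regression table

print(modelcorr.params)      # intercept ('const') and slope ('Height cm')
print(modelcorr.pvalues)     # per-coefficient p-values
print(modelcorr.rsquared)    # R^2
print(modelcorr.f_pvalue)    # p-value of the overall F-test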

What we want to do now is see how Height and Weight affect blood pressure. First, we'll perform separate single regressions for Height vs. Blood pressure (model1) and Weight vs. Blood pressure (model2). Then we'll perform a multiple regression, predicting blood pressure from both weight and height together. To create the models:

model1 = sm.OLS(y, sm.add_constant(X1)).fit()
model2 = sm.OLS(y, sm.add_constant(X2)).fit()
model3 = sm.OLS(y, sm.add_constant(XBoth)).fit()

Now we want to visualize these models and look at their stats. We'll start with the single-variable models, model1 and model2.

color1 = 'r'
color2 = 'b'

fig = plt.figure('Demo', figsize = (24,15))
ax = fig.add_subplot(2, 3, 1)
plot2D(model1, X1, y, color1)

ax = fig.add_subplot(2, 3, 4)
plot2D(model2, X2, y, color2)

We obtain these plots:

Note that while we do observe some correlation for both Height vs. Blood Pressure and Weight vs. Blood Pressure individually, the R^2 values here are relatively low (0.091 and 0.14, respectively).
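
If you just want those numbers without drawing the plots, they can be read straight off the fitted models:

print(model1.rsquared)   # Height vs. Blood Pressure
print(model2.rsquared)   # Weight vs. Blood Pressure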

Because our data is, in fact, 3-dimensional, I'll show the same data and regression lines in a 3-D plot. In 3-D, these regression lines actually become planes, each with a slope in the direction of only a single variable. To plot the 3D data, we'll need a new function.

def plot3D(ax, model, dataX1, dataX2, dataY, coeffs, color):
    # Scatter the 3-D data and draw the regression surface defined by coeffs,
    # with the model's coefficients, p-values, and R^2 in the title.
    xName1 = dataX1.name
    xName2 = dataX2.name
    yName = dataY.name
    meshx = np.linspace(min(dataX1),max(dataX1),55)
    meshy = np.linspace(min(dataX2),max(dataX2),55)
    X,Y = np.meshgrid(meshx,meshy)
    Z = coeffs[0] + coeffs[1]*X + coeffs[2]*Y
    ax.plot_surface(X, Y, Z, color = color, alpha = 0.75)
    ax.scatter3D(dataX1, dataX2, dataY, color = 'black')
    ax.set_xlabel(xName1)
    ax.set_ylabel(xName2)
    ax.set_zlabel(yName)
    pvaldict = model.pvalues
    paramdict = model.params
    l1 = ['p_' + str(x) + ' = ' + str('{:0.0e}'.format(pvaldict[x])) + ' ' for x in pvaldict.keys() ]
    l2 = ['Coef_' + str(x) + ' = ' + str(round(paramdict[x],2)) + ' ' for x in paramdict.keys()]
    l3 = 'R^2 = ' + str(round(model.rsquared,3)) + ' f_p = ' + str('{:0.0e}'.format(model.f_pvalue))
    plt.title(' '.join(l1) + '\n' + ' '.join(l2) + '\n' + l3, y=1.08)

In addition to taking the model itself as a parameter, plot3D needs the coefficients of the plane in 3-D space, because the single-variable models (model1 and model2) only give us an intercept and one slope; we need to add a 0 as the slope in the missing dimension. So we have:

coeffs1 = [model1.params[0], model1.params[1], 0]
coeffs2 = [model2.params[0], 0, model2.params[1]]
coeffs3 = [model3.params[0], model3.params[1], model3.params[2]]
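
Written out, the three planes that plot3D will draw are:

predicted blood pressure (model1) = coeffs1[0] + coeffs1[1] * height + 0 * weight
predicted blood pressure (model2) = coeffs2[0] + 0 * height + coeffs2[2] * weight
predicted blood pressure (model3) = coeffs3[0] + coeffs3[1] * height + coeffs3[2] * weight

The only difference between them is which slopes are forced to zero.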

Finally, we plot model1 and model2 in 3D space:

ax = fig.add_subplot(2, 3, 2, projection='3d')
plot3D(ax, model1, X1, X2, y, coeffs1, color1)

ax = fig.add_subplot(2, 3, 5, projection='3d')
plot3D(ax, model2, X1, X2, y, coeffs2, color2)

And we obtain:

Note that the 3-D plots here show the same models as the 2-D plots above; we're simply drawing each regression line in 3-D. On the top, for example, we see that blood pressure is predicted to decrease with height, but this prediction doesn't take weight into account at all - for any given value of height, moving along the data in the 'Weight' direction won't change the prediction. The same goes for the graph on the bottom: moving along the 'Weight' direction changes the prediction, but moving along the 'Height' direction does not.
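
A quick numeric way to see this flatness (the specific height and weight values below are arbitrary, chosen only for illustration):

h = 170  # an arbitrary height in cm
# model1's plane has a zero slope in the 'Weight' direction, so the predicted
# blood pressure is the same whether we plug in 50 kg or 100 kg:
print(coeffs1[0] + coeffs1[1]*h + 0*50)
print(coeffs1[0] + coeffs1[1]*h + 0*100)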

Finally, the multivariate regression:

ax = fig.add_subplot(2, 3, 3, projection='3d')
plot3D(ax, model3, X1, X2, y, coeffs3, 'green')

ax = fig.add_subplot(2, 3, 6, projection='3d')
plot3D(ax, model1, X1, X2, y, coeffs1, color1)
plot3D(ax, model2, X1, X2, y, coeffs2, color2)
plot3D(ax, model3, X1, X2, y, coeffs3, 'green')
plt.tight_layout()
plt.show()

And we obtain:



The top plot shows the prediction of the multivariate regression by itself; the bottom plot shows the multivariate prediction overlaid with the two single-variable predictions from the graphs above. Note, first of all, that the multivariate prediction is *very* different from each of the single-variable predictions. Because the multivariate model can have a slope in both the 'Weight' and 'Height' directions, the model that minimizes the mean squared error (MSE) with two variables gives us a drastically different fit than a model that uses only a single variable. Also note the R^2 here: 0.89, very close to a perfect fit.

And, incredibly, the coefficients obtained for the prediction are almost the same as the coefficients we used when producing our data! When we originally made our data, we created it with an intercept of 160, a coefficient for height of -1, and a coefficient for weight of +1.5. The multiple regression model recovers these almost on the nose (intercept: 161.73, height: -0.99, weight: 1.42), despite the fact that the two independent variables were correlated with an R^2 of 0.55! Of course, when you have correlated independent variables the results won't always be this good, because multicollinearity is still a concern for multiple regression, but this model did a very good job despite the high correlation between height and weight.
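
If you want to check the recovered coefficients yourself, model3.params holds them. And as a rough, optional diagnostic for the multicollinearity concern, statsmodels provides a variance inflation factor (VIF) function; this is a minimal sketch of that extra check, not something the demo above relies on:

# Fitted coefficients vs. the generating values (160, -1, 1.5); the exact
# numbers will differ from run to run because the noise is random.
print(model3.params)

# Variance inflation factors for the two predictors; a VIF well above ~5-10
# is usually taken as a sign of problematic collinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
exog = sm.add_constant(XBoth)
for i, name in enumerate(exog.columns):
    if name == 'const':
        continue  # the constant's VIF isn't meaningful
    print(name, variance_inflation_factor(exog.values, i))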

Monday, January 7, 2019

Maharal on Study vs. Practice

In classical rabbinic sources, one can find a tension between the value of the study of Torah vs. the value of practice - i.e. the observance of commandments. The Talmud (Moed Katan 9b) discusses the question of whether a person should interrupt the study of Torah in order to perform one of the commandments, if such an opportunity presents itself. The Talmud concludes that in general, one should only pause his study for the performance of a religious obligation that he can only do himself, but not for an obligation which can be done by others. So you shouldn't interrupt your Torah study to run outside and help an old lady cross the street if there are other people around, but if you're the only one there then you are obligated to get up from your desk and go outside.

The reason for this, Maharal writes, is that - on the one hand - engagement with the Torah is an encounter with the intellectual/eternal/divine, which is far superior to any sort of interaction in the temporal physical world. On the other hand, when a person is faced with an obligation that only he can fulfill, he must fulfill it, otherwise he is lacking in something basic to his humanity. Here Maharal references the Talmudic maxim that the 248 positive commandments correspond to the "248 limbs of the human body". Failure to fulfill one's obligation is tantamount to missing a limb, which is something that a person has to rectify before they engage with higher spiritual realms through the study of Torah.

Maharal cites another passage about the tension between study and practice, this time from Kiddushin 30b. The Talmud relates three views pertaining to the question: "Which is greater, study or practice?"

Rabbi Tarfon: Practice is greater.
Rabbi Akiva: Study is greater.
The sages: Study is greater, because study leads to practice.



The Talmud then cites the view of Rabbi Yossi, who seemingly agrees with Rabbi Akiva:

"Rabbi Yossi said: Great is the value of Torah, because the Torah was given 40 years before [the relevance of the commandment of] separating challah (a portion of the dough that is set aside and given to a priest when baking bread), 54 years before the commandment of taking tithes (portions of raw grain that need to be given to the priest, Levite, or poor people), and 104 years before  the Jubilee year (the 50th year after 7 cycles of 7 sabbatical years, at which point slaves are freed and land returns to its ancestral owners)." 
In other words, the Torah was given in the dessert, long before many of the commandments were relevant from the standpoint of practical observance, because many commandments assume the presence of the agriculture economy in the land of Israel, which wasn't in full swing until many years after the people of Israel left the dessert and entered Canaan.

Maharal explains each of the views presented, starting with Rabbi Tarfon. Similar to his comment before, Maharal explains Rabbi Tarfon's view as relating to the fact that the observance of commandments is a basic need; something man needs to do in order to attain completion as a human being. What about Torah study? Maharal here gives an analogy, comparing the observance of commandments to bread and Torah to wine. Even though wine is the more expensive and more highly prized product, says the Maharal, bread is still the basic staple of human sustenance. In the same way, even though Torah study may be more valuable in that it allows man to attain intellectual heights, the observance of commandments is more essential. Rabbi Akiva, on the other hand, attributed more importance to the "wine" - the study of Torah - which connects man to the world of the eternal.

The view of the sages, in the Maharal's interpretation, is simply that the Torah has two benefits: not only does it connect man to the spiritual/intellectual plane, it also teaches him the proper way to act in terms of the observance of the commandments. So in this sense Torah study isn't just "wine"; it has elements of the "bread" as well.

Finally we arrive at the view of Rabbi Yossi, which the Maharal elaborates upon in greater detail. Maharal first notes that all the commandments mentioned by Rabbi Yossi have the element of "sanctity", that is, challah, tithes, and the Jubilee year are all called "holy" at various points in the written Torah. Maharal claims that these commandments are exemplars of different aspects of the physical world that can attain sanctity.

Challah, which is an obligation that pertains to flour and water that have been mixed together to become dough, is analogous to the matter we encounter on a regular basis - that is, complex matter, composed of multiple atomic parts. Food, paper, trees, humans, etc. all consist of multiple "simples" combined together, just like dough consists of flour and water. In fact, the Midrash explicitly compares man to dough; God's making of man from the earth in Genesis is seen by Bereishit Rabbah as akin to a woman kneading flour and water together to make dough. However, man is more than just a complex object; he is a sanctified complex object, comparable to the challah which is separated from the dough after it is kneaded and declared holy.

Tithing, which is an obligation pertaining to raw grain, corresponds to simple, atomic objects, just like raw grain. The commandment to separate tithes demonstrates the ability of atomic objects to also become sanctified.

Finally, the Jubilee year is an example of temporal - as opposed to material - sanctification. The Jubilee comes as the 50th year after 7 of the 7-year sabbatical cycles, the sabbatical cycles representing temporal existence and the 50th year representing something that connects to temporal existence but also surpasses it, as the Jubilee year is not considered part of the 7-year cycle. So the Jubilee year has one foot inside of time and one foot outside of time, so to speak.

According to Maharal, Rabbi Yossi has moved from complex objects (challah) to simple/atomic objects (tithes) to the edge of temporal existence itself (the Jubilee year). However, Torah precedes all of these things, because it exists entirely in the intellectual domain, and is thus superior to anything that is constrained by matter and time.