It is very common to mix up the concepts of association and interaction. Some people also assume that two variables must be associated before they can interact, but that is not true. In statistics, these terms have different implications for the relationship between variables, and the distinction matters most when making predictions with a regression or ANOVA model.
Before we jump to the difference between Association and Interaction, let us briefly review how variables are used in regression analysis.
Regression is a statistical technique for describing the relationship between a single dependent variable (the criterion) and one or more independent variables (the predictors). Regression produces a predicted value for the criterion from a linear combination of the predictors. In the scientific literature, regression analysis has two primary uses: prediction (including classification) and explanation.
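To make this concrete, here is a minimal sketch in Python. The data, the variable names, and the use of the statsmodels library are illustrative assumptions rather than anything from a particular study; the point is simply that the predicted criterion is a linear combination of the predictors.

```python
# Minimal sketch: predicted criterion scores from a linear combination of predictors.
# The data and variable names are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "predictor1": rng.normal(size=n),
    "predictor2": rng.normal(size=n),
})
df["criterion"] = 1.0 + 2.0 * df.predictor1 - 0.5 * df.predictor2 + rng.normal(size=n)

model = smf.ols("criterion ~ predictor1 + predictor2", data=df).fit()
predicted = model.predict(df)   # predicted criterion = linear combination of the predictors
print(model.params)             # intercept and regression weights
print(predicted.head())
```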
The preliminary step in regression analysis is to determine the criterion variable. The criterion should have acceptable measurement qualities, namely reliability and validity. After the criterion has been selected, the predictor variables must be identified; this is model selection. The purpose of model selection is to minimize the number of predictors while accounting for the maximum variance in the criterion. In other words, the most efficient model maximizes the coefficient of determination (R2). This coefficient estimates the proportion of variance in the criterion scores that is accounted for by a linear combination of the predictor variables. The greater the value of R2, the smaller the error (the unexplained variance) and hence the better the prediction. R2 is derived from the multiple correlation coefficient (R), which describes the relationship between the observed and the predicted criterion scores. When R equals 1.00, the prediction is perfect: there is no error and no unexplained variance (R2 = 1.00). When R equals 0.00, there is no relationship between the predictors and the criterion, and none of the variance in the scores is explained (R2 = 0.00); the chosen variables cannot predict the criterion. The purpose of model selection, as stated previously, is to build the model that yields the highest estimated value of R2.
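The relationship between R and R2 can be verified directly. In this small sketch (again on simulated data, an assumption for illustration), R is computed as the correlation between observed and predicted criterion scores, and its square matches the R2 reported by the fitted model.

```python
# Minimal sketch: R as the correlation between observed and predicted criterion
# scores, and R2 as its square, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df.x1 + 0.5 * df.x2 + rng.normal(size=n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
R = np.corrcoef(df.y, fit.fittedvalues)[0, 1]    # multiple correlation coefficient
print(f"R = {R:.3f}, R^2 = {R**2:.3f}, reported rsquared = {fit.rsquared:.3f}")
```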
According to seasoned researchers, the value of R is often overestimated, and the degree of overestimation is largely determined by the sample size: the larger the ratio of predictors to subjects, the greater the overestimation. To reduce it, use a large sample, with at least 20-30 subjects allocated to each predictor. The most effective way to determine the optimal sample size is a statistical power analysis.
Another effective way to determine the best model for prediction is to test the significance of adding another variable to the model. This can be done with a partial F-test. The partial F-test is similar to the F-test used in the analysis of variance; it assesses the statistical significance of the difference between the R2 values derived from two prediction models, one of which uses only a subset of the variables in the full equation.
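Here is a minimal sketch of that comparison on simulated data (the variable names and effect sizes are made up for illustration): fit a reduced and a full model, then turn the gain in R2 into a partial F statistic.

```python
# Minimal sketch of a partial F-test: is the gain in R2 from adding a predictor
# statistically significant? Data and variable names are simulated.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df.x1 + 0.5 * df.x2 + rng.normal(size=n)

reduced = smf.ols("y ~ x1", data=df).fit()        # subset of the variables
full = smf.ols("y ~ x1 + x2", data=df).fit()      # full equation

df1 = full.df_model - reduced.df_model            # number of predictors added
df2 = full.df_resid                               # residual degrees of freedom of the full model
F = ((full.rsquared - reduced.rsquared) / df1) / ((1 - full.rsquared) / df2)
p = stats.f.sf(F, df1, df2)
print(f"partial F = {F:.2f}, p = {p:.4f}")
```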
The techniques above are certainly useful for finding the most efficient model for prediction, but theory must also be taken into consideration when selecting the variables. Previous literature should be reviewed, and predictors should be chosen for which a relationship between criterion and predictor has already been established.
Assessment of the accuracy of the model is best accomplished by analyzing the standard error of estimate (SEE) and the percentage of the predicted mean represented by the SEE (SEE%). The SEE represents the degree to which the predicted scores deviate from the observed scores on the criterion measure, much like the standard deviation used in other statistical procedures. Lower values of SEE indicate more precise prediction. Comparing the SEE for different models fitted to the same sample makes it possible to determine the most accurate model for prediction. SEE% is calculated by dividing the SEE by the mean of the criterion, which makes it possible to compare models derived from different samples.
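A minimal sketch of both quantities, assuming an ordinary least squares fit on simulated data, looks like this:

```python
# Minimal sketch of the SEE and SEE%, using a model fitted to simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 80
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 10 + 2 * df.x1 - 1 * df.x2 + rng.normal(scale=2, size=n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
see = np.sqrt(fit.ssr / fit.df_resid)    # standard error of estimate: sqrt(SSerror / (n - p - 1))
see_pct = 100 * see / df.y.mean()        # SEE as a percentage of the criterion mean
print(f"SEE = {see:.2f}, SEE% = {see_pct:.1f}%")
```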
Once the most accurate and efficient model for prediction has been found, it is advisable to assess the model for stability. A model can only be called stable when it can be applied to different samples from the same population without losing prediction accuracy. This is achieved by cross-validating the model. Cross-validation, as the name suggests, tells us how well the developed prediction model performs in another sample drawn from the same population. Cross-validation can be done in more than one way: using two independent samples, splitting the sample, or using PRESS-related statistics developed from the same sample. Let us explore these separately to understand their application.
The use of two independent samples involves selecting two groups from the same population. The first group, the training or exploratory group, is used to establish the prediction model; the second, the confirmatory or validatory group, is used to assess the model's stability. Researchers compare the R2 values from the two groups and assess the "shrinkage": the difference between the two R2 values serves as an indicator of model stability. Although there is no firm rule of thumb for interpreting this difference, experienced researchers regard values of less than 0.10 as indicators of a stable model. Independent samples are used less often in practice because they add considerably to the cost of the research.
The next technique is cross-validation using split samples. Once the sample has been drawn from the population, it is randomly divided into two subgroups: an exploratory subgroup and a validatory subgroup. The same process is then followed: the R2 values are calculated in each subgroup, and model stability is judged by the "shrinkage", as discussed above. A sketch of this calculation is shown below.
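The sketch below uses simulated data (an illustrative assumption) and applies equally to the two-independent-samples approach: fit the model on the exploratory group, apply the same coefficients to the validatory group, and compare the R2 values.

```python
# Minimal sketch of split-sample cross-validation and "shrinkage" on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 5 + 1.0 * df.x1 + 0.8 * df.x2 + rng.normal(scale=1.5, size=n)

shuffled = df.sample(frac=1, random_state=0)
explore, validate = shuffled.iloc[: n // 2], shuffled.iloc[n // 2:]

fit = smf.ols("y ~ x1 + x2", data=explore).fit()      # model built on the exploratory subgroup

def r2(data, fitted):
    pred = fitted.predict(data)
    return 1 - np.sum((data.y - pred) ** 2) / np.sum((data.y - data.y.mean()) ** 2)

r2_explore, r2_validate = r2(explore, fit), r2(validate, fit)
shrinkage = r2_explore - r2_validate                  # < 0.10 is commonly read as a stable model
print(f"R2 exploratory = {r2_explore:.3f}, R2 validatory = {r2_validate:.3f}, shrinkage = {shrinkage:.3f}")
```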
The third technique uses PRESS-related statistics. It is a solution to the problem of data splitting when the sample is small: it addresses bias and cross-validates the model without setting data aside. The trick is to calculate the desired statistic many times, omitting one case from the calculation each time. For each individual, the difference between the observed value of the criterion and the value predicted by the equation fitted with that individual's data removed is calculated. The PRESS statistic is the sum of the squares of these residuals, and is analogous to the sum of squares for error (SSerror) used in the analysis of variance (ANOVA). The PRESS statistic can be used to calculate modified forms of R2 and the SEE.
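A minimal leave-one-out sketch of the idea, again on simulated data, looks like this:

```python
# Minimal sketch of the PRESS statistic: refit the model with each case left out,
# and sum the squared deleted residuals. Simulated data, small n to keep it quick.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 60
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 3 + 1.2 * df.x1 - 0.7 * df.x2 + rng.normal(size=n)

press = 0.0
for i in range(n):
    left_out = df.iloc[[i]]
    rest = df.drop(df.index[i])
    fit_i = smf.ols("y ~ x1 + x2", data=rest).fit()        # model without case i
    press += float((left_out.y.iloc[0] - fit_i.predict(left_out).iloc[0]) ** 2)

r2_press = 1 - press / np.sum((df.y - df.y.mean()) ** 2)    # predictive analogue of R2
see_press = np.sqrt(press / (n - 3))                        # PRESS-based analogue of the SEE (2 predictors + intercept)
print(f"PRESS = {press:.2f}, predictive R2 = {r2_press:.3f}, SEE(PRESS) = {see_press:.2f}")
```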
What are Association and Interaction and how are they similar and different?
Let us understand what Association is.
The name itself suggests that an association between two variables means that the value of one variable is in some way related to the value of the other. The most commonly used techniques for determining this are correlation for two continuous variables, and cross-tabulation with a chi-square test for two categorical variables.
There is, however, no single standard measure of association between one categorical and one continuous variable. The point-biserial correlation works only if the categorical variable is binary, but either a one-way analysis of variance or a logistic regression can test the association (depending on whether you think of the categorical variable as the independent or the dependent variable).
Primarily, association means the values of one variable generally co-occur with certain values of the other.
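Before moving on to interaction, here is a minimal sketch of the association tests mentioned above, run on simulated data with scipy (the data and variable names are purely illustrative):

```python
# Minimal sketch of association tests: Pearson correlation (two continuous),
# chi-square (two categorical), and point-biserial / one-way ANOVA
# (one binary categorical, one continuous). Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200

# Two continuous variables
a = rng.normal(size=n)
b = 0.6 * a + rng.normal(scale=0.8, size=n)
r, p = stats.pearsonr(a, b)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")

# Two categorical variables
g1 = rng.integers(0, 2, size=n)
g2 = (g1 + (rng.random(n) < 0.3)) % 2             # weakly related to g1
table = np.array([[np.sum((g1 == i) & (g2 == j)) for j in (0, 1)] for i in (0, 1)])
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# One binary categorical, one continuous
score = a + 0.5 * g1
r_pb, p_pb = stats.pointbiserialr(g1, score)
F, p_f = stats.f_oneway(score[g1 == 0], score[g1 == 1])   # one-way ANOVA tests the same association
print(f"point-biserial r = {r_pb:.2f} (p = {p_pb:.4f}), ANOVA F = {F:.2f} (p = {p_f:.4f})")
```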
Let us understand what Interaction is.
Interaction is different from association. Even if two variables are associated, that says nothing about whether they interact in their effect on a third variable. Likewise, if two variables interact, they may or may not be associated.
An interaction between two variables means the effect of one of those variables on a third variable is not constant—the effect differs at different values of the other.
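In a regression model, this is usually captured with a product term. Here is a minimal sketch on simulated data (variable names and effect sizes are assumptions for illustration), where the slope of X1 is deliberately made to differ between the two values of X2:

```python
# Minimal sketch of an interaction in a regression model: the X1 slope differs by X2.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
x2 = rng.integers(0, 2, size=n)                      # categorical moderator
x1 = rng.normal(size=n)
# The effect of x1 on y is 0.5 when x2 = 0 and 1.5 when x2 = 1
y = 1 + 0.5 * x1 + 1.0 * x2 + 1.0 * x1 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

model = smf.ols("y ~ x1 * x2", data=df).fit()        # x1 * x2 expands to x1 + x2 + x1:x2
print(model.params)                                  # the x1:x2 term captures the interaction
print(f"p-value for the interaction: {model.pvalues['x1:x2']:.4f}")
```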
In the case of a model, what do Association and Interaction indicate?
We will try to understand this better with the help of an example involving three hypothetical variables: X1, X2, and Y. We will look at three separate situations using these variables. X1 will be a continuous independent variable, X2 a categorical independent variable, and Y a continuous dependent variable. This is just one way to categorize them; in another situation, any of these variables could be either categorical or continuous.
Situation 1: Association without Interaction
In this first situation, X1 and X2 are associated: ignoring Y, the mean of X1 is lower when X2=0 than when X2=1. But in their effect on Y, they do not interact; the regression lines are parallel. X1 has the same effect on Y (the same slope) for both X2=1 and X2=0.
A simple example is the relationship between height (X1) and weight (Y) in male (X2=1) and female (X2=0) teenagers. There is a relationship between height (X1) and gender (X2). But for both genders, the relationship between height and weight is the same.
This situation can be handled by introducing control variables. Gender is the control variable here; if it were not introduced, a regression would fit a single line to all of these points and attribute all of the differences in the respondents' weights to differences in their heights.
Fitting a single line to all the points would also make the line steeper, and because of this, the unique effect of height on weight would be overestimated.
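A minimal simulation of this situation (all numbers are invented for illustration) shows the parallel slopes and the inflated slope when the control variable is omitted:

```python
# Minimal sketch of Situation 1 on simulated data: height and gender are associated,
# slopes are parallel, and omitting gender inflates the height slope.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 400
gender = rng.integers(0, 2, size=n)                               # X2: 0 = female, 1 = male
height = 160 + 12 * gender + rng.normal(scale=6, size=n)          # X1 associated with X2
weight = 0.5 * height + 8 * gender + rng.normal(scale=4, size=n)  # same slope in both groups
df = pd.DataFrame({"height": height, "gender": gender, "weight": weight})

with_control = smf.ols("weight ~ height + gender", data=df).fit()
without_control = smf.ols("weight ~ height", data=df).fit()
print(f"height slope with gender controlled: {with_control.params['height']:.2f}")
print(f"height slope without the control:    {without_control.params['height']:.2f}")  # steeper
```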
Situation 2: Interaction without Association
In this second scenario, there is no association between X1 and X2: the mean of X1 is the same for both categories of X2. But the way X1 affects Y differs between the two values of X2, which is precisely what defines an interaction. The slope of X1 on Y is greater for X2=1 than for X2=0; for X2=0 there is essentially no slope at all, and the line is nearly flat.
As an example, let X1 be the pretest score and Y the post-test score. Assume that participants have been randomly assigned to either a control condition (X2=1) or a training condition (X2=0).
If the randomization is done well, the assigned condition (X2) will have no relationship with the pretest score (X1). However, they do interact: the relationship between pretest and post-test differs between the two conditions. In the control condition, without the effect of training, there will be a high correlation between pretest and post-test scores. But in the training condition, if the training works well, the pretest scores will have little impact on the post-test scores.
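This pattern can be checked directly: test the association between condition and pretest, and test the interaction in the regression on the post-test. A minimal sketch on simulated data (all names and numbers are invented) follows:

```python
# Minimal sketch of Situation 2 on simulated pretest/post-test data: no association
# between condition and pretest, but a clear interaction on the post-test.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
n = 300
condition = rng.integers(0, 2, size=n)        # X2: 1 = control, 0 = training (randomly assigned)
pretest = rng.normal(50, 10, size=n)          # X1: independent of condition by randomization
# Control: post-test tracks pretest. Training: pretest barely matters.
posttest = np.where(condition == 1,
                    5 + 0.9 * pretest,
                    60 + 0.1 * pretest) + rng.normal(scale=5, size=n)
df = pd.DataFrame({"pretest": pretest, "condition": condition, "posttest": posttest})

# No association between condition and pretest (the randomization worked)
t, p_assoc = stats.ttest_ind(pretest[condition == 1], pretest[condition == 0])
# But the pretest slope differs by condition (interaction)
fit = smf.ols("posttest ~ pretest * condition", data=df).fit()
print(f"association p = {p_assoc:.3f}, interaction p = {fit.pvalues['pretest:condition']:.4f}")
```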
Situation 3: Both Interaction and Association
In the third situation, both association and interaction exist. There is an association between X1 and X2: again, the mean of X1 is lower when X2=0 than when X2=1. They also interact in their effect on Y: the slopes of the relationship between X1 and Y differ for X2=0 and X2=1, so X2 affects the relationship between X1 and Y.
A classic example: Y is the number of jobs available in a state, X1 is the share of the workforce that is eligible and employable with a degree, and X2 is whether the state is rural (X2=0) or urban (X2=1).
It is well understood that in rural states both the percentage of educated youth and the number of job opportunities are lower than in urban states. Moreover, in rural regions there is little relationship between the educational level of the workforce and the number of jobs available, whereas in urban regions there is. This is also the pattern you would see if the randomization in the previous example did not go well, or if randomization was not possible.
The distinction between interaction and association becomes easier to grasp as you analyze more data. As a researcher, it is advisable to look at your data with the eye of an explorer and to use graphs to better understand what is happening with your variables.