Tag Archives: correlation

WHAT SEEMS TO BE TRUE OFTEN ISN’T

The Motley Fool provides advice on money management and investing. However, its recommendations can and should be used by people in other fields. For example, the following 20-word tip, from the “Fool’s School,” should be memorized by everyone who encounters statistically-based claims or findings in politics, medicine, psychology, education, and all other arenas of our lives:

“Never blindly accept what you read. Think critically about not just words, but numbers. They’re not always what they seem.”

Here are 5 examples illustrating how numbers in statistics often do NOT mean what they seem to indicate:

Example A

If the 14 players on a basketball team have a median height of 6 feet 6 inches, it might seem that 7 of those athletes must be shorter than 6’6” whereas 7 must be taller than that. Wrong!
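The culprit is ties at the median. Here's a quick Python sketch with a hypothetical roster (the heights are invented for illustration) in which the median is exactly 6'6" yet only 4 players are shorter and 5 are taller:

```python
import statistics

# Hypothetical roster: heights in inches (6'6" = 78 inches)
heights = [74, 75, 76, 77, 78, 78, 78, 78, 78, 79, 80, 81, 82, 83]

print(statistics.median(heights))     # 78.0, i.e., 6'6"
print(sum(h < 78 for h in heights))   # 4 players shorter
print(sum(h > 78 for h in heights))   # 5 players taller
```

With 14 players, the median is the mean of the 7th and 8th sorted heights; when several players share that height, the "7 below, 7 above" intuition falls apart.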

Example B

If the data on 2 variables produce a correlation of +.50, it might seem that the strength of the measured relationship is exactly midway between being ultra weak and ultra strong. Not so!

Example C

If a carefully conducted scientific survey indicates that Candidate X currently has the support of 57% of likely voters with a margin of error of plus or minus 3 percentage points, it might seem that a duplicate survey conducted on the same day in the same way would show Candidate X’s support to be somewhere between 54% and 60%. Bad thought!
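A small simulation makes the point. Assuming (hypothetically) a true support level of 57% and a sample of about 1,067 voters (roughly the size needed for a ±3-point margin of error), the chance that a second, identically conducted poll lands within ±3 points of the first is well below 95%:

```python
import random

random.seed(42)
n = 1067        # sample size giving roughly a +/- 3-point margin of error
p_true = 0.57   # assumed true support level (hypothetical)
trials = 2000

def poll():
    """Simulate one poll: the observed proportion supporting Candidate X."""
    return sum(random.random() < p_true for _ in range(n)) / n

within = sum(abs(poll() - poll()) <= 0.03 for _ in range(trials))
print(within / trials)   # a value near .84 -- noticeably short of .95
```

The margin of error describes uncertainty around one poll's estimate; it does not promise that a replication will land inside the first poll's interval.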

Example D

If a null hypothesis is tested and the data analysis indicates that p = .02, it might seem that there’s only a 2% chance that the null hypothesis is true. Nope!

Example E

If, in a multiple regression study, the correlation between a particular independent variable and the dependent variable is r = 0.00, it might seem that this independent variable is totally useless as a predictor. Not necessarily!
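This is the classic "suppressor variable" situation. The sketch below builds artificial data (invented purely for illustration) in which one predictor is essentially uncorrelated with Y on its own, yet combined with another predictor it reproduces Y exactly:

```python
import random
import statistics

random.seed(1)
n = 10_000
t  = [random.gauss(0, 1) for _ in range(n)]   # the "signal"
x2 = [random.gauss(0, 1) for _ in range(n)]   # suppressor: unrelated to y
x1 = [a + b for a, b in zip(t, x2)]           # predictor contaminated by x2
y  = t

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    su, sv = statistics.pstdev(u), statistics.pstdev(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n * su * sv)

print(round(corr(x2, y), 3))   # ~0.0  -- useless by itself
print(round(corr(x1, y), 3))   # ~0.71 -- alone, explains only ~50%
# Yet together the two predictors reproduce y exactly: y = x1 - x2
print(max(abs(a - (b - c)) for a, b, c in zip(y, x1, x2)))  # ~0 (float noise)
```

By removing the irrelevant "noise" that x2 contributed to x1, the zero-correlation predictor boosts the regression from 50% explained variability to essentially 100%.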

The Motley Fool’s admonition, quoted above, contains 20 words. If you can’t commit the entirety of this important warning to memory, here’s a condensed version of it:

“Numbers. They’re not always what they seem.”

Filed under Mini-Lessons, Misconceptions

INTERPRETING CORRELATIONS

If data exist on 2 variables (X & Y), the square of the correlation coefficient is called the “coefficient of determination.” This latter coefficient, if multiplied by 100, indicates the % of variability in either variable that’s associated with (or explained by) variability in the other variable.

For example, if r = .80, 64% of the variability in X is associated with variability in Y. Or, if r = –.40, 16% of the variability in X is associated with variability in Y.

To have at least 50% “explained variability,” the absolute value of the correlation must exceed .7071 (the square root of .50).

This number, .7071, is worth remembering because many researchers report that a correlation is “moderate” or has “medium strength” if r is near ±.50. In reality, such correlations are not so strong; they indicate that only about 25% of the variability in Y is associated with variability in X.
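A few lines of Python make these numbers concrete:

```python
import math

def explained_pct(r):
    """Coefficient of determination, expressed as a percentage."""
    return 100 * r * r

print(round(explained_pct(0.80), 2))   # 64.0
print(round(explained_pct(-0.40), 2))  # 16.0 (the sign of r doesn't matter)

# The correlation needed for 50% explained variability:
print(round(math.sqrt(0.5), 4))        # 0.7071
```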

Filed under Important Numbers

TATTOOS, MUSCLE SHIRTS, & BIKINI SWIM SUITS

QUESTION: What do you call a group of middle-aged adults standing in an orderly fashion waiting for the start of a store’s sale of bikini swim suits, muscle shirts, & tattoos?

ANSWER: A regression line!

(NOTE: This little effort at statistical humor comes from S. Huck)

Beyond the Joke:

In statistics, a regression line is typically used to predict a person’s status on a criterion variable of interest. This prediction is based upon that person’s status on a different variable that hopefully serves as a good predictor. For example, a regression line could be used to predict how high a college applicant’s GPA will be at the end of his or her first year in college based upon that person’s score on a college entrance exam. Or, a regression line could be used to predict a person’s systolic blood pressure based upon his or her BMI (body mass index).

To develop a regression line, a group of people initially must be measured on both the predictor variable and the criterion variable. After these data are used to identify the regression line, predictions can be made for new people who have scores on the predictor variable but are yet to be measured on the criterion variable. (It’s important, of course, that the new people for whom predictions are made do not differ in dramatic ways from those used to develop the regression line.)

Using data from the initial group of people, a regression line can be displayed visually in a bivariate scatter plot. In such a picture, the criterion variable (often called the dependent variable) is positioned on the Y-axis while the predictor variable (typically called the independent variable) is put on the X-axis. Each data point indicates a given individual’s scores on the two variables. To illustrate, the following scatter plot shows data for a hypothetical group of 19 students measured in terms of how long they studied for an essay exam and how well they did once the exam was scored.

The position of a regression line in this or any other scatter plot is determined by analyzing the data to identify two properties of the sought-after line: its slope and the place where the line passes through the Y-axis (the Y-intercept). These two features of the regression line are computed so as to minimize the sum of the squared vertical distances between the data points and the line. Because of this, the regression line, once determined, is considered to be the “best-fitting straight line.”

To use the regression line for predictive purposes, a new person is identified for whom there exists a score only on the predictor variable. First, that score is located on the scatter plot’s X-axis. We start at that point, move in a vertical direction until reaching the regression line, and then move to the left in a horizontal fashion until ending up on the Y-axis. That “destination point” represents the predicted score on the criterion variable. For example, we’d predict that a new student who studies only 1 hour for the exam will receive an exam score of 2.

Instead of making predictions via the scatter plot’s regression line, it’s possible to accomplish the same objective by using a formula. Any regression line can be converted into a formula that has this form:

predicted Y-score  =  Y-intercept + (slope)(observed X-score)

Using this formula, we predict that a person who studies 1 hour will earn an exam score equal to 1.5 + (0.5)(1) = 2.
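In Python, the prediction formula for this example (using the intercept of 1.5 and slope of 0.5 given above) is just:

```python
# Regression line from the study-time example: Y-intercept 1.5, slope 0.5
intercept, slope = 1.5, 0.5

def predict(hours_studied):
    """Predicted exam score for a given number of hours of study."""
    return intercept + slope * hours_studied

print(predict(1))   # 2.0 -- matches the graphical prediction
print(predict(4))   # 3.5
```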

The degree to which the regression line can make accurate predictions is influenced by the correlation between the predictor and criterion variables. To the extent that the correlation is high, data points in the scatter diagram will lie closer to the regression line, thus increasing predictability (so long as the “new” people for whom predictions are made resemble those in the initial group). Typically, the square of the correlation coefficient is used as an indicator of how well the regression line will work. For the data in the accompanying scatter plot, r = 0.50 and r-squared = 0.25. This means that 25% of the variability in the exam scores is associated with (i.e., explained by) study time.

If you’d like to watch a good, 6-minute tutorial on the basic concept of a regression line and how it helps with prediction, click this link:

Filed under Jokes & Humor, Mini-Lessons

EVEN NOBEL LAUREATES MAKE MISTAKES

In his 2011 book entitled “Thinking, Fast and Slow,” Daniel Kahneman stated (on page 181):

“The correlation coefficient between two measures, which varies between 0 and 1, is a measure of the relative weight of the factors they share.”

Evidently, Kahneman truly did think that correlation coefficients must land on a continuum that extends from 0 to 1. That’s because he tried to help his readers “appreciate the meaning of the correlation measure” by presenting these 5 examples:

• The correlation between the size of objects measured with precision in English or in metric is 1.
• The correlation between self-reported height and weight among adult American males is .41.
• The correlation between SAT scores and college GPA is approximately .60.
• The correlation between income and educational level in the United States is approximately .40.
• The correlation between family income and the last four digits of their phone number is 0.

Note that each of these examples contains a correlation coefficient with a value somewhere between 0 and 1, inclusive. Not one example of a negative correlation (e.g., –.80) was provided.

There are 2 ways to correct Kahneman’s inaccurate sentence containing the words: “…varies between 0 and 1.” One obvious option is to change 0 to –1. The second option is to not change the 0, but instead to add these 3 words at the beginning: “The square of….”

If we square a correlation coefficient, we produce something called the “coefficient of determination.” (For example, if the correlation between 2 variables, X and Y, is –.50, the coefficient of determination is .25.) If we now change the coefficient of determination into a percentage, we get the percentage of variability in the X variable that is associated with, or explained by, variability in the Y variable. Note that this works just as well with negative correlations as it does with positive correlations.
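A one-line check confirms that the sign is irrelevant once the correlation is squared:

```python
# Squaring makes the sign irrelevant: r and -r explain the same share
for r in (0.50, -0.50):
    print(r, "->", round(r * r, 2))   # both print 0.25
```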

Filed under Mini-Lessons

A STATISTICAL ODDITY

The standard deviation is usually smaller than the variance, because the former is the square root of the latter and most real-world variances exceed 1. For example, if the variance is 25, SD = 5. But consider what happens if the variance is between 0 and 1. In these fully legitimate situations, SD > variance.

Do real data ever produce variances and SDs that are smaller than 1.00? Most certainly! Here are 5 examples:

•  Statements in an attitude inventory typically are set up in a Likert format, with response-options extending from “strongly agree” to “strongly disagree” and scored 1 through 5. The variance and SD of the responses to any given item usually are < 1.00.
• The correlation between 2 variables is often computed separately for subgroups, with the mean r reported along with the SD or variance of the correlation coefficients. This SD or variance must be < 1.00.
• Both the SD and variance of a set of proportions necessarily will turn out to be < 1.00.
• In test-development efforts, the mean item discrimination for subgroups of items (or for all items) is often reported, along with the SD or variance of these item indices. Because item discrimination can range from 0 to 1, the variability of a set of these values must be < 1.00.
• In meta-analysis investigations that combine the results of different factor analytic studies, the mean factor loading is sometimes reported for each factor along with the SD or variance (that must be < 1.00) of the given factor’s loadings.
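A quick check with hypothetical Likert-style responses (invented for illustration) shows the oddity in action:

```python
import statistics

# Hypothetical responses to one Likert item, scored 1-5
responses = [3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 3, 4]

var = statistics.pvariance(responses)
sd = statistics.pstdev(responses)
print(var, sd)   # variance is below 1, so the SD exceeds the variance
```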

Filed under Mini-Lessons

CAN SOMETHING THAT’S TINY BE SIGNIFICANT?

Correlation coefficients can assume values between −1.00 & +1.00. Usually, researchers compute a correlation and then check to see if it’s significantly different from 0.00. Recently, a published report displayed several correlations, each of which had been tested to see if it was significant. One of the computed correlations was equal to 0.01. Amazingly, it was significantly different from 0.00, with p < .05. How could this be? The sample size was gigantic (27,687), that’s how. The enormous amount of data allowed the researchers to say, correctly, that the r of 0.01 was statistically significant. However, it clearly had no practical significance whatsoever.

The moral here should be clear. If you are on the receiving end of a researcher’s statistical summary, and if he or she points out that 1 of the study’s correlations is “significant,” don’t let that single fact cause you to think that a strong relationship has been uncovered. Weak relationships can be statistically significant if n is massive. And if you, yourself, are the researcher who has collected data and done the statistical analysis, don’t look only at the magnitude of p and then get excited if it’s small enough to beat your alpha level. If you fail to pay attention to the actual size of the correlation coefficient (or, better yet, to the size of r2), you soon may find yourself being accused—legitimately—of “making a mountain out of a molehill.”
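To see how far the significance bar drops as n grows, here's a rough sketch using the large-sample normal approximation (the exact critical value comes from the t distribution, but for large samples the two nearly coincide):

```python
import math

def approx_critical_r(n, z=1.96):
    """Approximate |r| needed for two-tailed p < .05 (normal approximation)."""
    return z / math.sqrt(n - 2)

for n in (30, 100, 1000, 27687):
    print(n, round(approx_critical_r(n), 4))
```

With n = 27,687, a correlation barely above .01 clears the significance hurdle, even though its r-squared says it explains almost none of the variability.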

Filed under Mini-Lessons

STARTING POSITIONS AND ORDER OF FINISH IN THE INDY 500

There usually are 33 cars in the Indianapolis 500 auto race. At the start, these cars are arranged in 11 rows, 3 per row. Starting positions are based on the cars’ demonstrated speeds during time trials, with the fastest cars put up front. You might think that the cars’ order of finish should approximate their starting positions. Think again! In 2012, only one-third of the cars had end-of-race ranks within 3 spots of their starting positions. Considering all 33 cars, the correlation between starting position and order of finish was a modest 0.40. Over the past 25 years, this correlation has had an average (i.e., mean) value of only 0.30.