DoW #6: TVs and Life Expectancies
For this week’s DoW, you will explore the question:

Is there a relationship between life expectancy and the number of people per TV for a country?

The Excel file, TV Life, contains data for the variables Life Expectancy and People per TV for a sample of 22 countries. We will analyze and interpret this data throughout this week’s investigations.

In Investigation 1, you will post your responses to Exercise B4 by Wednesday, 10 PM EST, and follow up by Friday, 10 PM EST.

In Investigation 2, you will post your responses to Exercise E5 by Saturday, 10 PM EST, and follow up by Sunday, 10 PM EST.

Investigation 1: Measuring Association
In this investigation, we look at the concept of association (the relationship between two variables) and at ways to identify and measure that relationship in quantitative bivariate data. We will look at scatterplots and the correlation coefficient.
Inv 1, Activity A: Seeing the Association
Exercise A1: Complete the Annenberg Series for Session 7, Parts A, B, and C. (We will complete Part D in Investigation 2, but you can do it here if you prefer.) Reflect on the following questions in your journal:

How does the contingency table (also called a two-way table) show the relationship seen in the scatter plot?
The height=armspan line is also called the y=x line (height is the y-axis variable, armspan is the x-axis variable). What does it mean if a point is above this line? Below this line?
Exercise A2: Analyze the data for DoW #6 in your calculator or on an applet. Record your answers in your journal:

What are the variables? Are they quantitative or categorical?
Create a scatterplot for the data in DoW #6, with the variable People per TV on the x-axis.

Describe the relationship you see in the data (if any).
Are there any points on the scatterplot that do not seem to follow the general trend of the data?  If so, what are they and why do they seem “different”?

Inv 1, Activity B: Describing Association
We use the term association to refer to a relationship between two variables in which information about one variable reveals information about the other. In this investigation we will look at the association between two quantitative variables.

Associations can be positive or negative, and they can be strong or weak.

Two variables have a:

Positive association if larger values of one variable tend to occur with larger values of the other variable. So, the two variables tend to increase (or decrease) together.
Negative association if larger values of one variable tend to occur with smaller values of the other variable. So, as one variable tends to increase, the other tends to decrease.

Two variables have a:

Strong association if observations tend to closely follow the pattern (of positive association or of negative association). With a stronger association, one could more accurately use one variable to “predict” values of the other variable.
Weak association if observations tend to follow the pattern more loosely. With a weaker association, predictions may not be as accurate as they would be with a strong association.
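
The definitions above can be tied to the correlation coefficient with a short calculation. The following is a minimal Python sketch, not part of the course materials: the function name `correlation` and all of the numbers are made up for illustration and are not the DoW #6 data. The sign of r gives the direction of a linear association, and values closer to 1 or -1 indicate a stronger one.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative values, NOT the DoW #6 data.
x = [1, 2, 3, 4, 5]
strong_positive = [2, 4, 6, 8, 10]   # points fall exactly on a rising line
strong_negative = [10, 8, 6, 4, 2]   # points fall exactly on a falling line
weak_positive = [3, 1, 4, 2, 5]      # rising overall, but loosely

r_pos = correlation(x, strong_positive)
r_neg = correlation(x, strong_negative)
r_weak = correlation(x, weak_positive)
print(r_pos, r_neg, r_weak)
```

Real data will rarely fall exactly on a line, so expect values of r strictly between -1 and 1.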
The table below shows the four combinations of positive and negative, strong and weak associations as they might appear in scatter plots, as well as one with nearly no association.

 

Exercise E1: Return to your work for DoW #6

Add a best fit line to the scatter plot you made in Exercise A2. Record the equation of this line. What do x and y represent in this equation? Here’s a video tutorial for doing this in Excel.
Add a Least Squares Regression line to the scatter plot you made in Exercise A2. Record the equation for this line. How does it compare to the line you placed? A video tutorial for doing this in Excel is also available.
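
If you would like to see what Excel or your calculator is doing when it adds the Least Squares Regression line, here is a minimal Python sketch of the standard slope and intercept formulas. The function name `least_squares_line` and the five data pairs are hypothetical, chosen only so the arithmetic is easy to follow. They are not the TV Life data, so the equation printed here is not the answer to the exercise.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values for illustration only, NOT the TV Life data.
people_per_tv = [1, 2, 3, 4, 5]
life_expectancy = [78, 75, 74, 71, 72]

slope, intercept = least_squares_line(people_per_tv, life_expectancy)
print(f"y = {intercept:.2f} + ({slope:.2f}) * x")
```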

Exercise E2: What is the proportion of variability (r^2) explained by the Least Squares Regression line? Interpret this value in the context of the DoW.
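
One way to see why r^2 is described as a proportion of variability is to compare the variation left over around the regression line (the SSE) with the total variation of y around its mean. The sketch below assumes a made-up data set and its least squares line. The name `sst` for the total variation is our own label, and none of the numbers come from the TV Life file.

```python
# Hypothetical values for illustration only, NOT the TV Life data.
people_per_tv = [1, 2, 3, 4, 5]
life_expectancy = [78, 75, 74, 71, 72]

# Least squares line for these made-up values: y = 78.8 - 1.6 * x
slope, intercept = -1.6, 78.8

mean_y = sum(life_expectancy) / len(life_expectancy)
predicted = [intercept + slope * x for x in people_per_tv]

# SSE: variation left over around the regression line.
sse = sum((y - p) ** 2 for y, p in zip(life_expectancy, predicted))
# SST: total variation of y around its mean.
sst = sum((y - mean_y) ** 2 for y in life_expectancy)

# Proportion of the total variation accounted for by the line.
r_squared = 1 - sse / sst
print(f"r^2 = {r_squared:.3f}")
```

This identity holds when the line used is the least squares line for the same data; for any other line, 1 - SSE/SST is not the square of the correlation coefficient.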
Exercise E3: In the US, there are 1.3 people per TV.

Use the Least Squares Regression line to calculate a prediction for the Life Expectancy in the US.
The actual Life Expectancy in the US is 75.5 years. What is the error for the prediction you just made?

The error you calculated in Exercise E3 is called the Residual Error for the prediction.
Residual Error = Actual Value – Predicted Value
This is the same “error” you looked at in Activity D, when you found the SSE. (The SSE is a measure of the “total” error of the prediction line.)
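
The residual formula and the SSE can be sketched in a few lines of Python. The helper names `predict` and `residual` are our own, and the slope and intercept are hypothetical, so the residual computed here for the US is only an illustration of the arithmetic, not the answer to Exercise E3. The US values (1.3 people per TV, 75.5 years) are the ones quoted in the exercise.

```python
def predict(x, slope, intercept):
    """Predicted y for a given x on the line y = intercept + slope * x."""
    return intercept + slope * x

def residual(actual, predicted):
    """Residual Error = Actual Value - Predicted Value."""
    return actual - predicted

# Hypothetical line, NOT the regression line for the TV Life data.
slope, intercept = -1.6, 78.8

# US values quoted in Exercise E3.
us_people_per_tv = 1.3
us_actual = 75.5
us_predicted = predict(us_people_per_tv, slope, intercept)
us_residual = residual(us_actual, us_predicted)
print(f"predicted = {us_predicted:.2f}, residual = {us_residual:.2f}")

# SSE: square each residual and add them up (made-up points).
xs = [1, 2, 3, 4, 5]
ys = [78, 75, 74, 71, 72]
residuals = [residual(y, predict(x, slope, intercept)) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
print(f"SSE = {sse:.2f}")
```

A negative residual means the line predicted a value higher than the actual one; a positive residual means the prediction was too low.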

The plot you make in Exercise E4 is called a Residual Plot. This plot allows us to easily identify points that do not fit the trend line. When looking at a distribution, we called observations that fell far outside the expected pattern of variation outliers. In Week 4, we discussed ways to identify outliers in a distribution and considered removing them from the data in order to better see the patterns in the variability. A residual plot is a tool for identifying outliers in a scatter plot. Likewise, we consider whether or not to remove such points from the plot.

It is also possible for a point to appear to be in line with the trend of the data, but still be an outlier. Consider the point circled in the scatter plot below:

If this point were to be removed, it likely would have a drastic effect on the regression equation. It is called an influential outlier. Such points are far from the rest of the data horizontally, and likely would appear as outliers in the distribution for the x-variable (using the test of 3 standard deviations from the mean). Removing such points can greatly affect the regression line (particularly the slope). For more information on outliers in a regression model, see the website.
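
To get a feel for how much one point can matter, here is a rough Python sketch using made-up data: fifteen points clustered at small x values, plus one point far to the right. The data, the helper name `slope_of`, and the choice of the sample standard deviation for the 3 standard deviation test are all assumptions for illustration. The sketch checks the far point against that test and compares the regression slope with and without it.

```python
from statistics import mean, stdev

def slope_of(xs, ys):
    """Slope of the least squares line for paired lists xs and ys."""
    mean_x = mean(xs)
    mean_y = mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up values, NOT the TV Life data: a cluster plus one far-right point.
xs = [1, 2, 3, 4, 5] * 3 + [40]
ys = [70] * 5 + [72] * 5 + [74] * 5 + [50]

# How many standard deviations is the last x value from the mean of x?
z_last = (xs[-1] - mean(xs)) / stdev(xs)
print(f"x = {xs[-1]} is {z_last:.2f} standard deviations from the mean")

slope_all = slope_of(xs, ys)
slope_without = slope_of(xs[:-1], ys[:-1])
print(f"slope with the point:    {slope_all:.3f}")
print(f"slope without the point: {slope_without:.3f}")
```

In this made-up example the cluster on its own shows no trend at all, so the negative slope comes entirely from the single far point. Real data are seldom this extreme, but the same comparison applies.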
Exercise E4: Right-Click on the scatter plot for DoW #6 and select Make Residual Plot. A new plot will appear beneath the scatter plot, showing the residual error for each data point.

Click on the two points in the Residual Plot that have the greatest Residual Errors. What countries are they? What are the errors in the prediction?
Save the file under a new name. Then, delete these two values. What happens to the correlation coefficient (r)? The best-fit equation? The proportion of variation (r^2)?
Would you consider these points to be outliers? Do you think we are justified in removing them from the data? Why or why not? Are there any other outliers you might remove? Explain.
Optional: Do you think there are any influential outliers for this regression? If so, try removing them and note how the regression equation and proportion of variation change. Also, consider whether or not there is a reason for removing them (aside from the fact that they affect the regression).
The following links will take you to two tutorials for creating residual plots.

The first is for a graphing calculator.
The second is for Excel.

Exercise E5: Return to our original question for DoW #6: Is there a relationship between life expectancy and the number of people per TV for a country? Consider the analyses you have completed for the DoW in Investigations 1 and 2. Now it’s time to interpret the analyses. How does the data answer this question? How do the tools we’ve used support this answer? What questions do you have about this analysis?

Write at least three summary statements interpreting DoW #6, supported with facts from the data and analyses.

Post these three statements (as well as any additional thoughts or questions you have on the DoW) to your group’s DB by Saturday, 10 PM EST.