I need help with a Statistics Assignment
hw2_graph_paper.docx

hw2_variance.xlsx

wk2_hw2rev.docx

Unformatted Attachment Preview

THE MEANING OF VARIANCE AND THE STANDARD DEVIATION (The MEAN is the horizontal RED line in each graph.
The BLUE lines in the top graphs represent the positive (+) and negative (-) distances from the mean.)

 "X"    DATA SET "A"    DATA SET "B"    DATA SET "C"
  1          9               8.5             7
  2          5               5.5             7
  3          8               7               7
  4          6               9               7
  5         12               5               7
  6          2               7               7
  7         13               8               7
  8          1               6               7
  9         10               9.5             7
 10          4               4.5             7
MEAN         7               7               7
VARIANCE    (left blank for you to calculate)
STD DEV     (left blank for you to calculate)

[The Excel file plots the DATA for Data Sets "A", "B" and "C" in three graphs (x-axis 0 to 10, y-axis 0 to 14),
each with the mean drawn as a horizontal line at 7 and the signed distances marked, e.g., +6 and -5 in Data Set "A".]

IN GRAPHS:
THE BOTTOM, HORIZONTAL AXIS IS THE "X-AXIS", ALSO CALLED THE ABSCISSA, AND IS THE "INDEPENDENT VARIABLE"
(LIKE EDUCATION VERSUS INCOME, WHERE EDUCATION IS THE INDEPENDENT VARIABLE).
IN THE GRAPHS ABOVE, THE X-VALUES SIMPLY REFER TO THE TEN DATA POINTS.
THE VERTICAL AXIS IS THE "Y-AXIS", ALSO CALLED THE ORDINATE, AND IS THE "DEPENDENT VARIABLE"
(INCOME DEPENDS ON EDUCATION).
IN GRAPHS EACH DATA POINT IS IDENTIFIED BY ITS X AND Y COORDINATES (X, Y). FOR EXAMPLE, DATA POINT 1 IN
DATA SET "A" IS (1, 9).
WEEK 2 HW 2 (Based on Lane C3 and Illowsky C2.5 – 2.8)
TAKE THE TIME TO READ THE “INTRODUCTIONS TO STATISTICS” SECTIONS IN BOTH TEXTS. They will give you a better
perspective on what we are covering and why. As always, email or “message” me with any questions. I check these
at least once per day, but I am NOT online 24/7. AND, if you can help a classmate out, please do so without waiting
for me. If an unanswered question makes you unable to complete an assignment on time, let me know.
Last week we talked about how to collect samples from a population of interest. Random sampling is intended to ensure
that we collect data representative of the entire population. The key is RANDOM with EVERY selection having the same
probability of being collected as every other so that no sources of error or bias are introduced. Of course in some less
ethical sampling programs, bias is intended and counted on to sway unknowing decision makers (e.g. political ads).
So, how do we spot BIAS if we don’t know how the data were collected? Often, we must read opposing studies that
may have their own bias in them; then, WE can balance that bias (critical thinking) and come up with our own
conclusions. However, we don’t really look at these raw data, rather we review STATISTICS calculated FROM those
data. Unfortunately, these same statistics can be calculated from biased data just as easily as from truly representative
data. The math is the same. The validity, meaning usefulness to us, isn’t (unless we are the ones fudging the analyses).
OK, just what statistics are we talking about? Just three: MEANS, VARIANCES, AND STANDARD DEVIATIONS. With
properly collected samples the means and variances of the sample data are good predictors (unbiased) of population
means and variances. The standard deviations are not as valid (biased).
Just how well do our sample data statistics reflect the true population’s statistics? This is where PROBABILITY comes in.
Do we want to be (or can we be) 90% certain? 95% or even 99% ? Yes. As you might imagine, sample size has a great
effect on our confidence level.
That’s all there is to statistics. We collect data from samples of the population, calculate means, variances and/or
standard deviations and then try to make probability-based statements about the population in regard to the parameter
being analyzed: health, wealth, effectiveness, life span, etc. NOW, onto the homework problems. Each weekly HW
assignment will have 10 problems and they are based on the material in the Lane and Illowsky text chapters assigned.
So, let’s start the HOMEWORK with MEANS and how they are calculated.
PROBLEM #1. Derivation of MEANS
(a) Write down 30 numbers between 0 and 100. There can and SHOULD be some REPEATS (more than one, two or
more of the same number, but not all the same). DO NOT TRY FOR A PATTERN. DO NOT PUT THEM IN ORDER, AND DO
NOT USE A RANDOM NUMBERS GENERATOR OR TABLE, BUT DO HAVE REPEATS. FOR EXAMPLE: 3, 67, 67, 67, 1, 23,
67, 45, 3, . . . DO NOT CALCULATE ANY STATISTICS FOR THIS DATA SET NOW.
(b) Open the sheet of GRAPH PAPER that is attached to this Assignment, then PLOT your 30 data points on the graph
carefully.
Graphing review: the horizontal axis of a graph is the x-axis and represents the INDEPENDENT variable, in this case it
is simply data point 1, data point 2, . . ., data point 30. The scale on this x-axis needs only go from 0 to 30. The
vertical axis is the y-axis and represents the DEPENDENT variable, in this case the 30 numbers you wrote down
corresponding to your 30 data points, and the scale on this y-axis will go from 0 to 100.
(c) Look at the pattern of your data points. EYEBALL and DRAW a STRAIGHT (use a ruler) horizontal line on the graph
where you think the mean (average) value is (don't calculate it).
(d) NOW, CALCULATE the MEAN (average) of your 30 data points and DRAW a line representing it on the graph. How
close were you? Let’s see.
(e) MEASURE and RECORD the distances from each of your 30 data points to the MEAN APPROXIMATION LINE YOU
DREW AND THEN THE DISTANCES TO THE ACTUAL MEAN. Measurements ABOVE the lines are positive distances and
those below the line are negative distances. ADD EACH SET OF THOSE + AND – DISTANCES TOGETHER (one total
for your mean, one for the true mean). WHAT TOTALS DID YOU GET FOR YOUR MEAN AND THE TRUE MEAN? EXPLAIN
WHAT THIS SHOWS ABOUT WHAT THE MEAN REALLY REPRESENTS. THINK ABOUT IT. THAT'S IT FOR THIS PROBLEM. You should now have
an idea of what a MEAN really shows.
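The balancing act in part (e) can be checked numerically. A minimal sketch, using a made-up 10-point data set (not your own 30 points): the signed distances to the true mean always cancel out.

```python
# Hypothetical data between 0 and 100, with repeats (made up for illustration).
data = [3, 67, 67, 1, 23, 45, 3, 88, 12, 50]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]   # + above the mean, - below it

# The positive and negative distances cancel exactly at the true mean.
print(abs(round(sum(deviations), 10)))  # 0.0
```

This is why the signed-distance total by itself cannot describe spread: it is zero for every data set.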
PROBLEM #2. DERIVATION OF VARIANCES AND STANDARD DEVIATIONS
Open the attached Excel file: “ VARIANCE” and use it to answer the following questions
In Problem #1 we calculated means. The VARIANCE file shows three data sets with the SAME MEAN (7). BUT, the data
plots certainly don’t look the same. How do we account for the differences? We use Variance and its square root, the
Standard Deviation.
The mean tells us only part of the picture of a data set. Note that the mean is the same for all three of these data sets,
but the graphs look quite different. How do we explain that with numbers ? In the exercise about means, you measured
the distances from each data point to the true mean, but the result was not really useful for describing the point spread,
was it? The + and – distance totals for these current three data sets would all give the same result (try it if you are not sure).
BUT, these distances ARE the key, and adding them together is the approach, but how do we handle the NEGATIVE
distances? We SQUARE them. Squaring a negative number ALWAYS gives a positive product: (-3) x (-3) = +9. SO, this
is what we do. We SQUARE ALL ten of the distances, some of which are already positive and some negative, and THEN
we add those now all-positive squared distances up.
Next, how do we average them? For the mean of ten numbers we added the 10 numbers together and
divided by 10 (the number of data points in a sample data set is referred to as "n"). So, is this how we average the
squared distances? NO.
This may be a little confusing, but since we already know the mean of our ten numbers, only 9 of the squares can vary as
they like, but the 10th must be fixed. For example, if I give you 5 numbers: 3, 6, 7, 9, ___, but tell you the total is 30,
then the missing number MUST be ____? AND, if I told you that the mean of these 5 numbers was 6, you would get the
same value for the fifth number (6 * 5 = 30). Long story short, FOR SAMPLES we divide the sum of the squared distances
by (n – 1), which would be (10 – 1) = 9 in these 10-number data sets. (Later, you will see that for entire POPULATIONS we
would divide simply by N; the capital "N" is for populations, the small "n" is for samples.) This subtraction in later
chapters becomes the "Degrees of Freedom".
(a) FOR EACH of the three data sets, determine the distance of each data point from the mean,
(b) SQUARE those distances (to make them all positive numbers),
(c) ADD them up and DIVIDE that total by (n – 1) = 9 .
(d) Write these THREE numbers as the first part of your answers to this problem. THESE ARE THE VARIANCES FOR OUR
THREE DATA SETS. The units attached to these numbers are kind of meaningless, for example if our data are numbers of
minutes, then the variances would be “square minutes”.
(e) FINALLY, the STANDARD DEVIATION IS SIMPLY THE SQUARE ROOT OF THE VARIANCE. WHAT ARE THOSE THREE
NUMBERS? Keep in mind that the SD goes both ways, + and – from the mean, just as our data points did, since the
square root of any number can be + or – (e.g., √16 = ±4, since +4 x +4 = 16 and -4 x -4 also = 16). You can see that the
units now make more sense (√(minutes²) = simply "minutes").
(f) What does knowing the STANDARD DEVIATION of a data set tell you about those data? What would you say about
your confidence (or lack of) in the data based on the SD? Explain (last part of this problem).
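The steps (a) through (e) can be sketched in code. This example uses a made-up 5-point sample, NOT one of the assignment's three data sets, so the homework answers are left for you.

```python
import math

# Hypothetical 5-point sample (made up for illustration only).
data = [4, 7, 7, 10, 2]
n = len(data)

mean = sum(data) / n                        # the mean of the sample
sq_dists = [(x - mean) ** 2 for x in data]  # (a)-(b): squared distances from the mean
variance = sum(sq_dists) / (n - 1)          # (c): divide by n - 1, NOT n, for samples
std_dev = math.sqrt(variance)               # (e): SD is the square root of the variance

print(variance, round(std_dev, 4))  # 9.5 3.0822
```

Note the (n - 1) denominator: this is the sample variance with one degree of freedom removed, exactly as described above.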
KNOW THIS: For many REAL data sets (but not all, as there are conditions that apply) approximately 68% of the data
points are within plus or minus one standard deviation from the mean, about 95% are within ± two SD's of the mean,
and about 99.7% are within ± three SD's of the mean. REMEMBER THIS. For example, if the mean is 10 and the SD is 3,
then in theory we could expect 68% of our data points to be between 7 and 13.
PROBLEM #3: (a) YOU NOW KNOW HOW TO CALCULATE THE THREE MAJOR STATISTICS OF A DATA SET: MEAN,
VARIANCE & STANDARD DEVIATION. SO, FOR YOUR 30-POINT DATA SET CALCULATE THE VARIANCE AND STANDARD
DEVIATION FOR THE TRUE MEAN (NOT FOR YOUR APPROXIMATE MEAN). YOU ALREADY HAVE THE DISTANCES FROM
THE TRUE MEAN TO YOUR 30 DATA POINTS, SO SQUARE EACH OF THEM, ADD THOSE SQUARES UP, AND DIVIDE THAT
TOTAL BY (n – 1) = 29: THIS IS THE VARIANCE. TAKE ITS SQUARE ROOT AND YOU HAVE THE STANDARD DEVIATION.
PROBLEM #3: (b) There are other statistics and displays used for various purposes that we can calculate or generate
for data sets: medians, modes, quartiles, IQR’s and outliers. Additional displays include the “5-Number Summary” and
the BOX PLOT. These are still all DESCRIPTIVE STATISTICS, which simply apply to sample data sets. Later, when we use
our SAMPLE data to predict things about the POPULATION we will be using INFERENTIAL STATISTICS, which involves
probability, BUT the means and variances are the keys to these topics as well. FOR YOUR 30-POINT DATA SET:
calculate these additional statistics: the median, mode, and range
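These additional statistics can be checked with Python's standard `statistics` module. A sketch on a made-up data set (use your own 30 points for the assignment):

```python
import statistics

# Hypothetical data set, made up for illustration only.
data = [3, 67, 67, 1, 23, 45, 3, 67, 12, 50]

median = statistics.median(data)     # middle value of the sorted data
mode = statistics.mode(data)         # most frequently occurring value
data_range = max(data) - min(data)   # highest value minus lowest value

print(median, mode, data_range)  # 34.0 67 66
```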
MOVING ON TO ILLOWSKY CHAPTER 1
PROBLEM #4 (like Illowsky #80): NOW, Let’s start to make the transition from Descriptive to Inferential
statistics.
This problem deals with FREQUENCIES, RELATIVE FREQUENCIES AND CUMULATIVE RELATIVE FREQUENCIES.
This is the introduction to the PROBABILITY (what are the odds) part of Statistical Analyses.
We will use the 30 data points you generated in Problem #1. What you are going to do should make sense
when you keep in mind that most data sets are huge.
FIRST: We simplify large data sets by grouping those data. For your data let’s use a group size of 10. So, the
groups would be: 0 to 10, 11 to 20, 21 to 30, etc., up to 91 to 100. Each range MUST BE THE SAME. Now,
group YOUR 30 data points by simply counting the number of YOUR data points in each of these ranges. If you
have none in one or more ranges, put in zero. Fill in the following Table with ALL missing values:
GROUPS     | FREQUENCY | RELATIVE FREQUENCY | CUMULATIVE REL FREQ
0 – 10     |           |                    |
11 – 20    |           |                    |
21 – 30    |           |                    |
. . .      |           |                    |
91 – 100   |           |                    |
(a) Determine the FREQUENCIES as described above (simply the number of your data points in that range or
group).
(b) Calculate the RELATIVE FREQUENCIES for each group (simply the number in a particular group divided by
the total number of data points – here it’s 30)
(c) Calculate the CUMULATIVE RELATIVE FREQUENCY as you go down the last column (simply add up the
Relative Frequencies as you go down). MAKE NOTE OF YOUR FINAL NUMBER IN THE BOTTOM
RIGHT CORNER BOX. THIS VALUE IS VERY IMPORTANT AS A CHECK.
NOTE: The frequency is a whole number. The relative frequency (and cumulative rel freq) when first set
up are fractions (e.g., 6/30) then calculated decimal fractions like 6/30 = 0.20, which can be converted to
a percentage 0.20 x 100% = 20%. Get used to these versions.
NOTE that the CUMULATIVE RELATIVE FREQUENCY column of such a table (data ordered lowest to highest) gives the
percent of data at or below that data point. THIS IS IMPORTANT TO RECOGNIZE.
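The whole table can be sketched in a few lines of code. This example uses a made-up 10-point set (the assignment uses your 30 points); the group boundaries are the ones given above.

```python
# Hypothetical data set, made up for illustration only.
data = [3, 67, 67, 1, 23, 45, 3, 88, 12, 50]
n = len(data)

# Groups: 0-10, then 11-20, 21-30, ..., 91-100.
groups = [(0, 10)] + [(lo, lo + 9) for lo in range(11, 100, 10)]

cum = 0.0
for lo, hi in groups:
    freq = sum(1 for x in data if lo <= x <= hi)  # frequency: a whole number
    rel = freq / n                                # relative frequency (decimal fraction)
    cum += rel                                    # cumulative relative frequency
    print(f"{lo:3d}-{hi:3d}  {freq}  {rel:.2f}  {cum:.2f}")

# The final cumulative relative frequency (bottom-right box) must be 1.00.
```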
PROBLEM #5. (a) PLOT (by hand is fine) a HISTOGRAM (like a bar chart BUT the columns are wide and touch)
with the Ten data GROUPS you created along the x-axis and the FREQUENCIES FOR EACH GROUP UP THE Y-AXIS. (Don't use Excel unless you know how to reduce the "gap" between columns to get them to touch).
(b) Do a second HISTOGRAM for the RELATIVE FREQUENCIES. COMPARE THESE TWO PLOTS AND EXPLAIN
THE DIFFERENCES IN WHAT THEY LOOK LIKE AND TELL US ABOUT THE DATA SET.
(c) LOOK at the plots – Do they look like a BELL CURVE (the Normal Distribution) OR are they SKEWED
(positive or negative)?
(d) Sketch a plot that has a positive skew and one that has a negative skew (recognizing skewness has been a
typical final exam question, so know what they look like)
PROBLEM #6. Interpreting Frequency Distributions (Like Illowsky #84): Using YOUR set of 30 data points:
a. What is the frequency for your data between 21 and 59? (This could also be stated as between 20 and 60)
b. What percentage of your data are 71 or greater?
c. What is the relative frequency of your data under 50? (Does not include the number 50 since it’s not
“under” 50)
d. What is the cumulative relative frequency for your data less than 40?
PROBLEM #7. (Like Illowsky #90):
We have 30 students enrolled in this class: 1/3 live outside the U.S., 20% have a slow internet connection,
and 12 are under age 30. Compute the following:
a. How many and what % of students live outside the U.S. ?
b. How many and what fraction (e.g., ½, 1/6, ?) and what decimal fraction do NOT have a slow internet
connection?
c. How many and what percent are OVER age 30?
d. IF you were to write the ages on 30 pieces of paper, put them in a hat and blindly draw one out, what are
the chances (probability) that age would be under 30 years? (This is where we go next with our studies)
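Converting among counts, fractions, decimal fractions and percents is the whole game in this problem. A sketch with different numbers (a hypothetical class of 40 where 1/4 live outside the U.S.), so the problem itself is left to you:

```python
from fractions import Fraction

total = 40                           # hypothetical class size
outside = total * Fraction(1, 4)     # count: 40 * 1/4 = 10 students
inside_frac = 1 - Fraction(1, 4)     # fraction NOT outside: 3/4
inside_dec = float(inside_frac)      # as a decimal fraction: 0.75
inside_pct = inside_dec * 100        # as a percent: 75.0%

print(outside, inside_frac, inside_dec, inside_pct)  # 10 3/4 0.75 75.0
```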
PROBLEM #8. PERCENTILES (Related to LANE C-1 – Formula on pages 29-31)
We will use the FORMULA approach (Lane's third definition) to calculate PERCENTILES:
Rank = (percentile / 100) * (n + 1)
AND if the Rank ends up with a fraction (FR in Lane), e.g. 6.4, multiply that fraction (0.4) by
the difference between the actual data points with the ranks above and below 6.4 (in this case,
the data points with ranks 6 and 7), then add the result to the lower data point's value.
(a) RANK ORDER your 30 data points from LOWEST to HIGHEST.
(b) Look at the list and "EYEBALL" (don't calculate) what you think is the 25th percentile (this is the FIRST
QUARTILE or Q1), then the MEDIAN (the 50th percentile or Q2), and finally the 75th percentile (the
THIRD QUARTILE or Q3). (You can also try using Lane's first and second definitions of "percentile".)
(c) NOW, using the FORMULA, calculate these three QUARTILES: Q1, Q2 and Q3. How close did you come?
(d) Calculate the Inter-Quartile Range (IQR) which is simply Q3 – Q1.
(e) FINALLY, complete a 5-Number Summary and BOX PLOT for YOUR data. LABEL OR DESCRIBE WHAT EACH
OF THE PARTS OR LINES IN THE BOX PLOT TELL US ABOUT YOUR DATA SET.
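The rank-interpolation formula above can be sketched as a small function. The sample here is a made-up 8-point set, not your 30 points:

```python
def percentile(sorted_data, p):
    """Lane's third definition: Rank = percentile/100 * (n + 1), then interpolate."""
    n = len(sorted_data)
    rank = p / 100 * (n + 1)
    lo = int(rank)                 # whole part of the rank
    fr = rank - lo                 # fractional part (FR in Lane)
    if lo < 1:
        return sorted_data[0]
    if lo >= n:
        return sorted_data[-1]
    below = sorted_data[lo - 1]    # data point at the rank below (ranks are 1-based)
    above = sorted_data[lo]        # data point at the rank above
    return below + fr * (above - below)

data = sorted([2, 4, 5, 7, 8, 9, 11, 12])   # hypothetical, n = 8
q1 = percentile(data, 25)   # rank 2.25 -> 4 + 0.25*(5 - 4) = 4.25
q2 = percentile(data, 50)   # rank 4.5  -> 7 + 0.5*(8 - 7)  = 7.5
q3 = percentile(data, 75)   # rank 6.75 -> 9 + 0.75*(11 - 9) = 10.5
print(q1, q2, q3, "IQR =", q3 - q1)  # 4.25 7.5 10.5 IQR = 6.25
```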
PROBLEM #9. REVIEW: If you are in the 75th PERCENTILE this means that 75% of the scores (data points) are
below yours. To actually enroll to be on-campus at many large State Universities (like Penn State) you need to
be in the top 10% of your high school graduating class. What percentile would this be?
We have been calculating PERCENTILES, but what if we want to know just what percentile a given data point
actually is? Here is that formula: Percentile = [ (x + 0.5y) / n ] * 100%
where x = the number of data points BELOW the given point, y = the number of duplicate
scores AT that given data point, and n = the total number of data points; that result times 100%
gives us the percentile.
Here are 15 final exam scores: 56, 67, 30, 76, 89, 54, 80, 76, 67, 92, 92, 67, 67, 95, 80
(a) What percentile are the scores of 80 and 54 ?
(b) If you need to be in 80th percentile to pass, what score is that? (other formula)
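The "what percentile is this score?" formula can be sketched directly. The score list below is made up, not the 15 exam scores in the problem, so parts (a) and (b) stay yours to solve:

```python
def percentile_of(data, value):
    """Percentile = [(x + 0.5y) / n] * 100%, per the formula above."""
    x = sum(1 for s in data if s < value)  # x: data points strictly below the value
    y = data.count(value)                  # y: duplicate scores at the value itself
    n = len(data)                          # n: total number of data points
    return (x + 0.5 * y) / n * 100

scores = [10, 20, 20, 30, 40]              # hypothetical scores
print(percentile_of(scores, 20))  # (1 + 0.5*2) / 5 * 100 = 40.0
```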
PROBLEM #10. The last concept we want to cover is “rare events” or “unusual values”. How do we identify
them? This does NOT mean that if a data point is “unusual” we can simply delete it. Any deletions MUST
be justifiable. Imagine if a pharmaceutical company deleted the one death in 100 patients taking a new
medication because it was “unusual”. Want to take that pill?
An UNUSUAL data point can be characterized two ways (for us):
(1) Any data point that is more than TWO (2) standard deviations ABOVE OR BELOW the mean.
(2) Any data point ABOVE Q3 + (1.5 * IQR) or BELOW Q1 – (1.5 * IQR). (Note: these fences are measured from the quartiles, not the mean.)
CALCULATE the "unusual" thresholds using BOTH of the above methods, and then, LOOKING at YOUR data,
check to see if your highest and lowest data points are "Unusual". Do you have MORE than one unusual data
point?
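Both "unusual value" checks can be sketched together. The data set is made up, and the quartiles here use a crude median-split purely for illustration (the assignment wants the rank formula from Problem #8):

```python
import math

# Hypothetical sample with one suspicious point (30), made up for illustration.
data = sorted([2, 4, 5, 7, 8, 9, 11, 30])
n = len(data)

# Method 1: more than 2 standard deviations above or below the mean.
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
lo1, hi1 = mean - 2 * sd, mean + 2 * sd

# Method 2: 1.5*IQR fences around the QUARTILES (not the mean).
q1 = (data[1] + data[2]) / 2   # crude quartiles for this 8-point sample
q3 = (data[5] + data[6]) / 2
iqr = q3 - q1
lo2, hi2 = q1 - 1.5 * iqr, q3 + 1.5 * iqr

unusual = [x for x in data if x < lo2 or x > hi2]
print(unusual)  # [30]
```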
Next week we move into INFERENTIAL STATISTICS, which allows us to test hypotheses about a
POPULATION, based on samples taken from that population. This is what statistics is all about. It starts
with PROBABILITY, meaning that life is a "crap shoot". Failing to realize that, we end up with the famous last
words: "What could go wrong?"
BONUS (1/2 pt)
Here is a complex math problem, but the equation is one that you WILL need to use to solve a BINOMIAL
problem. TRY IT AND IF THE ANSWER IS CORRECT EARN THE 0.5 BONUS POINT.
Find P(x), the Probability of (x), given that N = 10, x = 3, q = 0.4.
(The * sign means multiply; the ! sign means a factorial, for example 5! = 5 * 4 * 3 * 2 * 1 = 120; and an
exponent means multiply the number by itself that many times, for example 3^4 = 3 * 3 * 3 * 3 = 81.)
P(x) = { N! / [ x! * (N – x)! ] } * q^x * (1 – q)^(N – x)
The key to solving complex equations is the ORDER in which you do the math. Here is a phrase that gives the
order: “Please Excuse My Dear Aunt Sally”. The first letter tells the order: “P” refers to ( ), [ ] and { }
(parentheses then brackets then braces). You deal with the math inside each of these “containers” in that
order. Now, within each container do the Exponents first, followed by the Multiplications and Divisions and
finally the Additions and Subtractions.
START by plugging the numbers into this equation. Do the calculations in ( ) first, then those in [ ], and finally
those in { }. Now, multiply that result by q^x * (1 – q)^(N – x) and that will be your answer.
USE 4 DECIMAL PLACES (0.0000).
(HINT: the answer is < 1.) GOOD LUCK!
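The order of operations in the formula can be sketched in code. This example deliberately uses DIFFERENT numbers (N = 5, x = 2, q = 0.3), so the bonus itself is still left for you to work out by hand:

```python
import math

# Hypothetical parameters (NOT the bonus problem's N = 10, x = 3, q = 0.4).
N, x, q = 5, 2, 0.3

# P(x) = { N! / [ x! * (N - x)! ] } * q^x * (1 - q)^(N - x)
coeff = math.factorial(N) / (math.factorial(x) * math.factorial(N - x))  # the { } part
p = coeff * q ** x * (1 - q) ** (N - x)   # then multiply by the exponent terms

print(round(p, 4))  # 0.3087
```

Note the PEMDAS order at work: the factorial "containers" are evaluated first, then the exponents, then the multiplications.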