Lecture 1

 

What is Statistics?

Methodology for

Collecting, Displaying, Analyzing Data

Data Detective

Statistics is used for such questions as

What is the relationship between variables?

Is our quality control adequate?

Can cellular phones cause cancer?

 

Descriptive Statistics

versus

Inferential Statistics

 

This Course

Outline - Seven (7) Topics

Data Collection and Display

Data Summary

Probability Theory

Probability Distribution

Statistical Inference

Analysis of Variance

Regression Analysis

No statistical packages

Excellent Textbook (so far so good!)

Grades

Four (4) Quizzes = four (4) units

Final Exam = two (2) units

Homework and project = two (2) units

TOTAL = eight (8) units

Note - Final may count as three (3) units to replace low quiz grade

Posting of Homework (no late HW)

Work on Homework

Office Hours: Tu2-3, W11-12, Th11-12

 

Fundamental Concepts

Statistical analysis is used to learn something about a population

Population - set of units that we are interested in studying

What is the population for the USA census?

 

A characteristic or property of each individual population unit is referred to as a variable

What are some variables for the USA census?

 

Many statistical studies are based on samples

Sample - subset of a population

Is sampling used for USA census?

 

Statistical Inference - generally an estimate or prediction about a population based on information from a sample

 

Measure of Reliability - statement about the degree of uncertainty associated with a statistical inference

 

Data and Data Sources

Variables may be qualitative or quantitative. Further, a quantitative variable may be discrete or continuous

List some qualitative data from USA census

List some quantitative data from USA census

 

Data Acquisition

Experimental versus Observational Studies

Data acquisition is an extremely important aspect of statistical work

"Rubbish In - Rubbish Out"

 

We are surrounded by statistics which are junk or, at best, misleading. In some cases, statistics are designed to be misleading.

 

When interpreting statistics, always ask: Who? Why? How?

"The Roper Center"

 

Example of Flawed Results

from "Connecticut Labor and Unemployment Statistics for September 1993"

One study - 42,000 jobs gained

Second study - 30,000 jobs lost

Why the difference in results?

The two studies are trying to measure the same thing; however, the METHODS used are different

One study was based on a survey of EMPLOYERS and the second a survey of HOUSEHOLDS

 

We are interested in the methods used for data acquisition and the typical pitfalls associated with data acquisition

 

Representative sample

In order to apply inferential statistics, the sample for our study must be a representative sample

The most common type of representative sample is a random sample, where every subset of a given size has an equal chance of being selected

A sample need not be a random sample to be representative. Other probability sampling methods may be used
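
As a rough illustration of a random sample, the sketch below draws units without replacement using Python's standard `random` module. The population of unit IDs 1 to 1000, the sample size of 10, the seed and the function name `simple_random_sample` are all made up for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: unit IDs 1..1000
population = list(range(1, 1001))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

Changing or omitting the seed gives a different, equally valid random sample.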

 

Lecture 2

Data Display

Typical Steps for Analyzing Data

Display Data

draw picture so as to better understand data distribution

Summarize Data

Analyze Data

Many different techniques are available for DATA DISPLAY. The method used depends on the data-set

Qualitative Data (Univariate or Bivariate)

Bar charts

Pie charts

Tabular Displays

Quantitative Data

Univariate

Arrays (small data-sets)

Dot plots (small data-sets)

Stem-and-leaf plots

Frequency Distributions and Histograms and Polygons

Cumulative Frequency Distributions

Bivariate

Scatter plots

Multivariate

Face plots

 

Bar Charts

Conventions for Constructing Bar Charts

same width for each bar

space between bars

ranked by order (except for "others" category)

 

Tabular Display

For bivariate qualitative data

also referred to as cross-classification

format commonly used for bivariate probability distributions

Example: Results of a study of intramural participation at UCONN

 

 

 

 

 

Example: Results of a study of intramural participation at UCONN

What are the variables?

What are the units?

How many units?

Tabular display also used for quantitative data that have been classified

 

 

 

Data Display

 

Data-sets for Examples

Maximum serve speed (mph) of the 16 players in the round of 16 at the US-Open and at Wimbledon

US-Open

108 112 106 114 109 114 105 115

121 109 102 108 98 115 109 110

Wimbledon

90 102 104 110 93 98 103 108

104 99 97 97 106 101 90 105

 

Stem-and-Leaf Plot

Used to show distribution of data

Back-to-back stem-and-leaf plots are used to compare the distributions of two different data sets
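
A minimal sketch of how a stem-and-leaf plot could be built for the US-Open serve speeds listed above. It assumes the stem is everything but the last digit and the leaf is the last digit; the function name `stem_and_leaf` is only illustrative.

```python
from collections import defaultdict

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

def stem_and_leaf(data):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(us_open)
# Print every stem in the range, even one that has no leaves
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

Because each leaf is an actual digit of a data value, the original data can be read back from the plot.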

 

 

 

 

 

Frequency Distribution and Histograms and Polygons

Used to show distribution of data

First step is the construction of a frequency distribution

Guidelines for Effective Frequency Distribution

Mutually Exclusive and Exhaustive Classes

Not too few or too many classes (four to twenty generally used)

In general, use same width for all classes

Class limits should be unambiguous and mid-point of class should be close to the class average

Histogram - rectangular graph of a freq. dist.

Polygon - line graph of a freq. dist.

 

 

Construction of Frequency Distribution

determine range

determine class interval

determine number of classes

US Open Example

Range = 121 - 98 = 23

Try six (6) classes

23/6 is about 4, use width of 5

Serve Speed (mph)    Frequency    Percentage
95 to <100           1            6.25%
100 to <105          1            6.25%
105 to <110          7            43.75%
110 to <115          4            25.00%
115 to <120          2            12.50%
120 to <125          1            6.25%
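
The construction steps above can be sketched in a few lines of Python. The starting value of 95, the class width of 5 and the six classes follow the US Open example; the function name `frequency_distribution` is an assumption for illustration.

```python
us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

def frequency_distribution(data, start, width, n_classes):
    """Count values in the classes [lower, lower + width)."""
    counts = [0] * n_classes
    for value in data:
        index = (value - start) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} falls outside the classes")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, count)
            for i, count in enumerate(counts)]

table = frequency_distribution(us_open, start=95, width=5, n_classes=6)
for lower, upper, count in table:
    percent = 100 * count / len(us_open)
    print(f"{lower} to <{upper}  {count:2d}  {percent:.2f}%")
```

Using half-open classes keeps them mutually exclusive, and the range check keeps them exhaustive.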

 

Histogram

Rectangular plot of frequency distribution

we can plot either frequency or percentage on y-axis

no space between bars

 

Frequency Polygon

Line plot of frequency distribution

 

Stem-and-leaf, Histograms and Polygons

All used to show distribution of data

What are the differences?

What are the pros and cons?

 

 

 

Stem-and-leaf preserves the original data

 

A polygon is better than a histogram for comparing two different distributions

 

Cumulative Frequency Distribution
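
As a small sketch, the cumulative frequency distribution for the US Open serve speeds can be obtained by running a total over the class frequencies from the frequency distribution example. The variable names are illustrative only.

```python
from itertools import accumulate

# Class upper limits and frequencies from the US Open frequency distribution
class_upper_limits = [100, 105, 110, 115, 120, 125]
frequencies = [1, 1, 7, 4, 2, 1]

cumulative = list(accumulate(frequencies))   # running total of frequencies
n = cumulative[-1]                           # total number of observations

for upper, cum in zip(class_upper_limits, cumulative):
    print(f"less than {upper}: {cum:2d}  {100 * cum / n:.2f}%")
```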

 

 

Bivariate Data

Two common tools

Scatter plots (for quantitative data only)

Tabular display (quantitative and qualitative)

Data for Scatter Plot

Max speed (mph)    First serve (%)
98                 62
102                57
105                58
106                54
112                57
114                50

 

Scatter Plots

Normal Scatter Plot

 

 

 

 

Symbolic Scatter Plot

 

 

 

 

symbol type represents a third variable

scatter graph now can be used for multivariate data

 

Lecture 3

 

 

Data Summary

Histograms and stem-and-leaf plots are used to show the distribution of the data set; they give a picture of this distribution

For statistical analysis, we need a quantitative picture of the data distribution: numerical parameters for modeling the distribution

 

What are the main features of the distribution curve?

Central Position

Variability

Shape

 

A number of different parameters are used as measures of each of these features

Central Position

mean, median, mode, percentiles

Variability

variance (standard deviation), range

Shape

measure of skewness

relative location of mean, median and mode

 

Mean versus median

What is the difference?

Why is median used for reports on housing prices?

 

Mean

Most commonly, the mean refers to the arithmetic average

Geometric mean and harmonic mean are useful for some applications
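
A brief sketch of where the other means fit, using Python's `statistics` module. The numbers are hypothetical: growth factors of 1.10 and 1.20 in two successive years (the geometric mean gives the average growth factor), and speeds of 40 and 60 mph over two legs of equal distance (the harmonic mean gives the average speed).

```python
import statistics

growth_factors = [1.10, 1.20]   # hypothetical year-over-year growth factors
speeds_mph = [40, 60]           # hypothetical speeds over two equal distances

arithmetic = statistics.mean(speeds_mph)              # ordinary average
geometric = statistics.geometric_mean(growth_factors) # average growth factor
harmonic = statistics.harmonic_mean(speeds_mph)       # average speed for the trip

print(f"arithmetic mean of speeds = {arithmetic}")
print(f"geometric mean of growth factors = {geometric:.4f}")
print(f"harmonic mean of speeds = {harmonic:.1f}")
```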

Median

The middle value of ranked items

i.e., the value with rank (n+1)/2

Example: US-Open

108 112 106 114 109 114 105 115

121 109 102 108 98 115 109 110

 

Array the items

98 102 105 106 108 108 109 109

109 110 112 114 114 115 115 121

Rank of median item = (n+1)/2 = (16+1)/2 = 8.5

the median item has rank 8.5

the median value is the average of the 8th and 9th ranked values

median value = (109+109)/2 = 109
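
The rank rule above can be written out as a short sketch. The function name `median_by_rank` is an assumption; the result is cross-checked against Python's built-in `statistics.median`.

```python
import math
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

def median_by_rank(data):
    """Return (rank, value) of the median using rank = (n + 1) / 2."""
    ordered = sorted(data)                    # array the items
    rank = (len(ordered) + 1) / 2             # 1-based rank, may end in .5
    lower = ordered[math.floor(rank) - 1]     # item just below the rank
    upper = ordered[math.ceil(rank) - 1]      # item just above the rank
    return rank, (lower + upper) / 2          # same item twice when n is odd

rank, value = median_by_rank(us_open)
print(f"rank = {rank}, median = {value}")
```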

 

Mode

The mode is the value in the data-set that occurs with the greatest frequency

Mode is another measure of position

A data-set may have one or more modes (unimodal vs. multimodal)

Example: US-Open

98 102 105 106 108 108 109 109

109 110 112 114 114 115 115 121

mode?

mode = 109

more useful for a large data-set
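
A minimal sketch for finding the mode or modes by counting, applied to both serve-speed data-sets from the Data Display examples. The function name `modes` is illustrative.

```python
from collections import Counter

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]
wimbledon = [90, 102, 104, 110, 93, 98, 103, 108,
             104, 99, 97, 97, 106, 101, 90, 105]

def modes(data):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print("US-Open modes:", modes(us_open))
print("Wimbledon modes:", modes(wimbledon))
```

With only 16 values, ties for the highest count arise easily, which is one reason the mode is more informative for a large data-set.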

 

Shape of Distributions

Distributions may be described as being

Left Skewed

Right Skewed

Symmetrical

Left skewed signifies a distribution with a long left tail

 

What are the relative values of the mean, median and mode for

Left Skewed

Right Skewed

Symmetrical

 

Measures of Variability

Is the data concentrated or spread out?

Measures

Range

Variance, Standard Deviation and Coefficient of Variation

Range - difference between largest and smallest value

Interquartile Range - difference between third and first quartile value
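
A rough sketch of both measures for the US-Open serve speeds. Quartiles can be defined in several slightly different ways, so the interquartile range from `statistics.quantiles` (with its default method) may differ a little from a hand method used in a textbook.

```python
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

# Range: difference between largest and smallest value
data_range = max(us_open) - min(us_open)

# Quartiles: the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(us_open, n=4)
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, interquartile range = {iqr}")
```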

 

Variance

Harder to calculate but much more useful than range

variance = sum of squares of the deviations / (n - 1)

some calculators use n not (n-1)

Why is sum of squares used?

Variance is a measure of the average squared deviation from the mean

Variance Calculation

101 101 101 102 102 103 104

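average (x bar) = 102, n = 7

xi      (xi - x bar)    (xi - x bar)^2    xi^2
101     -1              1                 10201
101     -1              1                 10201
101     -1              1                 10201
102      0              0                 10404
102      0              0                 10404
103     +1              1                 10609
104     +2              4                 10816
Sum 714  0              8                 72836

sum of (xi - x bar)^2 = 8

variance = 8/(7-1) = 1.33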

standard deviation = square root of variance

standard deviation = 1.15

Computational Formula

Many equations have a definitional formula and a computational formula

The computational formula is much easier to use for calculations

Variance = (sum of xi^2 - (sum of xi)^2/n) / (n - 1)

for the example

Variance = (72836 - 714^2/7) / (7 - 1) = 1.33

Coefficient of Variation = 100 * standard deviation / mean

Why use?
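
The variance example can be checked with a short sketch that evaluates the definitional formula, the computational formula and the coefficient of variation on the same seven values. Variable names are illustrative.

```python
import math

data = [101, 101, 101, 102, 102, 103, 104]
n = len(data)
mean = sum(data) / n

# Definitional formula: sum of squared deviations / (n - 1)
var_definitional = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational formula: (sum of xi^2 - (sum of xi)^2 / n) / (n - 1)
var_computational = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

std_dev = math.sqrt(var_definitional)
coeff_of_variation = 100 * std_dev / mean   # spread as a percentage of the mean

print(f"mean = {mean}")
print(f"variance (definitional)  = {var_definitional:.2f}")
print(f"variance (computational) = {var_computational:.2f}")
print(f"standard deviation = {std_dev:.2f}")
print(f"coefficient of variation = {coeff_of_variation:.2f}%")
```

Both formulas give the same variance; the computational one only needs the running sums of xi and xi^2.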

 

Interpreting Standard Deviation

Two rules of thumb are helpful for interpreting standard deviation

Chebyshev’s Rule applies to all distributions regardless of shape or lack of symmetry

Empirical Rule applies only to mound-shaped, symmetrical distributions

 

Chebyshev’s Rule

For any number k greater than one, at least (1 - 1/k^2) of the measurements fall within k standard deviations of the mean

Check to see if this applies to US Open data!

US Open: mean = 109.7, stdev = 5.6
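
One possible way to run this check is sketched below. It computes the mean and sample standard deviation directly from the 16 US-Open serve speeds as listed in the data-set, then compares the observed fraction within k standard deviations against the Chebyshev lower bound. The values of k and the helper name `fraction_within` are assumptions for the example.

```python
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

mean = statistics.mean(us_open)
sd = statistics.stdev(us_open)   # sample standard deviation, divides by n - 1

def fraction_within(data, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= x <= high for x in data) / len(data)

results = {}
for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2                        # Chebyshev lower bound
    observed = fraction_within(us_open, mean, sd, k)
    results[k] = (observed, bound)
    print(f"k = {k}: observed {observed:.3f}, Chebyshev bound {bound:.3f}")
```

Chebyshev's Rule only promises a minimum, so the observed fractions are expected to sit at or above the bound.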

 

Empirical Rule

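approximately 68% of the measurements fall within +/- 1 sd of the mean

approximately 95% within +/- 2 sd

approximately 99.7% within +/- 3 sd

Does this apply to US Open data? Do you expect it?

Exact values for the normal distribution are 68.27%, 95.45% and 99.73%.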
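
A small sketch for comparing the US-Open serve speeds with the Empirical Rule percentages. It counts how many of the 16 values, as listed in the data-set, fall within 1, 2 and 3 sample standard deviations of their mean. With so few observations only rough agreement should be expected.

```python
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

mean = statistics.mean(us_open)
sd = statistics.stdev(us_open)   # sample standard deviation, divides by n - 1

empirical_percent = {1: 68.0, 2: 95.0, 3: 99.7}   # Empirical Rule values

counts = []
for k, expected in empirical_percent.items():
    # Count values in the interval mean +/- k standard deviations
    inside = sum(mean - k * sd <= x <= mean + k * sd for x in us_open)
    counts.append(inside)
    observed = 100 * inside / len(us_open)
    print(f"+/- {k} sd: {inside} of {len(us_open)} = {observed:.1f}% "
          f"(Empirical Rule: about {expected}%)")
```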

 

 

 

 

 
