Lecture 1
What is Statistics
Methodology for
Collecting, Displaying, Analyzing Data
Data Detective
Statistics is used for such questions as
What is the relationship between variables?
Is our quality control Adequate?
Can cellular phones cause cancer?
Descriptive Statistics
versus
Inferential Statistics
This Course
Outline - Seven (7) Topics
Data Collection and Display
Data Summary
Probability Theory
Probability Distribution
Statistical Inference
Analysis of Variance
Regression Analysis
No statistical packages
Excellent Textbook (so far so good!)
Grades
Four (4) Quizzes = four (4) units
Final Exam = two (2) units
Homework and project = two (2) units
TOTAL = eight (8) units
Note - Final may count as three (3) units to replace a low quiz grade
Posting of Homework (no late HW)
Work on Homework
Office Hours: Tu2-3, W11-12, Th11-12
Fundamental Concepts
Statistical analysis is used to learn something about a population
Population - set of units that we are interested in studying
What is the population for the USA census?
A characteristic or property of each individual population unit is referred to as a variable
What are some variables for the USA census?
Many statistical studies are based on samples
Sample - subset of a population
Is sampling used for USA census?
Statistical Inference - generally an estimate or prediction about a population based on information from a sample
Measure of Reliability - statement about the degree of uncertainty associated with a statistical inference
Data and Data Sources
Variables may be qualitative or quantitative. Furthermore, quantitative variables may be discrete or continuous
List some qualitative data from USA census
List some quantitative data from USA census
Data Acquisition
Experimental versus Observational Studies
Data acquisition is an extremely important aspect of statistical work
"Rubbish In - Rubbish Out"
We are surrounded by statistics which are junk or, at best, misleading. In some cases, statistics are designed to be misleading.
In interpreting statistics always ask Who, Why, How?
"The Roper Center"
Example of Flawed Results
from "Connecticut Labor and Unemployment Statistics for September 1993"
One study - 42,000 jobs gained
Second study - 30,000 jobs lost
Why the difference in results?
The two studies are trying to measure the same thing; however, the METHODS used are different
One study was based on a survey of EMPLOYERS and the second on a survey of HOUSEHOLDS
We are interested in the methods used for data acquisition and the typical pitfalls associated with data acquisition
Representative sample
In order to apply inferential statistics the sample for our study must be a representative sample
The most common type of representative sample is the Random Sample, where every subset of a given size has an equal chance of being selected
A sample need not be a random sample to be representative. Other probability sampling may be used
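As a rough illustration of random sampling, the sketch below draws a sample of 10 units from a made-up population of 100 unit IDs. The population, the sample size and the seed are assumptions chosen only for the example; they do not come from the lecture.

```python
import random

# Hypothetical population: unit IDs 1..100 (not from the lecture).
population = list(range(1, 101))

# Fixed seed so the sketch gives the same sample on every run.
rng = random.Random(42)

# sample() draws without replacement, so every subset of size 10
# has the same chance of being selected.
sample = rng.sample(population, 10)

print(sorted(sample))
```

Changing the seed gives a different, equally likely subset.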
Lecture 2
Data Display
Typical Steps for Analyzing Data
Display Data
draw picture so as to better understand data distribution
Summarize Data
Analyze Data
Many different techniques are available for DATA DISPLAY. The method used depends on the data-set
Qualitative Data (Univariate or Bivariate)
Bar charts
Pie charts
Tabular Displays
Quantitative Data
Univariate
Arrays (small data-sets)
Dot plots (small data-sets)
Stem-and-leaf plots
Frequency Distributions and Histograms and Polygons
Cumulative Frequency Distributions
Bivariate
Scatter plots
Multivariate
Face plots
Bar Charts
Conventions for Constructing Bar Charts
same width for each bar
space between bars
ranked by order (except for "others" category)
Tabular Display
For bivariate qualitative data
also referred to as cross-classification
format commonly used for bivariate probability distributions
Example: Results of a study of intramural participation at UCONN
What are the variables?
What are the units?
How many units?
Tabular display also used for quantitative data that have been classified
Data Display
Data-sets for Examples
Maximum speed in mph of serve of the 16 players in
the round of 16 at the US-Open and at Wimbledon
US-Open
108 112 106 114 109 114 105 115
121 109 102 108 98 115 109 110
Wimbledon
90 102 104 110 93 98 103 108
104 99 97 97 106 101 90 105
Stem-and-Leaf Plot
Used to show distribution of data
Back to back stem-and-leaf used to compare distribution of two different data sets
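A minimal sketch of how a stem-and-leaf plot can be built for the US-Open serve speeds, assuming tens as stems and units as leaves. The function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

def stem_and_leaf(values):
    """Return {stem: [leaves]} with tens as stems and units as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

plot = stem_and_leaf(us_open)

# Print every stem in the range, including any stem with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Because each leaf is an original digit, the raw values can be read back from the plot.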
Frequency Distribution and Histograms and Polygons
Used to show distribution of data
First step is the construction of a frequency distribution
Guidelines for Effective Frequency Distribution
Mutually Exclusive and Exhaustive Classes
Not too few or too many classes (four to twenty generally used)
In general, use same width for all classes
Class limits should be unambiguous and mid-point of class should be close to the class average
Histogram - rectangular graph of a freq. dist.
Polygon - line graph of a freq. dist.
Construction of Frequency Distribution
determine range
determine class interval
determine number of classes
US Open Example
Range = 121 - 98 = 23
Try six (6) classes
23/6 is about 4, use width of 5
Serve Speed     Frequency   Percentage
95 to <100          1          6.25%
100 to <105         1          6.25%
105 to <110         7         43.75%
110 to <115         4         25.00%
115 to <120         2         12.50%
120 to <125         1          6.25%
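The table above can be reproduced with a short sketch. It assumes the first class starts at 95 and uses the width of 5 and six classes chosen above; the variable names are illustrative only.

```python
us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

lower, width, n_classes = 95, 5, 6  # classes: 95 to <100, ..., 120 to <125

# Each value falls in exactly one class (mutually exclusive and exhaustive).
counts = [0] * n_classes
for v in us_open:
    counts[(v - lower) // width] += 1

for i, count in enumerate(counts):
    start = lower + i * width
    share = 100 * count / len(us_open)
    print(f"{start} to <{start + width}: {count:2d}  {share:.2f}%")
```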
Histogram
Rectangular plot of frequency distribution
we can plot either frequency or percentage on y-axis
no space between bars
Frequency Polygon
Line plot of frequency distribution
Stem-and-leaf, Histograms and Polygons
All used to show distribution of data
What are the differences?
What are the pros and cons?
Stem-and-leaf preserves the original data
Polygon better than histogram for comparing two different distributions
Cumulative Frequency Distribution
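As a small sketch, a cumulative frequency distribution can be obtained by accumulating the class frequencies. The counts below are taken from the US-Open frequency distribution; the variable names are assumptions for the example.

```python
from itertools import accumulate

upper_limits = [100, 105, 110, 115, 120, 125]   # upper class limits (mph)
frequencies = [1, 1, 7, 4, 2, 1]                # US-Open class frequencies

# Running total of the frequencies: number of serves below each limit.
cumulative = list(accumulate(frequencies))

for limit, cum in zip(upper_limits, cumulative):
    print(f"less than {limit}: {cum:2d}  {100 * cum / cumulative[-1]:.2f}%")
```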
Bivariate Data
Two common tools
Scatter plots (for quantitative data only)
Tabular display (quantitative and qualitative)
Data for Scatter Plot
Max speed (mph)   First serve %
98                62
102               57
105               58
106               54
112               57
114               50
Scatter Plots
Normal Scatter Plot
Symbolic Scatter Plot
symbol type represents a third variable
scatter graph now can be used for multivariate data
Lecture 3
Data Summary
Histograms and stem-and-leaf plots are used to show the distribution of the data set; they give a picture of this distribution
For statistical analysis, we need a quantitative picture of the data distribution - numerical parameters for modeling the distribution
What are the main features of the distribution curve?
Central Position
Variability
Shape
A number of different parameters are used as measures of each of these features
Central Position
mean, median, mode, percentiles
Variability
variance (standard deviation), range
Shape
measure of skewness
relative location of mean, median and mode
Mean versus median
What is the difference?
Why is median used for reports on housing prices?
Mean
Most commonly, the mean refers to the arithmetic average
Geometric mean and harmonic mean are useful for some applications
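To make the three means concrete, the sketch below compares them on two made-up values, 2 and 8. The data are an assumption for illustration, not from the lecture; the functions come from Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later).

```python
import statistics

values = [2, 8]  # hypothetical data, chosen so the results are easy to check

arithmetic = statistics.mean(values)            # (2 + 8) / 2
geometric = statistics.geometric_mean(values)   # square root of 2 * 8
harmonic = statistics.harmonic_mean(values)     # 2 / (1/2 + 1/8)

print(f"arithmetic = {arithmetic:.2f}")
print(f"geometric  = {geometric:.2f}")
print(f"harmonic   = {harmonic:.2f}")
```

For positive values that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean the largest.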
Median
The middle value of ranked items
i.e., the value with rank (n+1)/2
Example: US-Open
108 112 106 114 109 114 105 115
121 109 102 108 98 115 109 110
Array the items
98 102 105 106 108 108 109 109
109 110 112 114 114 115 115 121
Rank of median item = (n+1)/2 = (16+1)/2 = 8.5
median item is 8.5 in rank
median value is average for 8th and 9th ranked
median value = (109+109)/2 = 109
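The rank rule can be sketched in a few lines. The helper name `median_by_rank` is made up for this example, and the result is compared with `statistics.median` from the standard library.

```python
import math
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

def median_by_rank(values):
    """Median as the value with rank (n+1)/2 in the ordered data."""
    ordered = sorted(values)
    rank = (len(ordered) + 1) / 2           # 1-based rank, e.g. 8.5 for n = 16
    lower = ordered[math.floor(rank) - 1]   # 8th ranked item when rank is 8.5
    upper = ordered[math.ceil(rank) - 1]    # 9th ranked item when rank is 8.5
    return (lower + upper) / 2              # same item twice when n is odd

print(median_by_rank(us_open))
print(statistics.median(us_open))
```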
Mode
The mode is the value in the data-set that occurs with the greatest frequency
Mode is another measure of position
A data-set may have one or more modes (unimodal vs. multimodal)
Example: US-Open
98 102 105 106 108 108 109 109
109 110 112 114 114 115 115 121
mode?
mode = 109
more useful for a large data-set
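A minimal sketch of finding the mode(s) by counting, using `statistics.multimode` (Python 3.8 or later). The second data set is a made-up example of a bimodal case.

```python
import statistics
from collections import Counter

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

# Frequency of each distinct serve speed.
frequency = Counter(us_open)

# multimode returns every value that attains the highest frequency.
modes = statistics.multimode(us_open)
print(modes, "occurs", frequency[modes[0]], "times")

# Hypothetical bimodal data: two values tie for the highest frequency.
bimodal = statistics.multimode([1, 1, 2, 2, 3])
print(bimodal)
```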
Shape of Distributions
Distributions may be described as being
Left Skewed
Right Skewed
Symmetrical
Left skewed signifies a distribution with a long left tail
What are the relative values of the mean, median and mode for
Left Skewed
Right Skewed
Symmetrical
Measures of Variability
Is the data concentrated or spread out?
Measures
Range
Variance, Standard Deviation and Coefficient of Variation
Range - difference between largest and smallest value
Interquartile Range - difference between third and first quartile value
Variance
Harder to calculate but much more useful than range
variance = sum of squares of the deviations / (n - 1)
some calculators use n not (n-1)
Why is sum of squares used?
Variance is a measure of the average squared deviation from the mean
Variance Calculation
101 101 101 102 102 103 104
average = 102, n = 7
xi     (xi - x bar)     (xi - x bar)^2     xi^2
101    -1               +1                 10201
sum of (xi - x bar)^2 = 8
variance = 8/(7-1) = 1.33
standard deviation = square root of variance
standard deviation = 1.15
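The calculation above can be written out as a short sketch of the definitional formula. It prints one row per data value, which gives the full deviation table for the seven values; the variable names are illustrative.

```python
import math

data = [101, 101, 101, 102, 102, 103, 104]
n = len(data)
mean = sum(data) / n

# One row per value: xi, deviation from the mean, squared deviation.
for x in data:
    print(f"{x}  {x - mean:+.0f}  {(x - mean) ** 2:.0f}")

sum_sq_dev = sum((x - mean) ** 2 for x in data)
variance = sum_sq_dev / (n - 1)   # sample variance, divides by n - 1
std_dev = math.sqrt(variance)

print(f"sum of squared deviations = {sum_sq_dev:.0f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```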
Computational Formula
Many equations have a definitional formula and a computational formula
The computational formula is much easier to use for calculations
Variance = (sum of xi^2 - (sum of xi)^2/n) / (n - 1)
for the example
Variance = (72836 - 714^2/7) / (7 - 1) = 1.33
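As a quick check, the sketch below evaluates the computational formula on the same seven values; it needs only the sum of the values and the sum of their squares. The variable names are assumptions for the example.

```python
data = [101, 101, 101, 102, 102, 103, 104]
n = len(data)

sum_x = sum(data)                     # sum of xi
sum_x_sq = sum(x * x for x in data)   # sum of xi squared

# Computational formula: no need to compute the mean or the deviations first.
variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

print(f"sum of xi = {sum_x}, sum of xi^2 = {sum_x_sq}")
print(f"variance = {variance:.2f}")
```

Both formulas give the same variance; the computational form is simply easier to work by hand or on a calculator.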
Coefficient of Variation = 100 * standard deviation / mean
Why use?
Interpreting Standard Deviation
Two rules of thumb are helpful for interpreting standard deviation
Chebyshev's Rule applies to all distributions regardless of shape or lack of symmetry
Empirical Rule applies only to mound-shaped, symmetrical distributions
Chebyshev's Rule
for any number k greater than one, at least (1 - 1/k^2) of the measurements fall within k standard deviations of the mean
Check to see if this applies to US Open data!
US Open mean = 109.7, stdev = 5.6
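One way to carry out this check is sketched below. It computes the mean and the sample standard deviation (dividing by n - 1) directly from the 16 US-Open serve speeds listed earlier, then compares the observed share within k standard deviations with the Chebyshev bound. The choice of k values and the helper name `share_within` are assumptions for the example.

```python
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

mean = statistics.mean(us_open)
sd = statistics.stdev(us_open)   # sample standard deviation (n - 1)

def share_within(k):
    """Fraction of serve speeds within k standard deviations of the mean."""
    inside = [x for x in us_open if abs(x - mean) <= k * sd]
    return len(inside) / len(us_open)

print(f"mean = {mean:.1f}, stdev = {sd:.1f}")
for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2       # Chebyshev lower bound
    print(f"k = {k}: observed {share_within(k):.3f}, bound {bound:.3f}")
```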
Empirical Rule
approximately 68% within +/- 1 sd
approximately 95% within +/- 2 sd
approximately 99.7% within +/- 3 sd
Does this apply to US Open data? Do you expect it?
These percentages are rounded values for the normal distribution.
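For comparison, the sketch below computes the share of a normal distribution within k standard deviations of the mean, using the identity P(|Z| <= k) = erf(k / sqrt(2)), and sets it beside the share observed in the 16 US-Open serve speeds. The helper names are made up for the example, and with only 16 values the observed shares should not be expected to match closely.

```python
import math
import statistics

us_open = [108, 112, 106, 114, 109, 114, 105, 115,
           121, 109, 102, 108, 98, 115, 109, 110]

mean = statistics.mean(us_open)
sd = statistics.stdev(us_open)   # sample standard deviation (n - 1)

def normal_share(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))

def observed_share(k):
    """Fraction of serve speeds within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in us_open) / len(us_open)

for k in (1, 2, 3):
    print(f"+/- {k} sd: normal {100 * normal_share(k):.2f}%, "
          f"US Open {100 * observed_share(k):.2f}%")
```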