DATA ANALYSIS
5th Module, 2002-2003
A draft of the course to be offered
at the Applied Economics Track
Professor: Stanislav Kolenikov,
skolenikov@cefir.ru,
skolenik@unc.edu
The purpose of the course is to equip
the students of the AET with the contemporary statistical methods
of data analysis not covered in the standard econometric courses.
The actual coverage may differ depending on the needs of the
students. It might be beneficial for them to have started working
on their Master's theses prior to taking this course, so that
they can pick the data analysis methods they need for their
research and submit the empirical part of their theses as a
term paper in the Data Analysis course.
References
Aivazian S.A., Mkhitarian V.S. Applied
statistics and essentials of econometrics (in Russian). Moscow,
UNITY, 1999.
T. Hastie, R. Tibshirani, J. H. Friedman.
The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer, 2001.
K. A. Bollen. Structural Equations
with Latent Variables. Wiley, 1989.
D. Huff. How to Lie With Statistics.
Norton, 1993.
F. Mosteller, J. W. Tukey. Data Analysis
and Regression: A Second Course in Statistics. Addison-Wesley,
1977.
StataCorp. Stata statistical software:
Release 7. College Station, TX, US, 2000.
Topics (about a week per topic)
A. Statistical graphics.
The use of graphics in assessing the data is natural at the early
stages of data analysis. The researcher might need to get an
idea of the overall distribution of the data points, the dependencies
within the data set, the presence of trends, groups of objects,
outlying observations, etc., in order either to form the research
hypotheses or to describe the data. The tools to be discussed
are box-whisker plots, quantile plots, smoothers for density
and regression estimation, and extensions of the usual scatterplots.
Uses and misuses of the graphical tools will also be discussed.
B. Factor analysis and principal
components.
Principal components are used to summarize the data by constructing
a single overall linear index that is efficient in a certain
sense. They also arise naturally from the need in statistical
graphics to embed a multivariate data set into two dimensions
for plotting. The statistical properties of the principal components
and a number of asymptotic results will also be mentioned.
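As an illustration of the "single overall linear index" idea, the
following is a minimal sketch, not part of the course materials, that
finds the first principal component by power iteration on the sample
covariance matrix. The function name and the toy data are hypothetical,
and power iteration is only one of several ways to obtain the leading
eigenvector.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length lists (observations by variables).
    Assumes the data are not constant and that the starting vector is
    not orthogonal to the leading eigenvector of the covariance matrix.
    """
    n = len(rows)
    p = len(rows[0])
    # Centre each variable at its sample mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (divisor n - 1).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: multiply by the covariance matrix and renormalise.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The index (score) of each observation is its projection on v.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores

# Hypothetical toy data: two strongly correlated variables.
data = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
direction, scores = first_principal_component(data)
print(direction)  # both weights are close to 0.71
print(scores)     # one index value per observation
```

With two nearly collinear variables the index weights them almost
equally, which is the sense in which one number summarizes both.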
C. Path analysis
and structural equation models.
This is a more analytic extension of the factor analysis methods,
used when the researcher has a relatively clear idea, or hypotheses,
about the dependence between different variables in her data
set, and a number of concepts to be related to those variables.
The technique was developed in the social sciences and is starting
to find applications in econometrics.
D. Cluster analysis and discrimination.
Cluster analysis seeks to answer the question: are there distinct
groups of points in the data? This is an important issue, as
mechanical application of the standard econometric routines in the
presence of clusters is highly likely to lead to biases in both
the point estimates and the standard errors of parameter estimates,
thus resulting in incorrect inferences.
E. Other dimension reduction / graphical
techniques
If time permits, some other topics can be covered, such as multidimensional
scaling, which aims at constructing the lowest-dimensional space
for a multivariate set of characteristics; projection pursuit,
which aims at finding a direction in the data that is related to
a particular “feature” such as regression or clustering; or
functional data analysis, which works with functions or high-dimensional
objects in place of individual observations.
F. Outlier diagnostics and robust
methods
Outliers are a common problem rather than something unusual
in economic research: Moscow is different from the rest of Russia,
China is different from other transition countries, etc.
A qualified researcher should be able to isolate the outliers,
or clusters of outliers, and assess their influence on the rest
of the statistical analysis. If outliers still carry substantial
information and should not be excluded from the analysis, the need
for robust and distribution-free methods arises.
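To make the idea of isolating outliers concrete, here is a minimal
sketch, not taken from the course materials, that flags observations
using modified z-scores built from the median and the median absolute
deviation (MAD). The function names and the data are hypothetical; the
scaling constant 0.6745 and the cutoff of 3.5 follow a commonly used
rule of thumb and are assumptions rather than course recommendations.

```python
import statistics

def modified_z_scores(values):
    """Modified z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    # 0.6745 makes the MAD comparable to the standard deviation
    # under normality.
    return [0.6745 * (x - med) / mad for x in values]

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [x for x, z in zip(values, scores) if abs(z) > threshold]

# Hypothetical regional indicator with one extreme region.
regions = [10, 12, 11, 13, 12, 11, 95]
print(statistics.mean(regions))    # pulled upward by the extreme value
print(statistics.median(regions))  # robust location estimate
print(flag_outliers(regions))      # the extreme region is flagged
```

The mean is dragged toward the extreme observation while the median is
not, which is the kind of contrast that motivates robust methods.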
G. Sampling and survey data analysis
This topic is rather distant from the earlier parts of the
course. One of the primary sources of statistical information
in business and economics is various surveys of populations,
enterprises, customers, etc. It is extremely important to learn
not to be misled by numbers reported without reference
to the sampling techniques and sampling problems for the population
of interest. Contemporary sampling schemes will be studied,
and the statistical methods for obtaining accurate estimates and
their standard errors will be covered.
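As a small illustration of why the sampling scheme matters, the sketch
below, not part of the course materials, estimates a population mean
and its standard error from a stratified sample. It assumes simple
random sampling without replacement within each stratum and known
stratum sizes; the function name and all numbers are hypothetical.

```python
import math
import statistics

def stratified_mean(strata):
    """Estimate a population mean from a stratified sample.

    strata: list of (N_h, sample_h) pairs, where N_h is the stratum
    population size and sample_h is the list of sampled values
    (at least two per stratum).
    Returns (estimate, standard_error).
    """
    N = sum(N_h for N_h, _ in strata)
    estimate = 0.0
    variance = 0.0
    for N_h, sample in strata:
        n_h = len(sample)
        W_h = N_h / N                # stratum weight
        fpc = 1 - n_h / N_h          # finite population correction
        estimate += W_h * statistics.mean(sample)
        variance += W_h ** 2 * fpc * statistics.variance(sample) / n_h
    return estimate, math.sqrt(variance)

# Hypothetical survey: 900 small firms and 100 large firms,
# with three firms sampled from each stratum.
strata = [(900, [10, 12, 14]), (100, [50, 60, 70])]
estimate, se = stratified_mean(strata)
unweighted = statistics.mean([10, 12, 14, 50, 60, 70])
print(estimate, se)  # weighted estimate and its standard error
print(unweighted)    # ignores the design and overstates the mean
```

Because large firms are heavily oversampled here, the unweighted sample
mean is far from the design-based estimate.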
Course requirements
The course grade will consist of 20% homework, 20% midterm test,
and 60% term paper.
It may be arranged that the paper be submitted several weeks
after the course is completed, and graded by the end of the
next module, so that students have the full power of the methods
discussed in the course at their disposal before they start
working on the paper. Alternatively, the empirical part of the
Master's thesis can serve as the term paper for the course.