Statlib
Datasets
- 1993.expo
- Andrews
- CSB (Case Studies in Biometry)
Rest of the mirrored Statlib datasets:
- agresti
- Contains data from "An Introduction to Categorical Data
Analysis," by Alan Agresti, John Wiley, 1996, plus SAS code for
various analyses. (aa@stat.ufl.edu) [28/Feb/96] (12k)
- alr
- This file contains data from Applied Linear Regression, 2nd
Edition, by Sanford Weisberg, John Wiley, 1985
(sandy@umnstat.stat.umn.edu) (36808 bytes)
- Andrews
- This is the data for the book DATA by Andrews and Herzberg. Available
by FTP, gopher, WWW, but not e-mail.
- backache
- This file contains the `backache in pregnancy' data analysed
in Exercise D.2 of Problem-Solving: A Statistician's Guide, 2nd
edn., by C. Chatfield, Chapman and Hall, 1995.
(cc@maths.bath.ac.uk) [2/Oct/95] (16 kbytes)
- balloon
- A data set consisting of 2001 observations of radiation, taken
from a balloon. The data contain a trend and outliers. Source:
Laurie Davies (mata00@de0hrz1a.BITNET) (43k) [5/Feb/93]
- baseball
- Data on the salaries of North American Major League Baseball
players. The dataset has performance and salary information on
players during the 1986 season. This was the 1988 ASA Graphics
Section Poster Session dataset, organized by Lorraine Denby.
There are two files to retrieve:
- baseball.data
- consists of a shar archive of the data and helpful
information including a description of the data, pitcher,
hitter, and team statistics (54448 bytes)
- baseball.corr
- A set of differences from the published data set (in Unix
diff format)
- baseball.hoaglin-velleman
- Another set of differences from the published data set (in
Unix diff format). See Hoaglin and Velleman, The American
Statistician, August, 1994, pages 227--285
- biomed
- I was able to find the old 1982 "biomedical dataset" generated
by Larry Cox. It consists of two groups. These give observation
number, blood id number, age, date, and four blood measurements. I
don't really remember the instructions for analysis, although I
seem to recall that the idea was to figure out if some of the
blood measurements that were less difficult to obtain were as good
at distinguishing carriers from normals as the more difficult
measurements. Unfortunately, I don't remember which measurement
is which. There are two files to retrieve:
- biomed.desc
- a short description of the data and a reference (1457
bytes)
- biomed.data
- A shar archive containing the data for carriers and
normals. (7843 bytes)
- bodyfat
- Lists estimates of the percentage of body fat determined by
underwater weighing and various body circumference measurements
for 252 men. Submitted by Roger Johnson (rjohnso@silver.sdsmt.edu)
[2/Oct/95] (35 kbytes)
- bolts
- Data from an experiment on the effects of machine adjustments
on the time to count bolts. Data appear as the STATS (Issue 10)
Challenge. Submitted by W. Robert Stephenson
(wrstephe@iastate.edu). [8/Nov/93] (5k)
- boston
- The Boston house-price data of Harrison, D. and Rubinfeld,
D.L. 'Hedonic prices and the demand for clean air', J. Environ.
Economics & Management, vol.5, 81-102, 1978. Used in Belsley,
Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980.
(51256 bytes)
- cars
- This was the 1983 ASA Data Exposition dataset. The dataset was
collected by Ernesto Ramos and David Donoho and dealt with
automobiles. I don't remember the instructions for analysis. Data
on mpg, cylinders, displacement, etc. (8 variables) for 406
different cars. The dataset includes the names of the cars. The
data are in one file:
- cars.data
- A shar archive containing files with a description of the
cars data, the names of the cars, and the cars data itself.
(33438 bytes)
- cars.desc
- The original instructions for this exposition. (6206 bytes)
- cloud
- These data are those collected in a cloud-seeding experiment
in Tasmania. The rainfalls are period rainfalls in inches. TE and
TW are the east and west target areas respectively, while NC, SC
and NWC are the corresponding rainfalls in the north, south and
north-west control areas respectively. S = seeded, U = unseeded.
Submitted by Alan Miller (alan@dmsmelb.mel.dms.CSIRO.AU)
[4/May/94] (7 kbytes)
- chscase
- A collection of the data sets used in the book "A Casebook for
a First Course in Statistics and Data Analysis," by Samprit
Chatterjee, Mark S. Handcock and Jeffrey S. Simonoff, John Wiley
and Sons, New York, 1995. Submitted by Samprit Chatterjee
(schatterjee@stern.nyu.edu), Mark Handcock
(mhandcock@stern.nyu.edu) and Jeff Simonoff
(jsimonoff@stern.nyu.edu). (325 kbytes) Updated, [1/Dec/95]
- christensen
- Contains the data from "Analysis of Variance, Design, and
Regression: Applied Statistical Methods" by Ronald Christensen
(1996, Chapman and Hall). Ronald Christensen
(fletcher@math.unm.edu), [22/Oct/96] (57k)
- cjs.sept95.case
- Data on tree growth used in the Case Study published in the
September, 1995 issue of the Canadian Journal of Statistics. Nancy
Reid (reid@utstat.utstat.toronto.edu) [4/Oct/95] (141k)
- colleges
- 1995 Data Analysis Exposition sponsored by the Statistical
Graphics Section of the American Statistical Association. The U.S.
News data contains information on tuition, etc., for over 1300
schools, while the AAUP data includes average salary, etc. Robin
Lock, (rlock@vm.stlawu.edu).
- confidence
- This file contains the monthly frequencies for six consumer
confidence items collected by the Conference Board and the
University of Michigan in 1992. Reference in Sociological
Methodology. Submitted by Gordon Bechtel
(BECHTEL@NERVM.NERDC.UFL.EDU) [22/Oct/96] (6k)
- csb
- See the separate csb collection for data from the book "Case
Studies in Biometry".
- detroit
- Data on annual homicides in Detroit, 1961-73, from Gunst &
Mason's book `Regression Analysis and its Application', Marcel
Dekker. Contains data on 14 relevant variables collected by J.C.
Fisher. (alan@dmsmelb.mel.dms.csiro.au) [10/Feb/92] (3357 bytes)
- diggle
- Data-sets from Diggle, P.J. (1990). Time Series : A
Biostatistical Introduction. Oxford University Press. Submitted by
Peter Diggle, (maa026@central1.lancaster.ac.uk) (35800 bytes)
- djdc0093
- Dow-Jones Industrial Average (DJIA) closing values from 1900
to 1993. See also spdc2693. Submitted by
Eduardo Ley (edley@eco.uc3m.es) [13/Mar/96] (383 kbytes)
- econdata
- Directions for obtaining a large collection of economic data from
the University of Maryland. [6/Nov/92] (22kb)
- fienberg
- The data from Fienberg's "The Analysis of Cross-Classified
Data", in a form that can easily be read into Glim (or easily read
by a human). [25/Sept/91] (mikem@stat.cmu.edu) (14398 bytes).
- fraser-river
- Time series of monthly flows for the Fraser River at Hope,
B.C. A. Ian McLeod (aim@julian.uwo.ca) [26/April/93] (10 kbytes)
- hip
- This is the hip measurement data from Table B.13 in
Chatfield's Problem Solving (1995, 2nd edn, Chapman and Hall). It
is given in 8 columns. First 4 columns are for Control Group. Last
4 columns are for Treatment group (Note there is no pairing.
Patient 1 in Control Group is NOT patient 1 in Treatment Group).
(cc@maths.bath.ac.uk) [28/Feb/96] (2k)
- hipel-mcleod
- McLeod Hipel Time Series Datasets Collection. The shar file,
mhsets.shar, contains over 300 time series datasets taken from
various case studies. These data sets are suitable for model
building exercises such as are discussed in our textbook, "Time
Series Modelling of Water Resources and Environmental Systems" by
K.W. Hipel and A.I. McLeod (1994), published by Elsevier,
Amsterdam, ISBN 0-444-89270-2 (1013 pages). For PC users there is
also a zip file, mhsets.zip. The shar file and the zip file are
about 1.7 Mb and 0.5 Mb respectively. [1/Mar/95] Ian McLeod
(aim@fisher.stats.uwo.ca)
- humandevel
- United Nations Development Program, Human Development Index. A
nation's HDI is composed of life expectancy, adult literacy and
Gross National Product per capita. Information on 130 countries
plus documentation. (arnold@stat.ncsu.edu (Tim Arnold))
[31/Oct/91] (10031 bytes).
- irish.ed
- Longitudinal educational transition data set for a sample of
500 Irish students, with 4 independent variables (sex, verbal
reasoning score, father's occupation, type of school). Submitted
by Adrian E. Raftery (raftery@stat.washington.edu), [20/Dec/93]
(13 kbytes)
- lmpavw
- Time series used in "Long-Memory Processes, the Allan Variance
and Wavelets" by D. B. Percival and P. Guttorp, a chapter in
"Wavelets in Geophysics", edited by E. Foufoula-Georgiou and P.
Kumar, Academic Press, 1994. This "time" series was collected by
Mike Gregg, Applied Physics Laboratory, University of Washington,
and is a measurement of vertical shear (in units of 1/seconds)
versus depth (in units of meters) in the ocean. The role of "time"
in this series is thus played by depth. Permission has been
obtained to redistribute this data. Questions concerning this
series should be sent to Don Percival (dbp@apl.washington.edu).
[6/Feb/94] (62 kbytes)
- longley
- The infamous Longley data, "An appraisal of least-squares
programs from the point of view of the user", JASA, 62(1967)
p819-841. (therneau@mayo.edu) (1301 bytes)
- newton_hema
- Data on fluctuating proportions of marked cells in marrow from
heterozygous Safari cats--from a study of early hematopoiesis.
Michael Newton (newton@stat.wisc.edu) [8/Nov/93] (5k)
- nflpass
- Lists all-time NFL passers through 1994 by the NFL passing
efficiency rating. Associated passing statistics from which this
rating is computed are included. Roger W. Johnson,
rjohnso@silver.sdsmt.edu [28/Feb/96] (8k)
- nonlin
- The data sets from Bates and Watts (1988) "Nonlinear
Regression Analysis and Its Applications", Wiley. They are in S
dump format as data frames. (If you don't know what a data frame
is, don't worry. Just consider them to be lists. Data frames are
described in a book on "Statistical Modelling in S".)
(bates@stat.wisc.edu) [7/Feb/90] (19851 bytes)
- pbc
- The data set found in appendix D of Fleming and Harrington,
Counting Processes and Survival Analysis, Wiley, 1991. Submitted
by therneau@Mayo.EDU (Terry Therneau), [25/Jul/94] (36 kbytes)
- places
- Data taken from the Places Rated Almanac, giving the ratings
on 9 composite variables of 329 locations. (From an ASA data
exposition, 1986) The data are in one file:
- places.data
- A shar archive of three files which document the data,
present the data itself, and provide a key to the actual places
used. (27720 bytes)
- pollen
- Synthetic dataset about the geometric features of pollen
grains. There are 3848 observations on 5 variables. From the 1986
ASA Data Exposition dataset, made up by David Coleman of RCA Labs.
The data are in one file:
- pollen.data
- A shar archive of 9 files. The first file gives a short
description of the data, then there are 8 data files, each with
481 observations. (205954 bytes)
- pollen.extra
- Some extra comments about the data. Look here for hints.
- pollution
- This is the pollution data so loved by writers of papers on
ridge regression. Source: McDonald, G.C. and Schwing, R.C. (1973)
'Instabilities of regression estimates relating air pollution to
mortality', Technometrics, vol.15, 463-482. (8540 bytes)
- profb
- Scores and point spreads for all NFL games in the 1989-91
seasons. Contributed by Robin Lock (rlock@stlawu.bitnet)
[15/Sept/92] (27733 bytes)
- prnn
- This shar archive contains the datasets used in `Pattern
Recognition and Neural Networks' by B.D. Ripley, Cambridge
University Press (1996), ISBN 0 521 46086 7
(ripley@stats.ox.ac.uk) [1/Dec/95] (101 kbytes)
- rabe
- This file contains data from Regression Analysis By Example,
2nd Edition, by Samprit Chatterjee and Bertram Price, John Wiley,
1991. (schatter@stern.nyu.edu) [6/Feb/92] (40309 bytes)
- rir
- This file contains data from Residuals and Influence in
Regression, R. Dennis Cook and Sanford Weisberg, Chapman and Hall,
1982. (sandy@umnstat.stat.umn.edu) (5206 bytes). [Updated
25/May/93]
- riverflow
- Datasets mentioned in "Parsimony, Model Adequacy and Periodic
Correlation in Time Series Forecasting", ISI Review, A.I. McLeod
(1992, to appear). Submitted by A.Ian McLeod (aim@stats.uwo.ca).
Time series data. A shar archive. [22/Jan/92] (294052 bytes).
- sapa
- Time series used in "Spectral Analysis for Physical
Applications" by D. B. Percival and A. T. Walden, Cambridge
University Press, 1993. (dbp@apl.washington.edu) [4/Nov/92] (50788
bytes)
- saubts
- Two ocean wave time series used in "Spectral Analysis of
Univariate and Bivariate Time Series" by D. B. Percival, Chapter
11 of "Statistical Methods for Physical Science," edited by J. L.
Stanford and S. B. Vardeman, Academic Press, 1993.
(dbp@apl.washington.edu) [14/Apr/93] (47 kbytes)
- sensory
- Data for the sensory evaluation experiment in Brien, C.J. and
Payne, R.W. (1996) Tiers, structure formulae and the analysis of
complicated experiments. Submitted for publication. Chris Brien
(matcjb@ntx.city.unisa.edu.au) [22/Oct/96] (19k)
- ships
- Ship damage data, from "Generalized Linear Models" by
McCullagh and Nelder, section 6.3.2, page 137. (therneau@mayo.edu)
(1709 bytes)
- sleep
- Data from which conclusions were drawn in the article "Sleep
in Mammals: Ecological and Constitutional Correlates" by Allison,
T. and Cicchetti, D. (1976), _Science_, November 12, vol. 194, pp.
732-734. Includes brain and body weight, life span, gestation
time, time sleeping, and predation and danger indices for 62
mammals. Submitted by Roger Johnson (rjohnso@silver.sdsmt.edu)
[27/Jul/94] (8k)
- smoothmeth
- A collection of the data sets used in the book "Smoothing
Methods in Statistics," by Jeffrey S. Simonoff, Springer-Verlag,
New York, 1996. Submitted by Jeff Simonoff
(jsimonoff@stern.nyu.edu). [13/Mar/96] (242 kbytes)
- socmob
- Social Mobility (US, 1973). Two four-way 17x17x2x2 contingency
tables: Father's occupation, Son's occupation (first and current),
family structure, race. Submitted by Timothy J. Biblarz
(biblarz@uscvm.bitnet). [corrected 25/Jan/93]
- spdc2693
- Standard and Poor's 500 Index closing values from 1926 to
1993. See also djdc0093. Submitted by
Eduardo Ley (edley@eco.uc3m.es) [13/Mar/96] (333 kbytes)
- stanford
- Two versions of the Stanford Heart Transplant Data, one from "The
Statistical Analysis of Failure Time Data" by Kalbfleisch and
Prentice, Appendix I, pages 230-232, the other from the original
paper by Crowley and Hu. (therneau@mayo.edu) (15003 bytes)
[Corrected, 8/Mar/93]
- stanford.diff
- The differences between the two Stanford data sets.
- strikes
- Data on industrial disputes and their covariates in 18 OECD
countries, 1951-1985. Prepared by Bruce Western
(western@datacomm.iue.it) [2/Oct/95] (44k)
- tecator
- The task is to predict the fat content of a meat sample on the
basis of its near infrared absorbance spectrum. Regression.
Submitted by thodberg@nn.meatre.dk (Hans Henrik Thodberg)
[23/Jan/95] (302 kbytes)
- transplant
- Data on deaths within 30 days of heart transplant surgery at
131 U.S. hospitals. See Bayesian Biostatistics, D. Berry & D.
Stangl, eds, 1996, Marcel Dekker. Cindy L. Christiansen and Carl
N. Morris. Cindy Christiansen [22/Oct/96] (3k)
- tumor
- Tumor recurrence data for patients with bladder cancer. Taken
from Wei, Lin and Weissfeld, JASA 1989, p 1067. From:
therneau@mayo.edu (Terry Therneau) [23/Mar/93] [5/Jun/96] (3k)
- veteran
- Veterans Administration Lung Cancer Trial. Taken from
Kalbfleisch and Prentice, pages 223-224 (therneau@mayo.edu) (8249
bytes)
- visualizing.data
- This shar file contains 25 data sets from the book Visualizing
Data published by Hobart Press (books@hobart.com) and written by
William S. Cleveland (wsc@research.att.com). There is also a
README file so there are 26 files in all. Each of the 25 files has
the data in an ascii table format. The name of each data file is
the name of the data set used in the book. To find the description
of the data set in the book look under the entry "data, name" in
the index. For example, one data set is barley. To find the
description of barley, look in the index under the entry "data,
barley". The S archive of Statlib has a file created by S that
contains the data sets in a format that makes it easy to read them
into S. (536 kbytes) [12/Nov/93][17/Oct/94]
- wind
- Daily average wind speeds for 1961-1978 at 12 synoptic
meteorological stations in the Republic of Ireland (Haslett and
Raftery, Applied Statistics 1989). There is a LARGE amount of
data. Please be sure you want it before you ask for it!! There are
two entries to obtain.
- wind.desc
- A short description of the data (815 bytes)
- wind.data
- The data (532494 bytes).
- witmer
- A shar archive of data from the book Data Analysis: An
Introduction (1992), Prentice Hall, by Jeff Witmer. Submitted by Jeff
Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [28/Jun/94] (29 kbytes)
- wseries
- These data tell whether or not the home team won for each game
played in all World Series prior to 1994. The data appear as the
STATS Challenge for Issue 11. Submitted by Jeff Witmer
(fwitmer@ocvaxa.cc.oberlin.edu) [20/Mar/94] (3 kbytes)
- Vinnie.Johnson
- Data on the shooting of Vinnie Johnson of the Detroit Pistons
during the 1985-1986 through 1988-1989 seasons. Source was the New
York Times. Submitted by Rob Kass (kass@stat.cmu.edu) [18/Aug/95]
(26 kbytes)
- submissions
- Information on how to submit data to the original StatLib
archive.
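
The hip entry above describes an 8-column layout in which the first 4 columns belong to the Control Group and the last 4 to the Treatment Group, with no pairing between the two halves of a row. A minimal Python sketch of separating the two groups is given here. It assumes every data row holds exactly eight whitespace-separated numeric fields, which this listing does not confirm; the function name `split_hip_rows` and the sample values are hypothetical and are not taken from the dataset.

```python
def split_hip_rows(lines):
    """Split 8-column rows into separate control and treatment records.

    Assumption: each data row has exactly eight whitespace-separated
    numeric fields. Rows that do not fit (blank lines, headers, short
    rows) are skipped rather than guessed at.
    """
    control, treatment = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # not a data row under the stated assumption
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # non-numeric row, e.g. a header
        # The two halves are unrelated patients, so store them separately.
        control.append(values[:4])
        treatment.append(values[4:])
    return control, treatment


# Made-up rows for illustration only; these are not real measurements.
sample = [
    "1 2 3 4 5 6 7 8",
    "",
    "9 10 11 12 13 14 15 16",
]
control, treatment = split_hip_rows(sample)
```

Keeping the two groups in separate lists, instead of one row-aligned table, reflects the note that patient 1 in the Control Group is not patient 1 in the Treatment Group.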