Data, source and data description

Sports analytics is a booming field. Owners, coaches, and fans are using statistical measures and models of all kinds to study the performance of players and teams. A very simple example is provided by the study of yearly data on batting averages for individual players in the sport of baseball.

The file used here contains 588 rows of data for a select group of players during the years 1960-2004, and it was obtained from the Lahman Baseball Database.

A player’s batting average is the ratio of his number of hits to his number of opportunities-to-hit (so-called “at-bats”). There are 162 games in the modern regular season, and a regular position player (a non-pitcher who is a starter at his position) typically has 4 or 5 at-bats per game and accumulates 600 or more in a season if he does not miss many games due to injuries or being benched or suspended. Most players have batting averages somewhere between 0.250 and 0.300. Because a player’s batting average in a given year of his career is an average of a very large number of (almost) statistically independent random variables, the central limit theorem suggests that it should be approximately normally distributed around a hypothetical true value that is determined by his innate hitting ability.
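The normal-approximation argument above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it treats every at-bat as an independent hit-or-miss trial with the same probability, and the values TRUE_AVG = 0.275 and AT_BATS = 600 are hypothetical choices consistent with the ranges quoted in the text, not figures from the data file. Under those assumptions the standard deviation of a season's batting average is sqrt(p(1-p)/n), which is about 0.018 here.

```python
import math
import random


def batting_average(hits, at_bats):
    """Batting average = hits divided by at-bats."""
    return hits / at_bats


def simulate_season(true_avg, at_bats, rng):
    """Simulate one season in which each at-bat is an independent trial."""
    hits = sum(1 for _ in range(at_bats) if rng.random() < true_avg)
    return batting_average(hits, at_bats)


def standard_error(true_avg, at_bats):
    """Theoretical standard deviation of a season's batting average."""
    return math.sqrt(true_avg * (1 - true_avg) / at_bats)


# Hypothetical values chosen for illustration only.
TRUE_AVG = 0.275
AT_BATS = 600
N_SEASONS = 2000

rng = random.Random(0)  # fixed seed so the run is repeatable
seasons = [simulate_season(TRUE_AVG, AT_BATS, rng) for _ in range(N_SEASONS)]

sim_mean = sum(seasons) / len(seasons)
sim_sd = math.sqrt(
    sum((x - sim_mean) ** 2 for x in seasons) / (len(seasons) - 1)
)

print(f"theoretical standard error: {standard_error(TRUE_AVG, AT_BATS):.4f}")
print(f"simulated mean:             {sim_mean:.4f}")
print(f"simulated std deviation:    {sim_sd:.4f}")
```

A standard error of roughly 0.018 means that a player whose true average is 0.275 could plausibly post anything from about 0.240 to 0.310 in a single season through chance alone, which is why a single year's batting average is a noisy measure of ability.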

Also, it is reasonable to expect that these hypothetical true values are themselves approximately normally distributed in the population and that they are to some extent “inherited” from one year of a player’s career to the next. Hence we should not be surprised to find that the empirical distribution of batting averages of all players across all years is very close to a normal distribution and that a player’s performance in a given year is positively correlated with his batting average in prior years and hence predictable by linear regression. It is not actually necessary for individual variables in a regression model to be normally distributed; only the prediction errors need to be. Still, the case in which all the variables are normally distributed is the best-case scenario and yields the prettiest pictures. (The same type of analysis could be done with statistical measures of performance in other sports, such as scoring averages or free-throw percentages in basketball, and qualitatively similar results would be obtained. Regression to the mean is found everywhere.)
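The regression-to-the-mean effect described above can be illustrated with simulated players. This is a minimal sketch, not the analysis of the actual file: it assumes each player's true ability is drawn from a normal distribution and stays fixed across two years, and that each year's observed average is the true value plus independent normal noise. The constants ABILITY_MEAN, ABILITY_SD, and NOISE_SD are hypothetical. Under these assumptions the slope of the regression of this year's average on last year's is expected to be about ABILITY_SD² / (ABILITY_SD² + NOISE_SD²), which is less than 1.

```python
import random


def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical parameters chosen for illustration only.
N_PLAYERS = 5000
ABILITY_MEAN = 0.275   # population mean of true batting ability
ABILITY_SD = 0.020     # spread of true ability across players
NOISE_SD = 0.018       # year-to-year sampling noise for one player

rng = random.Random(1)  # fixed seed so the run is repeatable
ability = [rng.gauss(ABILITY_MEAN, ABILITY_SD) for _ in range(N_PLAYERS)]
last_year = [a + rng.gauss(0, NOISE_SD) for a in ability]
this_year = [a + rng.gauss(0, NOISE_SD) for a in ability]

slope, intercept = ols_slope_intercept(last_year, this_year)
expected_slope = ABILITY_SD**2 / (ABILITY_SD**2 + NOISE_SD**2)

# Prediction for a player who hit 0.320 last year (well above the mean).
predicted = intercept + slope * 0.320

print(f"fitted slope:   {slope:.3f} (theory: {expected_slope:.3f})")
print(f"prediction after a 0.320 season: {predicted:.3f}")
```

Because the slope is below 1, the prediction for a player coming off an unusually good season lies between last year's figure and the population mean, even though nothing about his ability has changed in the simulation.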

Each row in the data file contains statistics for a single player for a single year in which the player had at least 400 at-bats and also had at least 400 at-bats in the previous year. These thresholds were imposed to ensure that only regular players (the best on their teams at their respective positions) were included and that the number of at-bats behind each batting average was large. The statistics for the analysis consist of batting average, batting average lagged by one year, and cumulative batting average lagged by one year.

The first few rows look like this:


A “lagged” variable is one that “lags behind” by a specified number of periods: its value in a given row is the observation of the same variable from an earlier period. For example, Hank Aaron’s value of 0.292 for BattingAverageLAG1 in 1961 is by definition the same as the value of BattingAverage for him in 1960.

In general in this file, BattingAverageLAG1 in a given row is equal to BattingAverage in the previous row if the previous row corresponds to the previous year for the same player.
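The rule just stated can be written out as a short routine. The sketch below is an assumption-laden illustration rather than the procedure used to build the file: the function name add_lag1, the row layout (dictionaries with "player", "year", and "BattingAverage" keys), and the sample players and values are all hypothetical, with the 0.292 figure chosen only to echo the example above. Where the previous row is not the same player's previous year, the sketch leaves the lag as None; in the actual file that value would have to come from a season not listed in the preceding row.

```python
def add_lag1(rows):
    """Return copies of rows with a BattingAverageLAG1 field added.

    Assumes rows are sorted by player and then by year. The lag is filled
    from the previous row only when that row is the same player's
    immediately preceding year; otherwise it is left as None.
    """
    result = []
    previous = None
    for row in rows:
        lag = None
        if (
            previous is not None
            and previous["player"] == row["player"]
            and previous["year"] == row["year"] - 1
        ):
            lag = previous["BattingAverage"]
        result.append({**row, "BattingAverageLAG1": lag})
        previous = row
    return result


# Invented sample rows for illustration; not taken from the data file.
sample = [
    {"player": "Player A", "year": 1960, "BattingAverage": 0.292},
    {"player": "Player A", "year": 1961, "BattingAverage": 0.310},
    {"player": "Player A", "year": 1963, "BattingAverage": 0.301},  # 1962 missing
    {"player": "Player B", "year": 1964, "BattingAverage": 0.268},
]

lagged = add_lag1(sample)
for row in lagged:
    print(row["player"], row["year"], row["BattingAverage"],
          row["BattingAverageLAG1"])
```

Checking both the player and the year matters: without the year check, the 1963 row would wrongly inherit the 1961 value, and without the player check, the first row for one player would inherit the last value of the player listed before him.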