# How Do I Compute sigma? Let Me Count the Ways.

For the purpose of discussion, I invented a data set to analyze, as shown in Table 1. The data reflect a process that starts with an average of 10. The process is influenced by a special cause that makes the average increase by 1 every hour. The standard deviation remains constant at 1, and the data follow a normal distribution.
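A process like the one described can be simulated in a few lines. The sketch below is an illustration only: it does not reproduce the values in Table 1, the function name and seed are hypothetical, and it assumes hour No. 1 has an average of exactly 10 with 10 hourly subgroups of five measurements each.

```python
import random

def simulate_drifting_process(hours=10, subgroup_size=5, start_mean=10.0,
                              drift_per_hour=1.0, sigma=1.0, seed=0):
    """Return one subgroup per hour from a normal process whose mean drifts.

    Assumption: the mean is start_mean in the first hour and rises by
    drift_per_hour each hour after that; sigma stays constant.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    subgroups = []
    for hour in range(hours):
        mean = start_mean + drift_per_hour * hour
        subgroups.append([rng.gauss(mean, sigma) for _ in range(subgroup_size)])
    return subgroups

for hour, group in enumerate(simulate_drifting_process(), start=1):
    print(hour, [round(x, 2) for x in group])
```

Changing `seed` gives a different random data set with the same underlying drift and scatter.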
The first method is the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

This is the standard approach, used by calculators and spreadsheets any time users require sample sigma. The numerator is the sum of the squared deviations from the sample average: subtract the sample average from the first observation and square the result, do the same for the second observation, and so on, then add the results. If the data cluster close to the average, this sum will be smaller than if they are scattered more widely. Thus, a bigger value of s indicates a process with greater scatter. The denominator indicates the degrees of freedom; the $n - 1$ term is a bias correction. For a given sample size, the denominator is a constant, so estimates of s can be compared directly for different processes when sample sizes are the same. Any observed differences can be attributed to data scatter, which may (or may not) indicate different process scatter.

Shewhart showed that this traditional estimate of s is only valid when the process is stable. If a process is influenced by a special cause, then this estimate will overestimate the process scatter. For our example, the formula estimates
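To see how a special cause inflates the standard formula, here is a minimal sketch. The two small data sets are hypothetical values chosen for easy arithmetic, not the data in Table 1, and `sample_sigma` is an illustrative name.

```python
import statistics

def sample_sigma(data):
    """Standard sample standard deviation with the n - 1 denominator."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return (squared_deviations / (n - 1)) ** 0.5

# Hypothetical values: a stable process scattered around 10.
stable = [9.0, 11.0, 10.0, 9.0, 11.0, 10.0]
# Same scatter, but the last three values are shifted up by 3 (a special cause).
shifted = [9.0, 11.0, 10.0, 12.0, 14.0, 13.0]

print(round(sample_sigma(stable), 3))   # sqrt(4 / 5)
print(round(sample_sigma(shifted), 3))  # sqrt(17.5 / 5), noticeably larger

# The hand-written version agrees with the standard library.
print(abs(sample_sigma(shifted) - statistics.stdev(shifted)) < 1e-12)
```

The within-group scatter is identical in both lists, yet the estimate roughly doubles once the shift is mixed in.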
Using an estimator that doesn’t include the variation between time periods will alleviate the problem of s being inflated by special causes. Shewhart proposed using rational subgroups to do this. A rational subgroup is a sample selected in such a manner that the opportunity for a special cause to influence the results is minimized. This is often accomplished by selecting consecutive units from a process.

In Table 1, the data are arranged in 10 subgroups of five measurements per subgroup. The first group of five was sampled in hour No. 1, the next group in hour No. 2 and so forth. The table indicates no change in the process during the time each subgroup was collected, so it’s the ideal from the Shewhart perspective. With these clean subgroups, we can estimate the process dispersion for each subgroup, then combine the results to find the overall estimate of s. One way to estimate dispersion is to find the range, the difference between the largest and smallest values in a subgroup.
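A common way to turn subgroup ranges into an estimate of sigma is to divide the average range by the control chart constant d2. The sketch below assumes that convention. The subgroups are hypothetical values, not Table 1, and the d2 constants are the usual tabulated values for subgroup sizes 2 through 5.

```python
# d2 bias-correction constants from standard control chart tables,
# keyed by subgroup size.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}

def sigma_from_ranges(subgroups):
    """Estimate sigma as the average subgroup range divided by d2."""
    n = len(subgroups[0])
    ranges = [max(group) - min(group) for group in subgroups]
    r_bar = sum(ranges) / len(ranges)
    return r_bar / D2[n]

# Hypothetical subgroups: the average rises by 1 each hour,
# but the scatter inside each hour is the same.
hourly = [
    [9.0, 10.0, 11.0, 10.0, 10.0],
    [10.0, 11.0, 12.0, 11.0, 11.0],
    [11.0, 12.0, 13.0, 12.0, 12.0],
]
print(round(sigma_from_ranges(hourly), 3))  # 2 / 2.326
```

Because each range is computed inside a single hour, the hour-to-hour drift never enters the estimate.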
The range uses only two data values from each subgroup, which poses a problem. In statistical terms, it’s inefficient. That is, the estimates of s based on the range will be more erratic than when subgroup standard deviations are used.

Method 3 works by finding the standard deviation of each subgroup and combining the results.
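One common way to use every value in each subgroup is to average the subgroup standard deviations and divide by the constant c4. The sketch below assumes that is the calculation meant here. The data are hypothetical, and c4 is computed from its gamma-function definition for normally distributed data rather than read from a table.

```python
import math
import statistics

def c4(n):
    """Bias-correction constant for a sample standard deviation of size n."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def sigma_from_subgroup_stdevs(subgroups):
    """Estimate sigma as the average subgroup standard deviation divided by c4."""
    n = len(subgroups[0])
    s_values = [statistics.stdev(group) for group in subgroups]
    s_bar = sum(s_values) / len(s_values)
    return s_bar / c4(n)

# Hypothetical subgroups: the average rises by 1 each hour.
hourly = [
    [9.0, 10.0, 11.0, 10.0, 10.0],
    [10.0, 11.0, 12.0, 11.0, 11.0],
    [11.0, 12.0, 13.0, 12.0, 12.0],
]
print(round(c4(5), 4))  # tabulated value for n = 5 is 0.9400
print(round(sigma_from_subgroup_stdevs(hourly), 3))
```

As with the range approach, each standard deviation is computed within one hour, so the drift between hours does not inflate the result.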
If it isn’t possible or desirable to collect data in subgroups, we can correct for special causes by finding the range between consecutive hourly samples. To get
This method for estimating s is based on the average moving range, $\overline{MR}$. Doing this for the sample data set gives

Table 2 summarizes the results of all of these methods. The least accurate result is found when the standard formula is used. For SPC work, this formula should only be used when a control chart shows good statistical control. Although, for our example, the range-based result was slightly more accurate than the estimate based on subgroup standard deviations, the best formula from a statistical perspective is method No. 3. For subgroups of five, the advantage isn’t all that great, but it becomes greater as the subgroup size increases. The range method usually comes in second to the subgroup standard deviation method, unless the statistical advantage is outweighed by some practical concern, such as ease of understanding. The moving range methods, while inferior to the subgroup methods, are far better than the standard formula.
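The moving range calculation described above can be sketched as follows. This assumes the usual convention of dividing the average moving range by d2 = 1.128, the constant for ranges of two values. The hourly readings are hypothetical, not taken from Table 1.

```python
# d2 for a range of two consecutive values, from standard control chart tables.
D2_FOR_PAIRS = 1.128

def sigma_from_moving_ranges(values):
    """Estimate sigma from the average absolute difference of consecutive values."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mr_bar / D2_FOR_PAIRS

# Hypothetical individual readings, one per hour.
hourly_values = [10.0, 12.0, 11.0, 13.0, 12.0]
print(round(sigma_from_moving_ranges(hourly_values), 3))  # 1.5 / 1.128
```

Each moving range spans only one hour of drift, which is why this estimate is affected by a slow trend far less than the standard formula applied to the whole series.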
