What are data, and why do people care so much about data? Assuming the data are from a ratio or interval scale (scale data), data are merely numbers attached to things we are interested in. Normally we have no interest in the numbers per se. We are interested in making better decisions or choices, which is easier when we have more knowledge. The data are studied in an attempt to acquire this knowledge.
The path from data to knowledge isn’t direct. Sometimes we begin with a decision we must make. Should we train our employees? Adjust the machine? Buy this car or that one? If the decision is obvious and we already know what we need to know, we don’t need to study data and we make the decision directly. In other cases data may help us learn, making it more likely that we’ll reach a better decision. At other times we are putting a more general question to the data: how can I improve this process? Or, how can I improve quality?
In both situations we are asking the data to help us formulate testable hypotheses. A reasonable question is “Why can’t I just get my hypotheses directly from the data?” The answer lies in the limits of human cognition. While it’s true that the raw data do contain all the information, we simply don’t have the mental hardware necessary to extract it. It’s too complex. Humans can hold perhaps 7 to 10 pieces of information in their minds at one time. A data set of tens, hundreds or thousands of numbers simply overwhelms our mental mushware.
First we obtain the data, either by collecting them ourselves or by getting them from some data source. Then we
- QE 21, 4, 369 upper right paragraph. “In our model, hypotheses do not originate in the data, they originate in the confrontation between the data and the mental models that the inquirer entertains.”
- Raw data are too complex for human brains.
- Tables of statistical aggregates lose information that may be vital for EDA. Graphs and pictures are better.
- Salient feature: look for the unexpected. Unexpected implies some reference expectation/distribution.
- Once found, the salient feature can become the new expectation and we look for departures within it.
- Uniform and normal are common reference statistical distributions, but there are others too (e.g., exponential for time data).
- Assumptions (independence, randomness, etc.) are also reference distributions.
- Data “reveal” what we look for (or expect); graphs reveal the unexpected.
- Surprise is important (me).
- Hypotheses are not derived from facts, but invented to account for them.
- What if you see the patterns, but still don’t have a clue?
- Data torture (continue analyzing the data in the hope that something pops out).
- Learn more (study the subject, go and look, collect additional data, etc.).
- Cue acquisition.
- EDA doesn’t prove anything. It’s retrospective (you need prospective) and data-based (you need science). Don’t use EDA for the wrong thing.
- Classical SHIT is a hindrance to EDA and to learning.
- Blind to the obvious (EDA often leads to insights that in retrospect are trivially obvious). EDA can force the mind to see the obvious.
- The bucket principle: put the data into buckets, then compare the buckets. Display the data so you can compare the between-bucket variation to the within-bucket variation, e.g., with boxplots.
- Time is a bucket variable.
- Show individual data dots on x-bar charts to make within-group distributions obvious. (Or use boxplots with control limits?)
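
The bucket principle noted above can be put into numbers as well as pictures. The following is a minimal sketch, not a prescribed method: it assumes the data are already sorted into named buckets, and the function name `bucket_variation` and the shift measurements are hypothetical, invented only for illustration. It computes a between-bucket mean square and a pooled within-bucket mean square, the same two quantities a side-by-side boxplot lets the eye compare.

```python
from statistics import mean, variance


def bucket_variation(buckets):
    """Return (between, within) mean squares for a dict of bucket name -> values.

    Hypothetical helper for illustration; each bucket needs at least two values.
    """
    all_values = [v for values in buckets.values() for v in values]
    grand_mean = mean(all_values)
    k = len(buckets)      # number of buckets
    n = len(all_values)   # total number of observations

    # Between-bucket variation: how far the bucket means sit from the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in buckets.values()
    )
    ms_between = ss_between / (k - 1)

    # Within-bucket variation: pooled spread of the values inside each bucket.
    ss_within = sum(
        (len(values) - 1) * variance(values) for values in buckets.values()
    )
    ms_within = ss_within / (n - k)

    return ms_between, ms_within


# Made-up measurements, bucketed by shift (time is a bucket variable).
shifts = {
    "day": [10.1, 9.9, 10.0, 10.2],
    "evening": [10.0, 10.1, 9.8, 10.1],
    "night": [10.9, 11.1, 11.0, 10.8],
}

ms_between, ms_within = bucket_variation(shifts)
ratio = ms_between / ms_within
print(f"between = {ms_between:.3f}, within = {ms_within:.3f}, ratio = {ratio:.1f}")
```

A ratio far above 1 says the buckets differ by much more than the scatter inside them, which is the kind of salient feature worth a closer look. In keeping with the notes above, treat it as a cue for forming a hypothesis, not as proof of one.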