Wired Magazine has published a provocative article in this month’s issue, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. The author, Chris Anderson (who also wrote The Long Tail), argues that the growth of massive databases and faster computing has enabled conclusions to be derived from statistical analyses whose quality rivals the traditional scientific method of “hypothesize, model, and test”. More isn’t just more, he asserts: more is different.
The argument is built from techniques used in traditional data mining, where the variables in the data sets were examined to locate high correlations suggestive of causal relationships. The conclusions were unreliable, though, because the size of the data sets was insufficient to support the number of tests being done. Statisticians like to see about 30 cases of data for each relationship tested. So, for example, if there were 10 variables, there would be 45 unique pairwise correlation coefficients to calculate among the variables, requiring a data set of roughly 1,350 cases to yield a trustworthy outcome. Even so, some spurious correlations might be expected, noise which must be eliminated by treating the conclusions as further hypotheses, requiring still more confirming data collection.
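To make that arithmetic explicit, here is a small back-of-the-envelope sketch (my own, not Anderson's) of the rule of thumb: the number of unique pairwise correlations among n variables is n(n-1)/2, and multiplying by roughly 30 cases per test gives a minimum data set size.

```python
from math import comb

def cases_needed(n_variables, cases_per_test=30):
    """Rough sample-size estimate: about 30 cases for each pairwise
    correlation tested among n_variables (a rule of thumb, not a law)."""
    n_tests = comb(n_variables, 2)            # unique pairs of variables
    return n_tests, n_tests * cases_per_test

tests, cases = cases_needed(10)
print(f"{tests} pairwise correlations -> about {cases} cases needed")
# prints: 45 pairwise correlations -> about 1350 cases needed
```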
This problem is seen most clearly in the work done in bioinformatics, where long sequences of genetic code are compared with one another to measure how similar they are. Which of the sequences at the right, for example, are most similar? What criteria should be used to decide? If, for example, the sequences differ in less than 1% of genes, are they similar enough to create identical organisms? If they differ in more than 5%, are they different enough to explain an inherited disease? Is the data itself reliable enough in all of its particulars to have confidence in the universality of any derived conclusion without further experiment?
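As a concrete, deliberately simplified illustration of the kind of measurement involved, the sketch below computes the percentage of positions at which two already-aligned sequences differ. The sequences are invented for the example; real comparisons involve alignment, gaps, and scoring schemes far beyond this.

```python
def percent_difference(seq_a, seq_b):
    """Fraction of positions that differ between two equal-length,
    already-aligned sequences (a crude identity measure; real tools
    handle gaps, alignment, and scoring matrices)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)

# Toy sequences, for illustration only
s1 = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
s2 = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
s3 = "ATGGCCATTGTCATGGGCCGCTGAAAGGGTACCCGA"
print(percent_difference(s1, s2))  # 0.0   -- identical
print(percent_difference(s1, s3))  # ~5.6  -- two mismatches out of 36 positions
```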
Chris believes that the sheer size of the databases overcomes both of these objections. There is sufficient data to overwhelm any random errors that it might contain, and to support very accurate correlation measurements among large numbers of variables. Still, I don’t think that it’s enough to substitute for traditional science.
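To see why scale helps with the first objection, here is a quick simulation of my own (not from the article): among variables that are pure, independent noise, the largest correlation you stumble across shrinks steadily as the number of cases grows.

```python
import numpy as np

def max_spurious_correlation(n_cases, n_vars=10, seed=0):
    """Largest absolute correlation found among n_vars independent
    noise variables -- purely spurious, since there is nothing to find."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_cases, n_vars))
    corr = np.corrcoef(data, rowvar=False)
    off_diagonal = corr[~np.eye(n_vars, dtype=bool)]
    return float(np.abs(off_diagonal).max())

for n in (30, 300, 30_000):
    print(n, round(max_spurious_correlation(n), 3))
# the largest "discovered" correlation shrinks as the number of cases grows
```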
My concern is that mathematical operations on data sets are deductive: they can't draw any conclusions that aren't already contained in the original data set. Science often leaps forward by making inductive hypotheses: ones that put a new interpretation on the data and motivate further experiments. Induction is often done through association, drawing together analogous facts from unrelated domains to generate new hypotheses. While these cognitive operations might be simulated in future computer systems, artificial intelligence techniques have not advanced to that level.
Large databases and fast computers have tremendous value in facilitating the discovery of relationships in data and in testing inductive hypotheses. In my work, they act as mediators, helping me reach conclusions faster (and occasionally getting me lost over a wider area). I don't think they can stand alone as substitutes for the symbiotic interaction between mind and nature that lies at the center of the scientific process.
…there are copious data available, effective tools for retrieving what is necessary to bring to bear on a specific question, and powerful analytic tools. None of this replaces the need for thoughtful scientific judgement.
Lesk, A.M. (2008) Introduction to Bioinformatics, 3rd ed., p. 31.
Photo credit: ProCKSI, Protein (Structure) Comparison, Knowledge, Similarity and Information