While I was out of town, Wired ran a long article about Google cofounder Sergey Brin and his quest to find the genetic causes of Parkinson's disease. There is much of interest here. The piece gives an account of present-day genomic research from a unique point of view.
Brin is a smart person, with a family history of Parkinson's and knowledge that he carries a risk allele. So he is directing a lot of money and attention toward new ways of approaching gene-disease associations. He is one of the major financial backers of the direct-to-consumer genomics company 23andMe, and the husband of its cofounder, Anne Wojcicki. Google, of course, has prospered by making unconventional uses of data. That's an approach that many are starting to apply to science:
Increasingly, though, scientists—especially those with a background in computing and information theory—are starting to wonder if that model could be inverted. Why not start with tons of data, a deluge of information, and then wade in, searching for patterns and correlations?
This is what Jim Gray, the late Microsoft researcher and computer scientist, called the fourth paradigm of science, the inevitable evolution away from hypothesis and toward patterns. Gray predicted that an “exaflood” of data would overwhelm scientists in all disciplines, unless they reconceived their notion of the scientific process and applied massive computing tools to engage with the data. “The world of science has changed,” Gray said in a 2007 speech—from now on, the data would come first.
I think that "fourth paradigm" probably overdignifies the approach, which looks like a regression to a naive positivism. As described in the book The Fourth Paradigm: Data-Intensive Scientific Discovery, the idea is rather more than that: a unification of theory with massive amounts of data. Data really do speak for themselves, say the Fourth Paradigmers, but they speak quietly, with a lot of noise drowning them out. So if you collect vast amounts of data, you have a chance to sort the whispers of real associations out from all the junk.
The article gives a vivid example:
Langston offers a case in point. Last October, the New England Journal of Medicine published the results of a massive worldwide study that explored a possible association between people with Gaucher’s disease—a genetic condition in which fatty substances build up in the internal organs—and a risk for Parkinson’s. The study, run under the auspices of the National Institutes of Health, hewed to the highest standards and involved considerable resources and time. After years of work, it concluded that people with Parkinson’s were five times more likely to carry a Gaucher mutation.
Langston decided to see whether the 23andMe Research Initiative might be able to shed some insight on the correlation, so he rang up 23andMe’s Eriksson, and asked him to run a search. In a few minutes, Eriksson was able to identify 350 people who had the mutation responsible for Gaucher’s. A few clicks more and he was able to calculate that they were five times more likely to have Parkinson’s disease, a result practically identical to the NEJM study. All told, it took about 20 minutes. “It would’ve taken years to learn that in traditional epidemiology,” Langston says. “Even though we’re in the Wright brothers early days with this stuff, to get a result so strongly and so quickly is remarkable.”
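The arithmetic behind that 20-minute replication is just a relative risk: compare the rate of Parkinson's among mutation carriers to the rate among non-carriers. A minimal sketch, using the 350 carriers mentioned in the article but entirely made-up case counts chosen only to illustrate a roughly fivefold risk:

```python
# Hypothetical counts, not the actual 23andMe data: Parkinson's
# prevalence in Gaucher-mutation carriers vs. non-carriers.
carriers = {"parkinsons": 25, "total": 350}        # carriers of the mutation
noncarriers = {"parkinsons": 700, "total": 49650}  # rest of the cohort

def prevalence(group):
    """Fraction of the group with a Parkinson's diagnosis."""
    return group["parkinsons"] / group["total"]

relative_risk = prevalence(carriers) / prevalence(noncarriers)
print(f"relative risk = {relative_risk:.1f}")  # about 5.1 with these invented numbers
```

The point is that once genotypes and self-reported conditions sit in one queryable database, this calculation is a lookup and a division, not a multi-year recruitment effort.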
But there are a few stumbling blocks. The associations that remain undiscovered are relatively weak. Most phenotypes are polygenic. And with heritabilities less than one, there remain unknown environmental causes of most phenotypes, which are not captured in genetic data and which may interact with different genes.
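To see why heritability below one caps what genotype data alone can deliver, here is a toy simulation. Everything in it is an illustrative assumption, not an estimate from any study: a purely additive trait built from 100 small-effect loci, with environmental noise scaled to a heritability of 0.5.

```python
import random

random.seed(1)

N_LOCI = 100  # many small-effect loci (polygenic)
H2 = 0.5      # assumed narrow-sense heritability

def genotype():
    # 0, 1, or 2 copies of the trait-increasing allele at each locus
    return [random.randint(0, 1) + random.randint(0, 1) for _ in range(N_LOCI)]

people = [genotype() for _ in range(2000)]
genetic = [sum(g) for g in people]  # additive genetic value per person

# Scale environmental noise so genetic variance is H2 of the total variance.
mean_g = sum(genetic) / len(genetic)
var_g = sum((g - mean_g) ** 2 for g in genetic) / len(genetic)
env_sd = (var_g * (1 - H2) / H2) ** 0.5
phenotype = [g + random.gauss(0, env_sd) for g in genetic]

# The genotype-phenotype correlation comes out near sqrt(H2), not 1:
mean_p = sum(phenotype) / len(phenotype)
cov = sum((g - mean_g) * (p - mean_p)
          for g, p in zip(genetic, phenotype)) / len(phenotype)
var_p = sum((p - mean_p) ** 2 for p in phenotype) / len(phenotype)
r = cov / (var_g * var_p) ** 0.5
print(f"genotype-phenotype correlation = {r:.2f}, sqrt(H2) = {H2 ** 0.5:.2f}")
```

Even with every causal locus measured perfectly, the genotype predicts the phenotype only up to the square root of the heritability; the rest is environment, which a genetic database never sees.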
At present, I don't think anyone in genetics is really operating on a "Fourth Paradigm" level. The massive datasets are building, but few authors are working with existing population genetic theory in ways that would enhance the pattern-matching exercise. If you look through papers describing genome-wide association studies, there are a lot of bivariate statistics, and some multivariate descriptive statistics (like principal components analysis). About the only theory involved is case-control statistical design. Look through a paper on genetic variation and you're likely to see a STRUCTURE analysis and some coalescent simulations.
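The bivariate statistics in these papers are mostly about this simple: a 2x2 test of allele counts in cases versus controls. A sketch with invented counts, just to show the shape of the machinery:

```python
# A minimal case-control association test: Pearson chi-square on a
# 2x2 table of allele counts. All counts are illustrative, not real data.
cases    = {"risk_allele": 620, "other_allele": 1380}  # alleles among cases
controls = {"risk_allele": 500, "other_allele": 1500}  # alleles among controls

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi_square_2x2(cases["risk_allele"], cases["other_allele"],
                      controls["risk_allele"], controls["other_allele"])
print(f"chi-square = {chi2:.1f}")  # compare to 3.84, the p < 0.05 cutoff at 1 df
```

Run at hundreds of thousands of markers with a multiple-testing correction, that is most of a genome-wide association scan; there is no population genetic model in it at all.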
The article puts this well:
"We have no grand unified theory," says Nicholas Eriksson, a 23andMe scientist. "We have a lot of data."
Today's genetic datasets are a huge contrast with those of the past, both in sample size and in coverage. There is a lot of new low-hanging fruit. In the future, the easy findings will be exhausted and theory will become more and more important. It remains unclear to me how much progress on health can be made by pattern-matching alone, and how much will require new theoretical advances. Given the problems explaining heritability so far, we may need new theory sooner rather than later.
Genetic data are slowly being joined by environmental data of various kinds. The article contextualizes the study of environmental variables by telling the story of the initial discovery and long use of aspirin, followed by the slow realization that long-term use has health effects of its own.
The second coming of aspirin is considered one of the triumphs of contemporary medical research. But to Brin, who spoke of the drug in a talk at the Parkinson’s Institute last August, the story offers a different sort of lesson—one drawn from that period after the drug was introduced but before the link to heart disease was established. During those decades, Brin notes, surely “many millions or hundreds of millions of people who took aspirin had a variety of subsequent health benefits.” But the association with aspirin was overlooked, because nobody was watching the patients. “All that data was lost,” Brin said.
The answer is simple: collect all the data and see what percolates out. Heck, Google probably already has enough data about everybody from their web searches, if only it could connect those to the 23andMe database. Given a few years of someone's web searches, I wonder just how much you could infer about their other phenotypes.
Remember (as the article points out), Google is the company that can predict flu outbreaks faster than the CDC.
Still, aside from the obvious technological progress, I'm a little more sober about the prospects of making rapid health improvements. Consider:
This approach—huge data sets and open questions—isn’t unknown in traditional epidemiology. Some of the greatest insights in medicine have emerged from enormous prospective projects like the Framingham Heart Study, which has followed 15,000 citizens of one Massachusetts town for more than 60 years, learning about everything from smoking risks to cholesterol to happiness. Since 1976, the Nurses Health Study has tracked more than 120,000 women, uncovering risks for cancer and heart disease. These studies were—and remain—rigorous, productive, fascinating, even lifesaving. They also take decades and demand hundreds of millions of dollars and hundreds of researchers. The 23andMe Parkinson’s community, by contrast, requires fewer resources and demands far less manpower. Yet it has the potential to yield just as much insight as a Framingham or a Nurses Health. It automates science, making it something that just … happens. To that end, later this month 23andMe will publish several new associations that arose out of their main database, which now includes 50,000 individuals, that hint at the power of this new scientific method.
Today's sequencing techniques make it much cheaper to do some things that used to be very expensive. But we've done a lot of gold-plated medical studies, and have more coming soon. The most important barrier to progress is not the lack of money; it is the difficulty of altering biological systems without adverse complications.