john hawks weblog

paleoanthropology, genetics and evolution

computing

  • Stuxnet story

    Thu, 2011-03-03 23:29 -- John Hawks

    Fascinating detective story in Vanity Fair about how computer security researchers ferreted out the workings of the Stuxnet worm. I especially enjoyed the psychological sketch of one of the main figures in the story, who works in industrial control software:

    “If I did not have the background that I had, I don’t think I would have had the guts to say what I said about Stuxnet,” Langner says now, finishing his second glass of wine during lunch at a Viennese restaurant in Hamburg. Langner studied psychology and artificial intelligence at the Free University of Berlin. He fell into control systems by accident and found that he loved the fiendishly painstaking work. Every control system is like a bespoke suit made from one-of-a-kind custom fabric—tailored precisely for the conditions of that industrial installation and no other. In a profession whose members have a reputation for being unable to wear matching socks, Langner is a bona fide dandy. “My preference is for Dolce & Gabbana shoes,” he says. “Did you notice, yesterday I wore ostrich?” Langner loves the attention that his theories have gotten. He is waiting, he says, for “an American chick,” preferably a blonde, and preferably from California, to notice his blog and ask him out.

  • The problems of computer-aided biologists, 1

    Wed, 2010-03-17 18:53 -- John Hawks

    On the subject of modeling in genetics, John Timmer of Ars Technica has been running an excellent series on the challenges of computer models in biology. I'll devote a few words to some of these articles in the next several days.

    An article from earlier this winter, "Keeping computers from ending science's reproducibility," discusses the problems with replicability. Data from genomes and genotyping platforms go through frequent revisions, so that the same methods may lead to different results depending on the version of the dataset. Not replicable, in other words, and it may be very hard to track down exactly why slight differences in results persist. It's also hard to verify that the methods are working the same way when the same results aren't found -- this is not like the familiar problem of significant digits in measurement.

    That problem is compounded when it comes to analytical methods:

    An analysis pipeline may involve dozens of specialized software tools chained together in series, each with a number of parameters that need to be documented for their output to be reproduced. Like the data, some of these tools are proprietary, and many of them undergo frequent revisions that add new features, change algorithms, and so on. Some of them may be developed in-house, where commenting and version control often take a back seat to simply getting software that works. Finally, even the best commercial software has bugs.

    "Getting it to work" is too often the major goal in human genetics, where in-house development of population history models is the norm. Rigorous validation of these models is beyond any single lab's purview; to be published, it is enough to cite prior art.

    The end of the article includes some reporting on possible solutions, including this:

    Even if we solve the legal and computational portions of the problem, however, we're going to run into issues with the fact that many of the people who use computational tools understand what they do, but don't feel compelled to learn the math behind them. That's where a paper in the latest edition of Science comes in. Its author, Jill Mesirov of the Broad Institute, describes how many biologists aren't well versed in computational analysis, but are increasingly reliant on tools created by those who are; she then goes on to describe one type of solution, called GenePattern, that she and her colleagues put together with the help of Microsoft Research.

    The idea is to "embed" the actual bioinformatic research methods into the paper, as one would embed a spreadsheet into a Word document. That way, anyone who reads the paper could just run an active version of the methods, to verify the results were accurate, and (potentially) play with the parameters.

    Not a bad idea for the toy example, but for simulations that take days or more to run, it isn't going to be practical. What we need is people to learn the math, not people to dumbly click buttons in a paper.

    The specific idea of an interactive workflow is implemented fairly well in the Galaxy bioinformatics platform. There are definite strengths to that approach -- most importantly, for simple operations it can be incredibly useful to have a running record of what you've done, so that you can get it again yourself. But an equivalent record can fairly easily be accomplished using Python, Perl or any other scripting language. A risk of an online system is that it runs into the versioning problem very quickly -- interactive downloads may bring inconsistent datasets that use different genome draft assemblies, for example.
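
    As a rough illustration of keeping that kind of running record in a plain script, here is a minimal sketch. The function name, the parameter names, and the tiny example dataset are all hypothetical; the point is only that the assembly version, a checksum of the input data, and the parameter settings get written down together with the results.

```python
import hashlib
import json
import platform


def provenance_record(params, data_bytes, assembly):
    """Bundle what someone would need to know to rerun an analysis."""
    return {
        "python_version": platform.python_version(),
        "genome_assembly": assembly,  # which draft assembly the data use
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "parameters": params,
    }


# Hypothetical inputs, for illustration only.
example_data = b"chr1\t12345\tA\tG\n"
record = provenance_record(
    params={"window_size": 50000, "min_allele_freq": 0.05},
    data_bytes=example_data,
    assembly="hg18",
)

# Writing this next to the output gives a record you can check later.
print(json.dumps(record, indent=2, sort_keys=True))
```

    If the dataset is revised, the checksum changes, which at least tells you that a difference in results may come from the data rather than the method.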

    In any event, a little math can spare us much of this pain in many cases. We should make it a priority to get students a common-sense understanding of how genetic parameters relate to each other.

    UPDATE (2010-03-18): Another section of the article is worth discussion. Along the lines of my post from earlier this year regarding the importance of code sharing and transparency ("The bugs will out"), Timmer wrote:

    "You need the code to see what was done," [Victoria Stodden] told Ars. "The myriad computational steps taken to achieve the results are essentially unguessable—parameter settings, function invocation sequences—so the standard for revealing it needs to be raised to that of when the science was, say, lab-based experiment." This sort of openness is also in keeping with the scientific standards for sharing of more traditional materials and results. "It adheres to the scientific norm of transparency but also to the core practice of building on each other's work in scientific research," she said. But the same worries that apply to more traditional data sharing—researchers may have a competitor use that data to publish first—also apply here. In the slides from her talk, she notes that a survey she conducted of computational scientists indicates that many are concerned about attribution and the potential loss of publications in addition to legal issues. (The biggest worry is the effort involved to clean up and document existing code.)

    A lot of the code we use is really rather simple. The coalescent can be implemented in a few lines, and most common alterations of it can be handled with 10-line subroutines. A forward-time simulation can be done in a single line of Python, and again the common alterations don't take too much to implement.
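
    To give a sense of scale, here is a minimal sketch of both kinds of model. It is not anyone's production code: it assumes the simplest textbook cases (a constant-size population, no selection, no recombination), and the function names are mine. The coalescent function draws the time to the most recent common ancestor of a sample; the forward-time function follows one allele under drift, and its per-generation sampling step really is a single line.

```python
import random


def coalescent_tmrca(n, rng):
    """Time to the most recent common ancestor of n sampled lineages.

    Time is in units of 2N generations (constant population size).
    While k lineages remain, the waiting time to the next coalescence
    is exponential with rate k*(k-1)/2.
    """
    t = 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)
    return t


def wright_fisher(p0, two_n, generations, rng):
    """Forward-time neutral drift: allele frequency in each generation."""
    count = round(p0 * two_n)
    freqs = [count / two_n]
    for _ in range(generations):
        p = count / two_n
        # Binomial sampling of 2N gene copies from the previous generation.
        count = sum(rng.random() < p for _ in range(two_n))
        freqs.append(count / two_n)
    return freqs


rng = random.Random(1)
replicates = 20000
mean_tmrca = sum(coalescent_tmrca(10, rng) for _ in range(replicates)) / replicates
print(mean_tmrca)  # theory: the expectation is 2 * (1 - 1/10) = 1.8

trajectory = wright_fisher(0.5, 200, 50, rng)
print(trajectory[-1])
```

    Comparing the simulated mean against the known expectation is the kind of sanity check that is cheap to do when the code is this short.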

    There are radically more complicated models in use, and we should direct more attention to making these human-readable, separating modular elements so that they can be run with different simulation engines, and making clear distinctions between functional code, parameters, and data. I've been doing this long enough to know how easy it can be to hard-wire your parameters into the code, undocumented, so that nobody can figure out what is going on but the author. That's not where you want to be.
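
    One way to draw that distinction, sketched here under my own assumptions (the class name, field names, and JSON layout are all hypothetical), is to keep every tunable value in a small declared structure that is loaded from a file, so the functional code never contains a bare number.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DriftParams:
    """All tunable values live here, not inside the model code."""
    population_size: int      # number of diploid individuals (N)
    initial_frequency: float  # starting allele frequency
    generations: int


def load_params(text):
    """Parameters arrive as data (JSON here), so they can be archived."""
    return DriftParams(**json.loads(text))


def expected_heterozygosity(params):
    """Functional code: depends only on the parameters it is handed.

    Under neutral drift, expected heterozygosity decays by a factor of
    (1 - 1/(2N)) per generation.
    """
    h0 = 2 * params.initial_frequency * (1 - params.initial_frequency)
    decay = 1 - 1 / (2 * params.population_size)
    return h0 * decay ** params.generations


# Hypothetical contents of a parameter file.
config_text = '{"population_size": 1000, "initial_frequency": 0.5, "generations": 100}'
params = load_params(config_text)
print(asdict(params))
print(expected_heterozygosity(params))
```

    The parameter file can then be published alongside the results, and the same function can be rerun with different settings without editing the code.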

  • Will Wolfram make bioinformatics obsolete?

    Tue, 2009-03-17 12:33 -- John Hawks

    I was talking with a scientist last week who is in charge of a massive dataset. He told me he had heard complaints from many of his biologist friends that today's students are trained to be computer scientists, not biologists. Why, he asked, would we want to do that when the amount of data we handle is so trivial?

    Now, you have to understand, to this person a dataset of 1000 whole genomes is trivial. He said, don't these students understand that in a few years all the software they wrote to handle these data will be obsolete? They certainly aren't solving interesting problems in computer science, and in a short time, they won't be able to solve interesting problems in biology.

    I said, well, yeah. I've been through this once already -- fifteen years ago, the hot thing was setting up a wet lab for sequencing -- or worse, RFLP. That sure looked like a lot of data at the time, and a lot of students spent a lot of time figuring out how to do it. Some of them successfully started careers, got grants, and moved on with the times. Others fell by the wayside. Meanwhile, clusters of people at the DOE, Whitehead Institute, Wellcome Trust and several private companies were spending their time figuring out faster and faster ways of automating sequencing. Now one machine can do the work of ten thousand 1990s graduate students.

    Anyway, I've been thinking about that conversation. And then I ran across an article by Nova Spivack, describing the new Wolfram Alpha.

    Stephen Wolfram is building something new -- and it is really impressive and significant. In fact it may be as important for the Web (and the world) as Google, but for a different purpose. It's not a "Google killer" -- it does something different. It's an "answer engine" rather than a search engine.

    ...

    Wolfram Alpha is a system for computing the answers to questions. To accomplish this it uses built-in models of fields of knowledge, complete with data and algorithms, that represent real-world knowledge.

    For example, it contains formal models of much of what we know about science -- massive amounts of data about various physical laws and properties, as well as data about the physical world.

    Based on this you can ask it scientific questions and it can compute the answers for you. Even if it has not been programmed explicitly to answer each question you might ask it.

    This sounds very pie-in-the-sky. And indeed, commenters on the article (as well as this article by Cycorp head Doug Lenat) come up with lots of questions that would be impossible for such a system to answer.

    But I'm not really interested in the things that will stump the system. Compared to restaurant reviews and kinship systems, bioinformatics is pretty simple. Right now, there are two things that make it a multi-year effort to learn: mutually incompatible databases, and the various kludges necessary to model ascertainment bias.

    I'm a Mathematica user, and am familiar with its theorem-proving capabilities. Mathematica already has genome lookup utilities, which I use quite often -- it's just easier to do a lookup on my own system than to plow through two or three webpages to get to the query. It really wouldn't take that much to bring intelligent and interactive genome analysis into the system.

    Alpha could turn into an online robot armed with basic genetics knowledge. And if not Alpha -- genetics is a logical priority for Wolfram, but it may not be the first or primary one -- certainly some other system using similar technology will emerge. Put it to work on public databases of genetic information, and you have a system that can resolve the incompatibilities by adding semantic knowledge. A bit of effort on existing databases would allow the resolution of discrepancies in ascertainment. Or, more likely, another couple of years of whole-genome sequencing will solve most ascertainment biases by drowning them in new data.

    So it's not a stretch for me to imagine a year from now entering this search query:

    "List all human genes with significant evidence of positive selection since the human-chimpanzee common ancestor, where either the GO category or OMIM entry includes 'muscle'"

    It seems to me that bioinformatics is what generates the output to that query. What you do with the output of that query is evolutionary biology.
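
    Stripped of the database wrangling, the bioinformatics half of that query is a filter. The sketch below assumes the hard part is already done, namely that selection test results, GO categories, and OMIM text have been merged into one record per gene. The record layout, the placeholder gene IDs, and the p-values are invented for illustration and are not real results.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    gene_id: str
    selection_p_value: float  # from some hypothetical test for positive selection
    go_terms: list
    omim_text: str


def matches_query(rec, keyword="muscle", alpha=0.05):
    """Significant selection signal AND keyword in GO category or OMIM entry."""
    significant = rec.selection_p_value < alpha
    in_go = any(keyword in term.lower() for term in rec.go_terms)
    in_omim = keyword in rec.omim_text.lower()
    return significant and (in_go or in_omim)


# Placeholder records: gene IDs and p-values are made up.
records = [
    GeneRecord("GENE_A", 0.001, ["muscle contraction"], ""),
    GeneRecord("GENE_B", 0.30, ["muscle organ development"], ""),
    GeneRecord("GENE_C", 0.01, ["lipid transport"], "Progressive skeletal muscle weakness"),
    GeneRecord("GENE_D", 0.002, ["DNA repair"], ""),
]

hits = [rec.gene_id for rec in records if matches_query(rec)]
print(hits)  # ['GENE_A', 'GENE_C']
```

    Everything interesting happens before this step (building consistent records) and after it (deciding what the list means).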

    So that raises the obvious question. Tomorrow's high-throughput plain-English bioinformatics tool will do the work of ten thousand 2009 graduate students. If a freely-available (or heck, even a paid) service can do the bioinformatics, what should today's graduate students be learning?

    UPDATE (2009-03-19):

    Some folks have interesting reactions to this post, including Thomas Mailund and Dan MacArthur. They make good points.

    I will add that I'm not arguing against modeling or simulation in biology. There are lots of interesting things in evolutionary biology you can do -- must do, in all practical terms -- with computers. But I don't like the five-year degree program in genetics where only one semester is given to population genetics, and most of the student's time is spent learning scripting, doing data entry, and figuring out ten or twelve database formats.

    I come back to my first example -- fifteen years ago, people were telling you how essential and wonderful sequencing would always be. If you're pursuing a five-year degree program and two or three years of postdoc, I hope you're thinking about what skills you'll need fifteen years from now.

