The ENCODE project and function in the human genome

I wanted to find out more about today's publication of the ENCODE catalog and data, and so I turned right away to lead bioinformatician Ewan Birney, who has an excellent blog post about it: "ENCODE: My own thoughts".

I recommend the whole thing, which is written in an extended Q-and-A format like the one I often use. The most interesting part for people reading science news stories will probably be his discussion of the claim that a very large proportion of the genome (up to 80%) is functional. Birney's comments put that number into context:

Q. So remind me which one do you think is functional?
A. Back to that word functional: There is no easy answer to this. In ENCODE we present this hierarchy of assays with cumulative coverage percentages, ending up with 80%. As I've pointed out in presentations, you shouldn't be surprised by the 80% figure. After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.
However, on the other end of the scale -- using very strict, classical definitions of functional like bound motifs and DNaseI footprints; places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases -- we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as functional by intuition) that number goes up to 9%. Given what most people thought earlier this decade, that the regulatory elements might account for perhaps a similar amount of bases as exons, this is surprisingly high for many people -- certainly it was to me!

Even at 8%, the amount of potential regulatory activity in the genome is very large, and this should factor into the way we study recent human evolution. Birney discusses purifying ("negative") selection as one criterion for identifying functional DNA, but of course substantial functional variation might emerge under random genetic drift of such elements in human populations.

Also, he writes about the process of inventing a new kind of publication -- "threads" -- which highlight related tracks across a large set of publications. With 30 papers in the current ENCODE publication release, in multiple journals, tracking a single subject would be complicated for anyone. So they tried to help out:

Threads offer an alternative, lighting up a path through the assembled papers, pointing out the figures and paragraphs most relevant to any of 13 topics and taking you all the way through to the original data. The threads are there to help you discover more about the science we've done, and about the ENCODE data. Interestingly, this is something that's only achievable in the digital form, and for the first time I found myself being far more interested in how the digital components work than in the print components.

The post has a lot of interesting background information about the ENCODE project, the process of coordinating a project with hundreds of scientists, and the conflicts that arose between ENCODE and groups targeting smaller, narrower subjects related to DNA function.

UPDATE (2012-09-05): Dan MacArthur has further thoughts about the influence of the publication model in the paper, with its innovative threaded e-structure and the inclusion of a virtual machine which archives many of the computational approaches: "The ENCODE project: lessons for scientific publication". But he adds an additional note related to openness:

At the same time, it is worth noting the constraints that the standard embargo model of scientific publication have still imposed on the project. Much of the ENCODE data was mature and ready for use 12 months ago, and for those in the know has been a valuable component of functional annotation pipelines. Many of us in the genomics community were aware of the progress the project had been making via conference presentations and hallway conversations with participants. However, many other researchers who might have benefited from early access to the ENCODE data simply weren't aware of its existence until today's dramatic announcement -- and as a result, these people are 6-12 months behind in their analyses.

Even though the ENCODE project followed very open data release policies, we still have a long way to go in spreading information rapidly enough to make a difference to researchers outside these big projects.