john hawks weblog

paleoanthropology, genetics and evolution

sequencing

  • The Mayflower criminal registry

    Fri, 2012-01-13 22:25 -- John Hawks

    Of some interest with respect to DNA databases and privacy concerns: "DNA links 1991 killing to Colonial-era family".

    The DNA sample was taken in the death of 16-year-old Sarah Yarborough, who was killed on her high school campus in Federal Way, Washington, in December 1991. The King County Sheriff's Office has circulated two composite sketches of a possible suspect -- a man in his 20s at the time with shoulder-length blonde or light brown hair -- but had been unable to put a name to the sketch.

    In December, though, the department sent the DNA profile to California-based forensic consultant Colleen Fitzpatrick. Fitzpatrick compared the profile to others in genealogy databases and found the closest match was to the family of Robert Fuller, who settled in Salem, Massachusetts, in 1630 and had relatives who came over before him on the Mayflower.

    This is a Y chromosome match based on the genealogical research of people who may be completely unknown to the "suspect". Fitzpatrick offers that a Y-chromosome match may be expected to share a surname, which is probative in the forensic situation. Obviously there are many possible scenarios in which such information will not lead to discovery of a suspect: the chance of non-acknowledged paternity events across 200 years is very high. I don't view the result as strongly actionable, but I do think it raises important questions about the future of genealogy databases.

    We are near the time when whole-genome sequencing will make this kind of identification much more likely because unique genetic matches to 3rd and 4th degree relatives will be plausible. Finding a handful of rare mutations shared between a crime scene sample and an individual in a whole-genome database would be a strong indication of a relationship. It's possible that the databases for whole genomes will grow faster than the technology will allow reliable whole-genome sequencing from a crime scene sample. So in this case, the issues with database use may be primary.

    It would be an interesting exercise to estimate the fraction of unknown samples from crime scene Y chromosome and mtDNA that could be matched to a 10th-degree relative in the Genographic (or any other large) dataset.

  • Sequencing is outpacing computing

    Wed, 2011-11-30 23:36 -- John Hawks

    The New York Times notices DNA sequencing's Malthusian trap: "DNA sequencing caught in deluge of data."

    That is a decline [in sequencing costs] by a factor of more than 800 over four years. By contrast, computing costs would have dropped by perhaps a factor of four in that time span.

    The lower cost, along with increasing speed, has led to a huge increase in how much sequencing data is being produced. World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high, according to Michael Schatz, assistant professor of quantitative biology at the Cold Spring Harbor Laboratory on Long Island.
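
    As a rough illustration of the gap described in the quoted passage, the two declines can be put on the same footing by converting each into a halving time. This is a minimal sketch that assumes a smooth exponential decline over the four years, which the article does not claim; the function name `halving_time_years` is made up for the example.

```python
import math

def halving_time_years(fold_decline: float, span_years: float) -> float:
    """Years for a cost to halve, assuming a smooth exponential decline."""
    return span_years * math.log(2) / math.log(fold_decline)

# Fold declines and time span as quoted from the Times article.
sequencing = halving_time_years(800, 4)
computing = halving_time_years(4, 4)

print(f"sequencing cost halves every {sequencing * 12:.1f} months")
print(f"computing cost halves every {computing * 12:.1f} months")
```

    Under that assumption, sequencing cost halves roughly every five months while computing cost halves every two years, which is the mismatch the article is pointing at.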

    I have spoken with several scientists in other fields, like astronomy and particle physics, who deal with truly big datasets. Until now, biology data has actually been pretty small potatoes compared with the sheer amount pumped out by large projects in other fields. But that's changing. The Times article points out a unique aspect of the data problem in genetics: There are now thousands of labs that can generate large datasets, many of which have no special plan for data archiving or availability.

    “Google has enough capacity to do all of genomics in a day,” said Dr. Schatz of Cold Spring Harbor, who is trying to apply Google’s techniques to genomics data. Prodded by Senator Charles E. Schumer, Democrat of New York, Google is exploring cooperation with Cold Spring Harbor.

    Google’s venture capital arm recently invested in DNAnexus, a bioinformatics company. DNAnexus and Google plan to host their own copy of the federal sequence archive that had once looked as if it might be closed.

    I don't see Google as a deus ex machina for this one -- although I do observe that several other big data projects are sponsored by large Microsoft investors or founders.

  • Sequence the old, fast

    Wed, 2011-10-26 10:13 -- John Hawks

    The Archon Genomics X Prize is a $10 million contest to see what company or organization can develop a low-cost accurate sequencing technology. The AP's Malcolm Ritter reports that the testbed sequences will be 100 centenarians ("Secrets of long life sought in DNA of the elderly"), which is a pretty interesting test cohort.

    Protective features of a centenarian's DNA can even overcome less-than-ideal lifestyles, says Dr. Nir Barzilai of the Albert Einstein College of Medicine in New York. His own study of how centenarians live found that "as a group, they haven't done the right things."

    Many in the group he studied were obese or overweight. Many were smokers, and few exercised or followed a vegetarian diet. His oldest participant, who died this month just short of her 110th birthday, smoked for 95 years.

    "She had genes that protected her against the environment," Barzilai said. One of her sisters died at 102, and one of her brothers is 105 and still manages a hedge fund.

    I doubt they'll be able to explain much of the variance in longevity with 100 genomes, but they'll surely find some things that make a small difference and will lead to a newsworthy outcome. Larger samples will find more of the genetic pathways that influence lifespan, as will adding a wider range of elderly samples from other populations.

  • Exome sequencing as a stopgap

    Fri, 2011-10-14 12:09 -- John Hawks

    The new Genome Biology has a perspective piece by Jacob Tennessen and colleagues, titled "The promise and limitations of population exomics for human evolution studies" [1]. Exomics is the study of the coding part of the genome, which is only 30 megabases as opposed to the 3 gigabases of a whole genome. Today it is possible to apply methods that sequence only the protein-coding parts of the genome, by combining methods that capture such regions with next-generation sequencing. The result is vastly cheaper than a whole genome, and some of this cost savings can be applied to increase the coverage, which increases the sequence accuracy.

    Tossing away 99% of the genome is not an ideal sampling strategy for many purposes. However, when it comes to phenotype prediction, we can make some predictions about how changes in amino acid sequences will affect protein function. Many important phenotypic changes are caused by non-coding variations in gene regulation, but genetics has not yet reached a state of knowledge where these can be readily predicted. So, if we're sequencing people's genomes for the purposes of finding disease or phenotype variants, exome sequences give much of the information that we can presently evaluate.

    James Hadfield noted the spree of exome sequencing publications at his blog, Core Genomics ("Exome capture comparison publication splurge"). He tags the rationale for

    A lot of people I have talked to are now looking at screening pipelines which use Exome-Seq ahead of WGS to reduce the number of whole Human genomes to be sequenced. The idea being that the exome run will find mutations that can be followed up in many cases and only those with no hits can be selected for WGS.

    I have heard from a number of geneticists who are looking at exome sequencing as an intermediate step in population genetics, a way to increase the size of samples more affordably than whole genome sequencing makes possible at present. I don't think this will last long, as whole genomes offer much more for population genetic analysis and are rapidly dropping in price, but that depends on how technology develops. If we are consistently in the situation where researchers can multiplex 50 exomes at high coverage for the same price as one whole genome, it may make sense to use that strategy for a long time.
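
    The trade-off between target size and coverage can be sketched with simple arithmetic. The example below uses the post's round figures (a 3-gigabase genome and a 30-megabase exome) and a hypothetical budget of raw bases. It assumes reads land only on the target and are spread evenly, and it ignores capture reagent costs, so real exome coverage would be lower than these numbers.

```python
GENOME_BP = 3_000_000_000   # ~3 gigabases, as in the post
EXOME_BP = 30_000_000       # ~30 megabases, as in the post

def mean_coverage(total_bases: float, target_bp: int, n_samples: int = 1) -> float:
    """Mean depth per sample if total_bases are spread evenly over the targets."""
    return total_bases / (target_bp * n_samples)

# Hypothetical budget: the raw bases needed for one 30x whole genome.
budget = 30 * GENOME_BP

print(mean_coverage(budget, GENOME_BP))      # one whole genome
print(mean_coverage(budget, EXOME_BP))       # one exome
print(mean_coverage(budget, EXOME_BP, 50))   # fifty multiplexed exomes
```

    With these idealized inputs, the bases behind one 30x whole genome would cover a single exome 3000 times, or fifty exomes at 60x each.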

    23andMe is starting an exome sequencing project. Daniel MacArthur's comments on G+ and the subsequent reader comments are interesting.



  • Personalized genomics beats personalized genetics

    Fri, 2011-09-16 01:00 -- John Hawks

    Joe Pickrell encountered sticker shock when faced with the prospect of a medical sequencing test: "The week that I worried I had a rare genetic disease".

    What’s really striking to me is that the price of whole genome sequencing is already competitive with commercial Sanger sequencing tests of individual genes.

    Amazing how much patent-laden (and labor-intensive) sequencing work can charge to insurance.

  • Delete the troubling data

    Thu, 2011-04-14 14:00 -- John Hawks

    Misha Angrist turns on the sarcasm filter for a proposal to discard raw data that may trouble research subjects ("If you want to destroy my sweater"):

    Pay attention, kids: If it poses an ethical problem, then the obvious thing to do is to just throw it away! Delete it! Burn it! Shred it! Avert your eyes! The patient/research participant/taxpayer won’t mind! Trust me!

    This is so annoying. It's cheaper in many contexts to do genome-wide genotyping than to assay specific gene variants. So we'll increasingly see gene testing done on whole-genome platforms of various kinds.

    But doctors don't order clinical tests for whole genomes, they order particular genetic tests. It's an obvious strategy for a testing company to provide only the ordered results, and either retain or discard further data, in the hopes of additional sales later. The company can upsell its "filtered" service as including additional validation or additional interpretive information of the kind that software can automatically add (for example, short-range phased haplotypes).

    Angrist references a suggestion from an academic paper that a subject's APOE status should be blindly deleted from such results, to avoid the necessity of informing the subject about Alzheimer's risk.

    This is the sort of thing we need to be thinking harder about -- how to alert unsuspecting people to minor or moderate risks that will be routine in whole-genome data.

  • Genomes too cheap to meter

    Wed, 2011-01-12 00:03 -- John Hawks

    Matthew Herper is a science and medicine contributing writer at Forbes.com. He has just written a series of posts themed as "Gene Week", focusing on advances in genomics. One of the most provocative, "Why You Can’t Have Your $1,000 Genome", focuses on the hidden costs of interpretation and of the high coverage necessary for clinical use of genome data.

    His argument is that even if the cost of sequencing a low-coverage genome goes to $1000, the true cost of using the data will remain much higher:

    Great buzzword, but it may never happen, especially not any time soon and especially not at a cost of $1,000. Research costs for sequencing a human genome may drop that low very soon, but that doesn’t include paying the doctors or the cost of information technology to process the data. Research genomes are not accurate enough for medical use. Getting better accuracy requires sequencing the DNA more times, which drives the cost back up. I’d think if we’re talking about actual medical use, $10,000 is a more accurate number. Certainly, it is not going to drop below the $2,000 level for a magnetic resonance imaging scan. And once the technology is in use, I think it is possible that the costs will go back up.

    Daniel MacArthur replied to this argument, "Why you CAN have your $1000 genome - so long as you learn what to do with it".

    None of this is simple, but it will become easier with time. As the retail costs of sequencing drops, a substantial niche will develop for innovators providing affordable, intuitive, accurate interpretation tools (embryonic versions already exist: see, for instance, Promethease or Enlis Genomics). Open-source academic software built for large-scale sequencing projects will be adapted for use by non-specialists. The increasing availability of large-scale computing power (for instance, via Amazon EC2), coupled with this intuitive software, will make even compute-intensive analyses available to the educated, motivated lay-person.

    MacArthur sketches out a genome interpretation landscape in which professionals and tinkerers support a community of genome hobbyists. This landscape is already taking shape thanks to MacArthur and many others (even me), and it's a solid prediction that this kind of human genomics will become more and more important, using open access tools to investigate history and phenotype prediction.

    Herper has a reply and consideration of the two posts, "Debating The $1,000 Genome". In it, he notes the comments of several professionals that the $1,000 number itself is not the important fact; what matters is the availability of sequencing within that order of magnitude.

    The inevitability of the $1000 genome has already made it irrelevant. We should expect a $1000 genome announcement this year. This will be hype, because the real $1000 genomes won't be here until...next year! Before the end of 2014, whole genome sequences at 4x coverage will cross the $100 mark. I think there's a good chance they will be less than $50 at that time.

    Based on numbers I've seen, those numbers are around six months optimistic. Geneticists are already planning projects anticipating $100 genomes -- some suggest that the next big project should be a "Million Genomes", because there isn't any sense bothering with a hundred thousand.

    It helps to realize what is driving the rapid reduction in price. The "next-gen" approaches have shared many basic assumptions (e.g., in situ amplification) but have not thus far been stymied by bottlenecks caused by patent overlaps because they have progressed along semi-independent pathways. As the technology moves to long single-strand approaches, multiple approaches still seem viable, although we are awaiting a solid demonstration of these methods at higher throughput. Price is not the only factor differentiating startups -- sequence quality and ease of sample prep are very important. But major research institutions justify new equipment by runtime and amortized acquisition costs, over years. A new sequencer needs to run enough this year not only to pay its overhead, but to pay the opportunity costs of a five-fold cheaper sequencer next year. As long as progress along multiple trajectories is possible, tech startups will continue the rapid reduction in per-genome price -- because price is the most visible way of differentiating their offerings and extending the sequencing market.

    This cannot continue indefinitely: at some point there may remain only one viable path to faster or cheaper sequencing. Or one company may be able to make startups more difficult by cornering the essential patents along multiple development trajectories.

    There are two fundamental questions:

    1. Where's the bottom? Cells replicate DNA fairly slowly, and they don't transmit the resulting data in a form that computers can read. Today, rapid sequencing depends on running massively parallel reactions, exploiting imaging electronics and computers, and it is far from the limit of either (which themselves continue to increase in capacity subject to Moore's Law). We may be surprisingly close to a portable sequencing device the size and expense of a film camera.

    But the bottom of the market may depend less on supply and more on demand. Maybe human genomes will be clinical necessities, or maybe they will remain niche diagnostic data. In either case, there's an upper limit. We'll never need much more medical sequencing than we have people.

    Genomics cannot work on the microcomputer model. Computer companies sell new equipment to people and companies who already have lots of last-generation equipment. Genomics cannot work on that model: once you have your genome sequence in the cloud, you won't need it again. By itself, this business model stabilizes at fairly expensive prices. As long as you need to bill a technician and maintain highly regulated records, your service costs will be very high. That leaves little incentive for lowering the sequencing cost. It's like the genomics DMV -- when was the last time your state gave you a technology rebate on vehicle registration?

    Future cost reductions must depend on applications of massive sequencing in agriculture, genetic engineering and synthetic DNA. Those areas can support a different business model, one that can operate on an annual basis. They create potentially a much larger, decentralized global market, like the market that supported the development of microcomputers.

    The problem is developing the applied genetics -- the "killer app" to take advantage of the cheaper technology. And that brings us to...

    2. Where's the utility? The reduction in cost is happening despite the fact we don't really know why genomes will be useful. Both Herper and MacArthur agree that one obstacle to clinical use of genomic data will be annotating and interpreting the sequences. This problem generalizes to applications far beyond clinical contexts. How do we use genomes to do anything interesting or useful?

    At the margins, of course, we know what to do with a genome. Look for damaging mutations. This is a straightforward empirical challenge -- find out how alterations to particular nucleotides would affect phenotypes, both by themselves and in combination with common variants elsewhere in the genome. Annotation and interpretation will require us to have genomes from millions of people and expression data from hundreds of thousands of human tissue samples and animal models.

    Every other use of genomic information poses similar challenges. Do we want to use genomes to place individuals in a genealogical context? We need to work out the genealogical trees for loci genome-wide and find the historical causes for correlations within these trees. Want to use genomes to predict the response of old-growth forests to rainfall fluctuations? Testing 10,000 dead blackbirds for causal factors? Same story -- gene variants, microclimates, and functional networks.

    There will be an expensive, professional class of genome interpretation. In medicine, these will be clinicians or clinical assistants of some kind. In applied genetics, these will be research geneticists and postdocs. If you want a personalized genealogical consultation, a gut microbiome assessment of your beef cattle, or a read on that speck of black mildew in the basement, there will be a consultant for you. Like today's IT consultants, these genome consultants' knowledge, skills, and price structures will vary. They may offer knowledge of the latest discoveries, a crew of paid tinkerers, or the comfort of hand-holding, but mostly they're adding value to the software.

    Off-the-shelf software may always be a step behind the state of the art in genome interpretation, but it will always be cheap. Today you can compare your genome to cataloged SNP-phenotype associations for free, or you can pay $5 a month to 23andMe for a more user-friendly interface and non-expert information presentation. I expect HMOs to incorporate similar information applications as they embrace genomics, just as most are currently moving to patient-accessible charting software. Last year's research information will always be cheap, and for most purposes it will be good enough.

    Put these things together, and personal genomics today is where personal computing was in 1973. We haven't yet had an Altair, much less an Apple 2. But it's almost in reach. Quasi-professional hobbyists can cobble together data using primitive tools, and carry out the same analyses as postdocs. Sequencing costs are falling by an order of magnitude every other year. The state of the art in interpretation is totally free for the trained, with applied genomics and synthetic biology as growing industries. Genomes may not be literally too cheap to meter, but they'll certainly be, as George Church has suggested, free with additional purchase.

  • Now for anthropological genomics

    Wed, 2010-10-27 15:30 -- John Hawks

    The first of the papers describing results from the 1000 Genomes project has been released today in Nature [1].

    This is "big project" genomics news. Like many announcements of this kind, it represents more of a public relations milestone than actual scientific advance. Some of the project data have been publicly available for a while -- the 1000 Genomes and HapMap projects have to their great benefit been based on the strategy of immediate data release. The new paper and its supplements include many summary statistics and report on new genetic variants that have been found -- there's a lot of information here. But most of the interesting science is just getting started. A paper like this really represents the opening of a race to use the new data for innovative research.

    Here in my lab, we are exploring the ways that whole genome sequencing can change our study of human population history. A large part of this is our work on recent selection, of course ("Why human evolution accelerated"). Whole-genome sequencing is not essential to finding many recently selected regions of the genome, but it will help enormously in narrowing down the actual functional changes that affected fitness in past populations.

    Whole-genome sequencing will rapidly improve our ability to resolve the population history of Pleistocene humans. For older events -- going back to the origins of Homo -- whole-genome sequencing will give us samples of genealogies from across the genome. We will be able to resolve some very ancient episodes of population mixture, and we have a chance of testing what kinds of events accompanied the rise of our genus. Even for events of the Late Pleistocene and Holocene, for which haplotypes of SNP markers can be useful without resequencing, whole-genome sequencing can be tremendously valuable. Reconstructing haplotypes from diploid genotypes requires us to make some assumptions about the demography of the population, which may be exactly what we are trying to discover. A sample of genomes sequenced at high read coverage will free us from some of those assumptions. It's really exciting stuff for an anthropologist.

    All those are reasons why the data will be useful for us in the long term. But at the moment, the data are not nearly so rich. The current paper reports:

    1. Whole-genome sequencing at 42x coverage of six individuals: one three-person family trio from Utah and one family trio from Nigeria.

    2. Low-coverage (2x-6x) sequencing of 59 Yoruba, 60 Utah residents, 30 Chinese and 30 Japanese individuals. These are a subsample of the original HapMap samples.

    3. Sequencing at 50x coverage of 8140 exons in 697 individuals. These are a subset of the HapMap v.3 population samples, including Yoruba, Luhya, Utah, Tuscan, Japanese and Chinese samples. These exons come from 906 genes targeted "randomly".

    It's pretty far from a thousand genomes, and even farther from the stated goal of 2400 genomes. The low-coverage genomes are not sufficient to call genotypes across most of the genome. This is a persistent problem with "whole-genome" sequencing projects so far. A person's whole genome is mostly diploid -- two copies of most everything. Recently, we've seen several "whole-genome" sequences where each base is given a consensus value. SNP variants may be called against other people's genomes, but rarely is there sufficient coverage to call SNPs within the individual. There are exceptions -- a handful of public whole genomes are at high coverage. The exon sequencing here should be enough to call SNPs in these functional regions with great confidence. The family trios also should have enough to call SNPs. So some of these will be our first chance to do actual population genetics on diploid genome-wide sequence data.
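
    Why low coverage makes diploid genotype calls hard can be sketched with a simple Poisson model of read depth. The code below is an illustration under stated assumptions, not the project's calling method: reads from each chromosome copy are treated as independent Poisson draws with mean equal to half the average depth, sequencing and mapping errors are ignored, and the threshold of two reads per allele is an arbitrary choice for the example.

```python
import math

def poisson_sf(k_min: int, lam: float) -> float:
    """P(X >= k_min) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below

def p_both_alleles_seen(mean_depth: float, min_reads: int = 2) -> float:
    """Chance that each allele at a heterozygous site gets >= min_reads reads.

    Assumes reads from each chromosome copy are independent Poisson draws
    with mean mean_depth / 2, and ignores sequencing and mapping errors.
    """
    return poisson_sf(min_reads, mean_depth / 2) ** 2

# Depths spanning the low-coverage range and the trio coverage in the post.
for depth in (2, 4, 6, 42):
    print(f"{depth:>2}x: {p_both_alleles_seen(depth):.3f}")
```

    Under this model only about a third of heterozygous sites show both alleles at least twice at 4x, while at 42x essentially all of them do. That is the gap between a consensus sequence and a diploid genotype.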

    One important piece of analysis in the paper is the confirmation of a low rate of de novo mutations in the children of the family trios. I discussed a result last spring that came to a very low rate of per-site mutation ("A low human mutation rate may throw everything out of whack"). The rate in that paper was 1.1 x 10^-8 per site per generation. The current paper comes to a rate between 1.0 and 1.2 x 10^-8. I have some more written on this issue and I'll integrate the new finding and post it later in the week. This aspect of the study is pretty important to our understanding of human evolution.

    The paper makes an interesting distinction between "accessible" and "inaccessible" portions of the genome -- accessibility meaning ease of mapping and aligning sequence reads:

    Accurate identification of genetic variation depends on alignment of the sequence data to the correct genomic location. We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads (Supplementary Information). This approach balances the need to reduce incorrect alignments and false-positive detection of variants against maximizing the proportion of the genome that can be interrogated.

    For the low-coverage analysis, the accessible genome contains approximately 85% of the reference sequence and 93% of the coding sequences. Over 99% of sites genotyped in the second generation haplotype map (HapMap II) (ref. 4) are included. Of inaccessible sites, over 97% are annotated as high-copy repeats or segmental duplications. However, only one-quarter of previously discovered repeats and segmental duplications were inaccessible.

    It's an interesting decision -- just focus and report on the majority of the genome where alignment is easier.

    The paper discusses selection briefly. There's not much new here other than the identification of candidate causal variants for some selected haplotypes.

    First, it provides a more comprehensive catalogue of fixed differences between populations, of which there are very few: two between CEU and CHB+JPT (including the A111T missense variant in SLC24A5 (ref. 38) contributing to light skin colour), four between CEU and YRI (including the −46 GATA box null mutation upstream of DARC (ref. 39), the Duffy O allele leading to Plasmodium vivax malaria resistance) and 72 between CHB+JPT and YRI (including 24 around the exocyst complex component gene EXOC6B); see Supplementary Table 7 for a complete list. Second, it provides new candidates for selected variants, genes and pathways. For example, we identified 139 non-synonymous variants showing large allele frequency differences (at least 0.8) between populations (Supplementary Table 8), including at least two genes involved in meiotic recombination -- FANCA (ninth most extreme non-synonymous SNP in CEU versus CHB+JPT) and TEX15 (thirteenth most extreme non-synonymous SNP in CEU versus YRI, and twenty-sixth most extreme non-synonymous SNP in CHB+JPT versus YRI). Because we are finding almost all common variants in each population, these lists should contain the vast majority of the near fixed differences among these populations. Finally, it improves the fine mapping of selective sweeps (Supplementary Fig. 14) and analysis of the dynamics of local adaptation. For example, we find that the signal of population differentiation around high Fst genic SNPs drops by half within, on average, less than 0.05 cM (typically 30–50 kb; Fig. 5d). Furthermore, 51% of such variants are polymorphic in both populations. These observations indicate that much local adaptation has occurred by selection acting on existing variation rather than new mutation.

    This last point is not especially demonstrated by the new sequencing data. What we are looking at is few complete sweeps, but that's expected even if all the selected variants were novel mutations -- there just hasn't been time to fix many variants. It remains to be shown the extent to which standing variants are involved in this selection, partial sweeps of new mutations, or parallel adaptations ("Spatial dispersal, parallel adaptation, and the 'Stooge effect'"). We'll probably see a lot more interesting work on recent selection coming out of the new data.
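
    For readers unfamiliar with the Fst measure mentioned in the quoted passage, here is a minimal sketch of how population differentiation at a single biallelic site relates to allele frequencies. It uses the textbook heterozygosity-based definition with two equally weighted populations and no sample-size correction. The quoted passage does not say which estimator the paper used, so this is an assumption for illustration, and `fst_two_pops` is a made-up name.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """Wright's Fst for one biallelic site in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        return 0.0   # monomorphic overall; Fst is undefined, so report 0
    return (h_total - h_within) / h_total

print(fst_two_pops(1.0, 0.0))   # a fixed difference between populations
print(fst_two_pops(0.9, 0.1))   # an allele frequency difference of 0.8
print(fst_two_pops(0.5, 0.5))   # no differentiation
```

    A fixed difference gives the maximum value of 1, identical frequencies give 0, and a frequency difference of 0.8 centered on one half gives an intermediate value of about 0.64.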

    Science has a companion paper to the Nature data summary, focusing on copy number variation and gene duplications. I will review that one separately.

    UPDATE (2010-10-27): Dienekes pulls out an interesting passage about the Y chromosome sequences, which in at least one case recover many markers separating haplogroups once thought to be much closer to each other. Not sure what to make of that yet.



  • A low human mutation rate may throw everything out of whack

    Thu, 2010-03-18 16:30 -- John Hawks

    Last week, a paper looking for the genetic causes of Miller syndrome reported the whole genomes of four members of a single family: two siblings with the disorder and their two parents without. The idea was that they would simply compare the affected and unaffected genomes. They would then find candidate loci that might account for Miller syndrome in the affected siblings. By exploiting some other sources of information, they found what they were looking for. Daniel MacArthur covered the story in his post, "Disease hunting with whole genome sequences: the good news, and the bad news".

    I got interested in another aspect of the story. With whole-genome sequences of parents and offspring, it becomes possible to directly determine the rate of mutations in each generation. The paper by Roach and colleagues did just that -- they counted 28 in the 2.3 billion bases of sequence they included in their comparison. That makes a per-site mutation rate of 1.1 x 10^-8 per generation.

    Which is a pretty interesting number. You see, it's less than half what it ought to be:

    [O]ur estimated human mutation rate is lower than previous estimates, the most widely cited of which is 2.5 x 10^-8 per generation (10) based on three parameters: a human-chimpanzee nucleotide divergence per site (Kt) of 0.013, a species divergence time of five million years ago, and an ancestral effective population size of 10,000. More recent estimates indicate a nucleotide divergence of 0.012 (9), species divergence time between six and seven million years ago (11–15), and ancestral effective population size between 40,000 and 148,000 (16–19). With these parameter ranges and a generation length of 15 to 25 years, the mutation rate estimate is between 7.6 x 10^-9 and 2.2 x 10^-8 per generation, which is consistent with our intergenerational estimate of 1.1 x 10^-8. Our estimate is within one standard deviation (SD) of an earlier estimate of 1.7 x 10^-8 (SD: 9 x 10^-9) based on 20 disease-causing loci (20). The rate we report is for autosomes, and should be several-fold lower than that of the Y chromosome, as in the male germline more cell divisions occur per generation. Though our rate differs approximately as expected from the recently reported estimate of 3.0 x 10^-8 (95% CI: 8.9 x 10^-9 – 7.0 x 10^-8) for the Y chromosome, the error rates make this difference not significant (21).

    You can see the obvious implication: If this mutation rate is accurate, then the average human-chimpanzee gene divergence has to be up around 11 million years ago. That can be accommodated with a 7-million-year-old species divergence only if we assume a very large ancestral population -- on the order of 50,000 or higher. Or, the ancestral effective size could be lower -- but that would make the species divergence substantially older -- 9 million years or more.
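
    The arithmetic behind that implication can be sketched as follows. The code assumes the standard neutral model, in which per-site divergence K equals twice the mutation rate times the average coalescence time in generations, and in which average gene divergence exceeds species divergence by 2 x Ne generations. The post does not spell out this formula, and the 20-year generation length is an assumption chosen from within the quoted 15 to 25 year range. Function names are illustrative.

```python
def gene_divergence_years(k_per_site: float, mu_per_gen: float, gen_years: float) -> float:
    """Average coalescence time in years, from K = 2 * mu * T (T in generations)."""
    return k_per_site / (2 * mu_per_gen) * gen_years

def implied_ancestral_ne(gene_div_years: float, species_div_years: float, gen_years: float) -> float:
    """Ancestral Ne if gene divergence = species divergence + 2 * Ne generations."""
    return (gene_div_years - species_div_years) / (2 * gen_years)

K = 0.012     # human-chimpanzee divergence per site, from the quoted passage
MU = 1.1e-8   # per-site, per-generation rate reported by Roach and colleagues
GEN = 20      # assumed generation length in years (quoted range: 15 to 25)

t_gene = gene_divergence_years(K, MU, GEN)
print(f"gene divergence: {t_gene / 1e6:.1f} million years")
print(f"Ne for a 7-My species split: {implied_ancestral_ne(t_gene, 7e6, GEN):,.0f}")
```

    With these inputs the average gene divergence comes out near 10.9 million years, and fitting a 7-million-year species split requires an ancestral effective size of nearly 100,000. Both numbers move substantially with the assumed generation length, which is one reason the post gives only rough figures.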

    There is a second implication. Most studies of human genetic variation have assumed that 5-million-year-old human-chimpanzee divergence and the high associated rate of mutations. If the true rate is less than half that, then the coalescence times of human genes are more than double most estimates. That would include our estimates of human-Neandertal genetic differences.

    Well, that's a fine pickle.

    I'm not quite ready to believe the very low rate estimate. The analysis in this paper uncovered tens of thousands of false positives, and had to filter through those to arrive at 28 true mutations. The filtering involved resequencing all the positives to determine which were true and which were false, but maybe there's room in there for a substantial number of false negatives, too.

    If this low estimate were true of the human-chimpanzee divergence, it would imply vastly higher ages for other primate divergences, or a much lower rate on the human lineage specifically. So that allows another check on the process.

    But generally, I'll be looking at whole-genome family comparisons with great interest, because they will give us a much more precise understanding of the rate of mutations and recombinations across the genome.

    References:

    Roach JC and 14 others. 2010. Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing. Science (early online) doi:10.1126/science.1186802

    Synopsis: 
    Whole genome sequencing of a family finds a very low number of mutations, suggesting evolution doesn't have the timescale we thought.
  • Deep versus wide genomes

    Thu, 2010-03-04 21:18 -- John Hawks

    Remember Genome 10K? Well, here's a new study by Michel Milinkovitch and colleagues that points out the deficiencies of comparative data from low-coverage (2X) genomes:

    2× genomes - depth does matter

    Here, using recently-developed comparative genomic application systems, we evaluate the impact of low-coverage genomes on inferences pertaining to gene gains and losses when analyzing eukaryote genome evolution through gene duplication. We demonstrate that, when performing inference of genome content evolution, low-coverage genomes generate not only a massive number of false gene losses, but also striking artifacts in gene duplication inference, especially at the most recent common ancestor of low-coverage genomes. We show that the artifactual gains are caused by the low coverage of genome sequence per se rather than by the increased taxon sampling in a biased portion of the species tree.

    They conclude that a diversity of low-coverage (2X) genomes may not be as useful as a smaller number of genomes at higher coverage. Wide coverage is good for testing conserved loci, but deep coverage will be necessary for many other kinds of comparisons.

    References:

    Milinkovitch MC, Helaers R, Depiereux E, Tzika AC, Gabaldón T. 2010. 2X genomes -- depth does matter. Genome Biology 2010, 11:R16 doi:10.1186/gb-2010-11-2-r16
