i5K/Hackett AGC talk 2011-06-09
- Presented at the 5th Arthropod Genomics Consortium Annual Meeting, June 9, 2011
Introduction
In my talk today, I’ll be providing a status report on the initiative to sequence 5,000 insect and other arthropod genomes over the next 5 years, an initiative we are calling i5k. Specifically:
- What is our idea and where did it come from?
- Where are we now?
- How do we keep going?
Why do we even think we can succeed with a large science project like this when budgets are in decline in most places? Part of the answer is that sequencing costs are also in decline. The i5k price tag is about $15 million ($5M for sequencing 5,000 genomes at $1,000 per genome, $5M for bioinformatics and annotation pipeline work, and $5M for systems biology and mining the data).
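As a rough check on that arithmetic, here is a minimal sketch in Python. The dollar figures are the ones quoted above; the variable names are illustrative assumptions and not part of any i5k tooling.

```python
# Back-of-the-envelope check of the i5k budget figures quoted in the talk.
# All names here are illustrative; only the numbers come from the talk.
GENOMES = 5_000
COST_PER_GENOME_USD = 1_000  # anticipated per-genome sequencing cost

budget_usd = {
    "sequencing": GENOMES * COST_PER_GENOME_USD,
    "bioinformatics_and_annotation": 5_000_000,
    "systems_biology_and_data_mining": 5_000_000,
}
total_usd = sum(budget_usd.values())

# Print each component and the total in millions of dollars.
for item, cost in budget_usd.items():
    print(f"{item}: ${cost / 1e6:.0f}M")
print(f"total: ${total_usd / 1e6:.0f}M")
```

The sequencing line only comes to $5M if the per-genome cost really does reach $1,000, which is the assumption the whole estimate rests on.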
What is our idea?
So, what is our idea and where did it come from? It started, again, with sequencing capacity rising, and costs plummeting.
For capacity, recent support for the Beijing Genomics Institute, alone, has greatly increased world sequencing capacity. One might look at it this way. Some have estimated that there will be a theoretical sequencing capacity of about 24,000 gbp at the world’s collective sequencers by the end of this year. This is 240,000 human genomes at 30X coverage.
As for sequencing cost, it is no longer the limiting concern in obtaining the genome of any one arthropod.
While sequencing the honey bee cost about $9 million about 10 years ago, we anticipate that sequencing costs for an insect of 400 mbp, less than $10,000 now, will soon be under $1,000, the basis for our i5k cost estimate. If Hodgkin’s Law (related to Moore’s Law) continues to hold, with the cost of DNA sequencing technology declining by half every 2 years and the amount of sequence in GenBank doubling every 18 months, these trends will likely continue until we are using genomics extensively even at the level of population genetics!
Cost is nevertheless still a factor, whether in getting physical maps or, mostly in terms of time, in getting inbred arthropods, isolating DNA, annotating, or getting cell lines.
The bottom line, nevertheless, is that it’s not a matter of “Will we sequence large numbers of genomes?” – I think that is fairly inevitable – but rather one of “How do we shape this process to fit our individual and collective needs?” In this sense, we believe i5k will make two important contributions to ongoing genomics efforts:
- i5k will attract and leverage funds sooner rather than later, and more rather than less, provided that it meets societal needs. That is, i5k puts food in our belly, biofuels in the tank, and kills blood-sucking mosquitoes that make kids sick. Of course, there is also good fundamental science to come from this.
- i5k will, almost by definition, facilitate comparative genomics. Related to this, we plan to use a phylogenetic approach when selecting many of our targets, thus capturing key arthropod branching nodes.
That is, the i5k project will allow us to collectively speed and guide our arthropod genomics efforts, therefore furthering the goals of AGC members. i5k, in turn, will benefit from AGC in being able to draw on organizational and informational support.
So, this potentially unique dynamic between AGC and i5k is the first of three factors that inspired the founding of i5k. And you’ll notice that AGC members are leaders in the i5k group.
The second factor that triggered the forming of i5k was the promise of other large genomics efforts, e.g., the G10K effort to sequence 10,000 vertebrate genomes.
Under the leadership of Steve O’Brien [at NIH’s NCI], David Haussler [at UC-Santa Cruz], and Oliver Ryder [at the San Diego Zoo], the vertebrate community launched their Genome 10K project in 2009, with a publication in J. Heredity. Since then, vertebrate sequencing projects have gone from 32 mammals to 310 vertebrates, mostly due to commitments of 101 genomes by BGI, 32 mammals by NIH, 30 primates by Broad, and another 100 vertebrates from other NHGRI-funded centers.
This is a large commitment, particularly since vertebrate genome sizes and sequencing costs are generally a few times larger than arthropod ones. I say “generally” because the locust genome, recently sequenced by BGI, is about 7,000 mbp, over 2x larger than the human genome. Some are 17,000 mbp, almost 190x larger than the smallest arthropod genome that I know of, the 90 mbp spider mite.
The third factor was the recent significant commitment to arthropod sequencing, which provided the final push to form i5k. For arthropods, the NCBI database has now grown to 87 insects, and over 100 arthropods in total. And in addition to a recent commitment by BGI to sequence 100 insect genomes, 15-plus bees were about to be put in the queue due to an NIH Pioneer Award to Gene Robinson. It was in this context that Gene approached me to see if a more inclusive effort could be organized. The “i5k insect and other arthropod sequencing effort” was born from this, and the initiative was launched with a letter in Science on March 18 of this year (2011).
So, what kinds of societal challenges do we hope to address with this collective effort? For starters:
- Energy: meeting energy needs – through biofuels, in part, and those biofuels have pest problems;
- Climate Change: monitoring climate change through better understanding of the metabolism and ecology of candidate sentinel arthropods, and, the reverse process, preparing for pest range expansion due to climate change;
- Food Safety: There is a close link between insects feeding on crops and aflatoxin production at the site of the insect feeding; and,
- Nutrition and Food Security: providing for nutrition and global food security which of course means controlling insects, and also growing some of my favorite aquatic arthropods.
We need to make the legitimate case that connects SEQUENCE to SCIENCE to SOCIETY!
Our project not only addresses those needs, but also needs for protecting our forests – one longhorned beetle is estimated as a $670 billion threat to the United States alone. And based on very conservative actuarial tables, there are $50 billion in losses worldwide due to vector-borne diseases. We not only need to create a more liveable world, but also a more sustainable and biosecure one!
So, more specifically, what types of breakthroughs do we expect these sequencing projects to bring? I see three types of advances or products – emergent, novel, and transformative.
For EMERGENT advances, we see:
- improved targets for pesticide discovery, including interruption of pest immune pathways, and ease in finding RNAi targets;
- applications to solving malaria and problems of vectored animal pathogens;
- improved SIT and enhanced biological control agents;
- forensics applications;
- aquaculture improvement;
- finding pathogens by subtractive genome work – such as has already been done for SARS;
- and, as our logo suggests, improved biodiversity conservation.
NOVEL advances will likely include:
- utilizing arthropod sensory receptors as detectors for biodefense;
- elucidating tritrophic relationships, including those with vectored plant diseases; and,
- elucidating insect roles in carbon sequestration & methane effects, particularly those associated with ants and termites.
TRANSFORMATIVE advances will likely include:
- better understanding of brain function;
- discerning evo/devo relationships;
- using genomic patterns to infer methylation and other methods of gene regulation;
- molecular-based arthropod advancements; and
- in systems biology – elucidating compartmentalization of cellular function, with direct application for our control of cellular processes. We are hopeful that this massive set of virgin data, representing 5,000 insects, will attract young applied mathematicians into the field, since there is some exciting work to be done here in the area of algorithm development, some new math approaches.
So, if we have all this promise, there must be evidence for this optimism – based on prior accomplishments, right? As a matter of fact, there is lots of evidence and we’ll be pulling this together; it will be on the i5k website and would also make a good publication. Due to time constraints, here are just four examples that apply to colony collapse disorder of bees, a project of intense societal and congressional interest:
- The effort to control colony collapse disorder in bees has benefited from discovery of novel microbes that might have been unrecognizable without first masking the bee genome sequences;
- Genomes of Nosema, viruses, and varroa, important causes of bee mortality, are being examined to find targets for use in RNAi silencing, a promising strategy;
- Insect genome projects revealed comparatively poor cytochrome p450 detoxification pathways in bees, not a surprise since plants are in the business of rewarding, not poisoning, pollinators. This information is needed to find safe ways to control crop pests while at the same time sparing the bees for pollination. This detox gene finding is a good example of the need for comparative genomics.
- There have also been fundamental discoveries, e.g., discovery of the honey bee’s methylation system, and its involvement in queen production. This has opened up the field of epigenetics, or at least provided a new model in play.
One can only imagine what critical discoveries will be made when we have 5,000 insect genomes to compare.
And let me add that the number of insects to be sequenced, 5,000, is more of a placeholder than an actual number. It is important to appreciate the immensity of trying to cover all key insects in agriculture and health, and key arthropods… as well as, from a systematics and comparative genomics viewpoint, the desire to sequence representatives from all deep branch nodes of the arthropods. Whereas the vertebrate community has a goal of sequencing 10,000 of the 60,000 described vertebrate species, often for purposes of biodiversity preservation during possible extinctions, we have over a million described species of insects alone, and some suggest 30 million undescribed ones. There are 29 orders of insects, and over 500 speciose families. At 5,000 target arthropods, we can get a good sample of arthropod biodiversity – up to 20 genera for each of these speciose families – all the while aware that we will need many more dips of the pipette into the deep lake of arthropod diversity. From a comparative genomics viewpoint, we have the challenge of many more gaps to fill in the field of information, but the reward of a commensurately greater amount of information to harvest!
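To put that sampling problem in numbers, here is a minimal sketch in Python using only the figures just quoted. It treats "over a million" described insect species as a lower bound, so the i5k fraction it prints is an upper bound. The function and variable names are illustrative assumptions.

```python
# Sampling fractions implied by the figures quoted in the talk.
# Names are illustrative; only the numbers come from the talk.
G10K_TARGETS = 10_000
DESCRIBED_VERTEBRATES = 60_000
I5K_TARGETS = 5_000            # a placeholder figure, per the talk
DESCRIBED_INSECTS = 1_000_000  # lower bound: "over a million"


def sampling_fraction(targets: int, described_species: int) -> float:
    """Fraction of described species covered by a sequencing target list."""
    return targets / described_species


g10k_fraction = sampling_fraction(G10K_TARGETS, DESCRIBED_VERTEBRATES)
i5k_fraction = sampling_fraction(I5K_TARGETS, DESCRIBED_INSECTS)

print(f"G10K: {g10k_fraction:.1%} of described vertebrates")
print(f"i5k: at most {i5k_fraction:.1%} of described insects")
```

Under these assumptions, G10K samples about one described vertebrate species in six, while i5k would sample at most about one described insect species in 200.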
Where are we now?
We have an Ad Hoc Group that advertised the launch in Science, calling for communities to form, and for communities to nominate arthropod targets for sequencing.
We also have an i5k website, graciously maintained by Dan Lawson at Ensembl Genomes, and associated with the AGC website.
As an Ad Hoc Group, we’ve briefed the Council of Entomology Department Administrators, several USDA agency heads [at ARS, NIFA], and NSF, as well as Steve O’Brien’s group at NIH. The AGC meeting is the first opportunity to engage with the broader international community.
Success in this effort is not, in the end, going to be determined by an “Ad Hoc Group,” which was formed as a temporary group for the sole purpose of launching i5k, but rather by what you do as communities and in your roles as contributors – as biologists, as sequencers, or as computational biologists and bioinformaticians.
This effort will be an international one. Later this month, some of the AHG will be meeting with scientists at BGI, as part of a meeting associated with the International Social Insect Genomics Research Conference in China. This will hopefully continue to solidify collaboration between international efforts.
So, that is where we are organizationally. Where are we in arthropod sequencing? Well, prior to i5k, there were already 87 insect projects listed in NCBI, as well as: 6 chelicerates, including 3 ticks, 2 mites and a horseshoe crab; 1 myriapod, a centipede; and 6 crustacea, including the water flea Daphnia, an amphipod, a freshwater shrimp and the Pacific white shrimp, and a copepod, the salmon louse.
Of the insects, most are flies, 59 in all, mostly Drosophila or mosquitoes, but also tsetse fly, Hessian fly, the screwworm, a horn fly, and of special recent challenge – a tephritid and the sand fly. There are quite a few bees and ants, 16 at last count. And there are 5 true bugs, including aphids, a whitefly, a psyllid vector of citrus greening, and Rhodnius, a vector of Chagas disease. There are 4 lepidopterans, 3 silkmoth entries and a butterfly. And there is 1 each from three other orders: a Tribolium beetle, the human body louse, and a termite. It is a list that includes some major pests and beneficials, including a parasitoid wasp, as well as some key model organisms. The majority of these have some assembly, with coverage from 1x to 100+X, with 8-12X being typical.
For i5k, 76 arthropods have been nominated to date. This includes 4 already in NCBI, including the cattle tick, which has some low coverage assembly. We welcome projects already underway. These projects could, after all, often benefit from additional funding. And, in the end, we believe that projects not yet even imagined, including those that will inform projects already done, would benefit from the collective visibility given by i5k, which can help bring attention to all of our needs: first in sequencing and bioinformatics, later in post-sequencing genomics. In any event, this i5k connection is helped along by having i5k Wiki Pages interwoven with other Wiki Pages on the common AGC website.
What is almost shocking when one looks at the newly nominated arthropods is that many are household names. For example:
For veterinary and human pests:
- Bed bug
- Asian tiger mosquito
- Human head louse
- Screwworm
- Sheep blowfly
- Flesh fly
- Pharaoh ant
- House fly
- American cockroach
For crop pests:
- Brown marmorated stink bug
- Glassy-winged sharpshooter, the vector of Pierce’s disease
- European grapevine moth
- Corn earworm
- Tobacco budworm
- Beet armyworm
- Rice weevil
- Colorado potato beetle
For forest and rangeland pests:
For other arthropods:
- A crayfish
- And an Amphipod that is emerging as a new model for development
Significantly, proposed arthropods include much more diversity, including many more moths, beetles, and true bugs, which are quite underrepresented in current efforts, and also new orders such as a neuropteran lacewing, a thrips, a mayfly, and some cockroaches. We also have a crayfish and other crustaceans, as well as chelicerates.
How do we keep going?
We need to bring more people into the i5k tent, both internationally and across specialties. This project should exemplify what the nexus of science and society is about today and for the future: cross-border and cross-discipline.
Please join i5k and nominate arthropods, and continue to form your communities so that when opportunity – i.e., funding – arises, we’re ready to go.
As for projects in AGC, we will rely heavily on the Wiki Page strategy of community self-assembling. For the Wiki Page approach, thanks again to Dan Lawson for coordinating the effort. To set up a proposed arthropod target page through i5k, please go to the i5k website and enter your data. Community pages can be set up through this site as well.
Another key ingredient for these projects will be making sure we have enough high quality DNA from inbred lines when opportunity arises. This will greatly increase the chance of accurate assembly.
We’ll also soon be posting suggested guidelines for good sequencing projects on our website, with input from Hugh Robertson and Stephen (fringy) Richards.
Beyond sequencing, we are also continuing to upgrade our website. We welcome ideas for making it meet your needs. For example, should we have lists of funding opportunities, or of specialists such as bioinformaticians? We also need to find collaborators to host sequencing data. And to address these and other challenges, we have formed Task Teams and are looking for additional volunteers to lead and work with these teams. Please see me or anyone in the Ad Hoc Group – listed on the i5k website – if you are interested. The Task Teams are [as amended following discussion at the meeting on Sunday June 12th, 2011]:
- Funding and resources
- 2012 workshop and training beyond
- Criteria for prioritization of arthropods/Pre-sequencing considerations
- Systematics/phylogeny considerations
- Community access/website
- Post-sequencing bioinformatics (large scale, for minimum genome project)
- Community curation of a genome sequencing project
We have a Workshop Team because we are planning a workshop to be held in early 2012 to bring arthropod communities together with Federal agencies and other interested partners. We’ll be posting information on this meeting on our website. Besides facilitating formation of communities and addressing common concerns such as bioinformatics needs, a major goal of the workshop will be to produce white papers. And, we could really use some collective genius in coming up with solutions to some intractable problems, e.g., how to deal with repetitive DNA which can scramble exons and regulatory regions. Or how to deal with high levels of polymorphism.
The bottom line for all of this work is the need for strong communities around each arthropod target or goal. We expect that the communities might actually vary in type from:
- Gene Function-based (e.g., sense receptors)
- Taxon-based (e.g., moth)
- Sector-based (e.g., Agriculture, Medical/Veterinary, Climate Change, Energy)
Other communities might be highly novel, e.g., ones focusing on systems biology and composed, in part, of applied mathematicians such as those doing network analysis. Such a community would work to reveal cellular compartmentalization and functional pathways, discerning these units from sequence data.
Again, this work might be done with or without my or your help, but it won’t necessarily be done with our goals in mind if we don’t get actively involved. We need more collaborators at all levels, from leadership and membership on task teams to leaders and members of the sequencing project communities.
Let’s have the fun of making this happen together!