i5K/Post sequencing informatics
From ArthropodBase wiki
Breakout at AGS 2011
Participants
- Jeff Boore, Genome Project Solutions
- Bastien Boussau, UC Berkeley
- Sanjay Chellapilla, Kansas State
- Dave Clements, Emory
- Stefano Colella, INRA
- Alistair Darby, Liverpool
- Xiaodong Fang, BGI
- Susan Furstenberg, Genome Project Solutions
- Rainer Lehtonen, Helsinki
Notes
Notes are by Dave C, with some corrections by Jeff B.
These notes are raw.
- What are we supposed to write in a month?
- Build on platforms that are already there.
- Not to reinvent the wheel.
- Jeff argues for standards in gene annotation methods. This doesn't preclude special efforts on some genomes, but when comparing for a better understanding of genome evolution, we want to see how the genomes themselves have changed, without measuring instead the differences in how the genomes were annotated.
- Xiaodong: It is difficult to come up with standard pipelines, given the differences in sequencing platforms and protocols.
- Alistair: Analysis and sequencing need to be tailored.
- Jeff: if we are really going to do 5000 genomes we want to make it possible to compare them.
- Alistair: Enforce at database level: NCBI/EBI.
- Bastien: Could we define a series of tests that data must pass?
- Stefano: Official gene set: every community has done it its own way.
- Come up with some standard of sharing.
- Have NCBI data, GFF, gene set
- Be useful to decide something on naming.
- Some nomenclature.
- Most of us follow FlyBase, but not everyone does.
- RefSeq is done the same way, but it is limited: many new genes are not in RefSeq.
- Can we set up a HUGO for arthropods?
- Jeff: It seems unlikely that various communities will agree on gene naming conventions. We can (and should) create a standard for naming genes of organisms without large communities, but Drosophila researchers, for example, are unlikely to accept our naming conventions, and their system probably is not the best across arthropods. We should create a phylogeny-based reconstruction of paralogs and orthologs of all genomes sequenced, and create tables of synonyms and a web system that presents these seamlessly.
- Alistair: Gene names should be assigned by orthology.
- Jeff quoting Ross Overbeek: It's easier to understand 1000 genomes than it is to understand 1 genome.
- Xiaodong: There is not a very good way to assign orthologues in insects.
- Big challenges here.
- This is an area that needs development of programs that establish orthology.
- Jeff argues that we should use trees over synteny, since synteny alone can misidentify a paralog as an ortholog after one copy of a duplicated gene translocates.
- Alistair: we agree that orthology is important.
- What else should we talk about?
- Bastien working on methods that take into account both synteny and phylogeny.
- Alistair: Do sequencing algorithms consider gene families and orthologues?
- Bastien: Thresholds
- Jeff: We need to find ways of visualizing the presence and absence of all or portions of all biochemical pathways.
- Stefano: We got 11 species into Pathway Tools, but each one was a lot of work. If we had common formats we would be much better off.
- Standardization is key. Otherwise people won't spend the time.
- Can we say that our data is answering these questions but that, thanks to standards across I5K, the data can also be used to answer lots of other questions?
- Define outputs!
- Jeff has had a lot of trouble trying to get MAKER up and running. There would be a great advantage to making this much easier, either by setting up a service provider, making it easier to install and use, or funding Yandell to have greater computer resources so that whole eukaryotic genomes could be run there.
- Alistair: We reconfigured our compute farm to deal with version issues.
- Alistair: Are there any tool gaps that we see?
- New source of funding in UK to create informatics tools.
- Bastien: Quality control checking.
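As one illustration of Bastien's suggestion of a series of tests that data must pass, here is a minimal sketch of structural checks on GFF3 feature lines. The function names `check_gff3_line` and `check_gff3`, and the choice of which checks to run, are assumptions made for illustration and not an agreed i5K standard. The checks follow the general GFF3 column layout only: nine tab-separated columns, a 1-based start no greater than end, a strand of +, -, . or ?, and a phase for CDS features.

```python
VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2", "."}


def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]

    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    problems = []

    # Coordinates are 1-based integers with start <= end.
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")

    if strand not in VALID_STRANDS:
        problems.append(f"invalid strand {strand!r}")

    if phase not in VALID_PHASES:
        problems.append(f"invalid phase {phase!r}")
    elif ftype == "CDS" and phase == ".":
        problems.append("CDS features require a phase of 0, 1, or 2")

    return problems


def check_gff3(lines):
    """Map line number -> problems, skipping blank lines and '#' comment/directive lines."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        problems = check_gff3_line(line)
        if problems:
            report[number] = problems
    return report
```

A submission could be accepted only when `check_gff3` returns an empty report. Real checks would go further (for example, verifying that Parent attributes resolve), which this sketch does not attempt.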
(An aside:
- Sue wants a discussion page at wiki with list of people working on it, and a skeleton.
- See pages as a means to come up with the goals of the group.
- Does not want a fait accompli. Wants a structure that encourages discussion.
- Dan says each group should decide on MediaWiki main versus discussion page.
I don't think these raw notes will particularly consider discussion.)
- Alistair argues that we shouldn't worry about naming so much, just about providing good gene models and orthologues.
- Jeff really wants biochemical pathways: he judges that being able to compare them is essential, and notes that community annotation can add a richness of data.
- Jeff says that a person at IU is doing confidence scores for genome assemblies.
- Jeff wants to assign two confidence scores per gene: Structure (e.g., intron-exon boundaries and start codon) and orthology assignment.
- These scores could be used to triage effort.
- Sanjay: MAKER provides confidence scores. Could start there.
- Bioinformatics hotel.
- Can we get MAKER on a virtual image?
- Jeff: Can we argue that I5K will give us insight into how to do things like G10K, in the way that small genomes have helped us to interpret larger ones?
- Alistair suspects that informatics on 454 makes it competitive with Illumina, because of the informatics requirements.
- Jeff notes that there are very high error rates in homopolymers in 454 data and that it is much more expensive than Illumina per nucleotide. In whole eukaryotic genome assemblies, 454 is making only incremental improvements over an Illumina-only approach.
- Aaron notes that in assembling a transcriptome with 6 Gb of Illumina 54 bp single-end sequence and 21 Mb of 454 sequence (20 Mb of which was Titanium FLX reads), any assembled transcript that included one 454 read was ~10-30% longer than sequences assembled solely from Illumina sequence. Even a small amount of additional 454 sequence seems to help immensely in assembly.
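Returning to Jeff's proposal above of two confidence scores per gene (structure and orthology assignment) that could be used to triage effort, here is a minimal sketch of what that triage might look like. The 0-to-1 score scale, the 0.5 threshold, the `GeneScore` class, and the gene identifiers are all hypothetical; the notes do not say how the scores would be computed or scaled.

```python
from dataclasses import dataclass


@dataclass
class GeneScore:
    gene_id: str
    structure: float  # confidence in gene structure (assumed scale 0..1)
    orthology: float  # confidence in orthology assignment (assumed scale 0..1)


def triage(genes, threshold=0.5):
    """Return genes whose weaker score is below threshold, least confident first."""
    flagged = [g for g in genes if min(g.structure, g.orthology) < threshold]
    return sorted(flagged, key=lambda g: min(g.structure, g.orthology))


# Made-up example data.
genes = [
    GeneScore("gene_a", structure=0.9, orthology=0.8),
    GeneScore("gene_b", structure=0.4, orthology=0.9),
    GeneScore("gene_c", structure=0.7, orthology=0.1),
]
for g in triage(genes):
    print(g.gene_id, g.structure, g.orthology)
```

Keeping the two scores separate, rather than combining them into one number, lets curators see whether a gene needs attention for its model or for its orthology call.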