What can I do to find out more about my protein?
The first thing to do is to compare your sequence against the public sequence databases and see whether there is information about it or about related (homologous) sequences. 'Homologous' means that two sequences are evolutionarily related; homologous sequences generally share domain structure and function. In other words, start with a homology search. This is often done with the program BLAST.
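To make the idea of a homology search more concrete, here is a minimal sketch of local alignment scoring in Python. It is not BLAST itself: BLAST is a fast heuristic, whereas this is the plain Smith-Waterman dynamic programming recurrence that BLAST approximates. The scoring values (match, mismatch, gap) are simplifying assumptions chosen for illustration; real protein searches use a substitution matrix such as BLOSUM62 and separate gap-open and gap-extension penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Uses a toy scoring scheme (assumed values, not BLAST defaults)
    with a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Rank a few made-up "database" sequences against a made-up query.
query = "ACDEFG"
database = {"seq1": "ACDFG", "seq2": "WWWWWW", "seq3": "XXACDEFGYY"}
ranked = sorted(database,
                key=lambda name: local_alignment_score(query, database[name]),
                reverse=True)
print(ranked)
```

A higher score means a better local match. A real search additionally reports a statistical significance (E-value) for each hit, which this sketch does not attempt.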
If you have two similarly sized proteins whose alignment shows a large proportion of similar residues and few gaps, inferring homology is easy. But what if you don't?
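The two quantities mentioned here, the proportion of matching residues and the number of gaps, can be read directly off a pairwise alignment. The sketch below assumes the two sequences are already aligned (equal length, with '-' as the gap character) and counts identical residues only. Counting similar residues, as described above, would additionally require a substitution matrix, which is left out to keep the example short.

```python
def alignment_stats(aligned_a, aligned_b, gap_char="-"):
    """Return percent identity and gap fraction for two aligned sequences.

    identity     = identical columns / columns without a gap
    gap_fraction = columns containing a gap / all columns
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    identical = 0
    gapped = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap_char or y == gap_char:
            gapped += 1
        elif x == y:
            identical += 1

    columns = len(aligned_a)
    ungapped = columns - gapped
    identity = identical / ungapped if ungapped else 0.0
    return {"identity": identity, "gap_fraction": gapped / columns}


# Made-up alignment: four identical columns, one mismatch, one gap.
stats = alignment_stats("ACDE-G", "ACDFFG")
print(stats)
```

High identity combined with a low gap fraction corresponds to the easy case; low identity or many gaps is the situation where additional evidence is needed.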
The next thing to do is to look for conserved domains in your sequence. If two proteins share the same domain structure and also align reasonably well, they are likely to be homologous.
Domain structure is also important to consider if you want to do a multiple alignment. It does not make sense to align proteins that do not share the same domain structure.
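One simple way to apply this before building a multiple alignment is to group proteins by their ordered list of domains and align only within a group. The sketch below assumes the domain annotations have already been obtained from a domain search; the protein names are made up, and the domain lists are illustrative rather than taken from any database.

```python
def group_by_architecture(proteins):
    """Group protein names by their ordered domain architecture.

    `proteins` maps a protein name to its domains listed from
    N-terminus to C-terminus. Returns {architecture tuple: [names]}.
    """
    groups = {}
    for name, domains in proteins.items():
        groups.setdefault(tuple(domains), []).append(name)
    return groups


# Hypothetical annotations for illustration only.
proteins = {
    "protA": ["SH3", "SH2", "kinase"],
    "protB": ["SH3", "SH2", "kinase"],
    "protC": ["kinase"],
    "protD": ["SH2", "SH3", "kinase"],  # same domains, different order
}

for architecture, names in group_by_architecture(proteins).items():
    print(" - ".join(architecture), "->", names)
```

In this example only protA and protB would go into the same multiple alignment. Treating domain order as significant is a design choice of the sketch; a looser comparison could ignore order or allow repeated domains.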
The Gene Ontology database contains information about function and cellular localization of proteins.
In general, bioinformatics databases and tools on the internet are maintained by research groups and are frequently updated and further developed. The priority is on data quality and reliability (and not on layout and text formatting...). But this is where you find the newest data on your protein!
Other Databases & Tools
- More than 300 Pathway/Genome databases can be found at the BioCyc homepage.
- To get an overview of ongoing and finished genome projects, go to the NCBI genome page. For instance, get a list of eukaryotic genome projects by clicking on 'Eukaryotic' on the right side.
- The Human Genome Project is finished now, and other mammalian genomes, including mouse and chimpanzee, are in progress.
- The chicken genome is nearly finished; see the NCBI homepage.
- Plant genome sequences include rice and the weed Arabidopsis thaliana.
- Several fish genomes are being sequenced, including the Japanese pufferfish (a very compact genome, several-fold smaller than the human genome) and the zebrafish.
- Microbial genome databases include SubtiList and EcoCyc. At genolist.pasteur.fr you can find links to other microbial genome databases.
- Tools for prediction of different structural features (signal peptides, transmembrane helices, glycosylation sites etc.) are available at the Center for Biological Sequence Analysis, DTU.
References
- Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ. Multiple sequence alignment with Clustal X. Trends Biochem Sci. 1998 23(10):403-405.
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003 31(13):3497-3500.
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz H-R, Ceric G, Forslund K, Eddy SR, Sonnhammer ELL, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2008 36(Database issue):D281-D288.
- The Gene Ontology Consortium. Creating the Gene Ontology resource: design and implementation. Genome Res. 2001 11(8):1425-1433.
- Geer LY, Domrachev M, Lipman DJ, Bryant SH. CDART: Protein homology by domain architecture. Genome Res. 2002 12:1619-1623.