
Monday, June 20, 2011

Using O104:H4 EHEC data... an example

I've had a few requests for an example of how to work with the new EHEC data. I agree it can be overwhelming to face hundreds of megabytes of genomic data, so here is a fairly simple example of what one might do and what you might encounter. Suppose you had a drug (antibody, peptide, small molecule) and you knew it hit a protein called EprK. EprK is an approximately 250 amino acid protein that is part of the Type III Secretion System (T3SS). The T3SS is a cell-surface protein complex that pathogenic bacteria use to engage host cells and deliver their own proteins into them. Blocking proteins like EprK is one possible way to prevent EHEC pathogens from attacking normal cells and causing disease. Your drug works on other EHEC strains (such as O157:H7, the strain responsible for the 2006 outbreak in the US), but will it work on O104:H4? Testing it directly is the best way to know, but obtaining the new strain is likely to be very difficult. Another option is to go to the sequence data.

I went to one of the sites that hosts the new O104:H4 sequence information, which is based on 'crowdsourcing' from various labs (I used the oh no sequences blog, the blog for the R&D section of era7 bioinformatics), and found the identifier code for the EprK protein (here's the link). Some of the data has been annotated based on sequence homology, and EprK is one of the proteins that has been identified. Using this code, I found the DNA sequence and copied it to the clipboard. Then I went to the NCBI website (link) and pasted the DNA sequence into the search box to run a BLAST search against all microbial genomes that have been sequenced. There were dozens of hits, and nearly all of them were EprK sequences from various strains. I found the O157:H7 strain, and the alignment is impressive: more than 95% of the DNA bases are identical between the two, suggesting that the two proteins are very similar. I've included the BLAST results below, with O104:H4 EprK as the Query (top strand) and its alignment with O157:H7 EprK as the Sbjct (bottom strand). So, your drug probably works on the new strain too.

If you want the amino acid sequence of the O104:H4 protein, simply take the DNA sequence to ExPASy (link) and translate it. It actually took me a bit to get the protein sequence because there is an apparent frameshift in the O104:H4 sequence read. If you scroll down to my alignment and find the part highlighted in red (the '-' gap in the Sbjct row of the block that starts at Query position 481), you will see there is an extra adenine (an 'A' base) in the O104 sequence. This throws off the protein translation. I assume it is a misread in the O104 sequence (a common error when the sequencing machine reads through a run of the same base), so I deleted it when I translated from DNA to protein. The resulting amino acid sequence (pasted below) is very similar to EprK from other EHEC strains. I'll double check this and follow up with them.
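For anyone who would rather script the frameshift correction than edit the sequence by hand, here is a minimal Python sketch. It rests on the same assumption made above, namely that the extra 'A' is a sequencing error and that the O157:H7 sequence is the one to trust at that position. The function name `drop_query_insertions` is made up for this illustration, and the two rows are copied from the alignment block that starts at Query position 481.

```python
def drop_query_insertions(query_row: str, sbjct_row: str) -> str:
    """Return the query bases, skipping any base that sits opposite a '-' gap
    in the subject row (i.e. a base the query has and the subject lacks)."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        q for q, s in zip(query_row, sbjct_row)
        if s != "-" and q != "-"   # also skip gap characters in the query itself
    )


# Aligned rows copied from the BLAST output (Query 481 / Sbjct 442923).
query_row = "GCTGCTCAATAGCAGAATATAGCCTTGCTTTTTCCGCTCGTGGAGATGAAAACGAGTGCA"
sbjct_row = "GCTGCTCAATAGCAGAATATAGCCTTGCTTTTTCCGCTCGTGGAGATGAAA-CGAGTGCA"

fixed = drop_query_insertions(query_row, sbjct_row)
print(len(query_row), len(fixed))  # 60 59
```

This only makes sense when you have a good reason to believe the reference over the new read. If the insertion were real, dropping it would hide a genuine difference between the strains.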

Anyhow, I don't think there is a structure for the EprK protein, but if there were, you could use it as a template and make the amino acid changes seen in the O104:H4 strain to get a decent starting point for the structure-based design of new drugs.

Find a pathogenic protein of interest and try this yourself... it's not too hard. When the topic of EHEC comes up at the next party, you can impress your friends by saying you blasted several virulence factors and found them to be quite similar to (or different from) those of strains from previous outbreaks. I would do this myself but, oddly enough, I don't get invited to parties anymore. Anyhow, as a final disclaimer... although I have tried to be careful, please verify anything I have posted before use.



Query  1       GTTGAGGATGAATATAACTAATTGGATCATATATAATCTTTCTTAGGGCAAGATTCATAA 
               |||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||
Sbjct  443403  GTTGAGGATGAATATAACTAATTGGAGCATATATAATCTTTCTTAGGGCAAGATTCATAA  443344

Query  61      CGCTCTCATATGTCTACTTAATTTTCAACCTGACTAAATTAGTTAGAATGGCCCTATACT 
               || |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  443343  CGTTCTCATATGTCTACTTAATTTTCAACCTGACTAAATTAGTTAGAATGGCCCTATACT  443284

Query  121     TCCATAACAGCCAGCAAGTCGCTACGGATATTAATGCAAGTAAGATAGAAACCGGCATAG 
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  443283  TCCATAACAGCCAGCAAGTCGCTACGGATATTAATGCAAGTAAGATAGAAACCGGCATAG  443224

Query  181     CCTTATCATAAGCAAAAACAGGTTCGCTAATTTCATATGTTGGTGCTTGCTCAATAATGT 
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  443223  CCTTATCATAAGCAAAAACAGGTTCGCTAATTTCATATGTTGGTGCTTGCTCAATAATGT  443164

Query  241     CTCTTCGTTTTGACAATACAACAGAAATATTTTCATATTGTACGCTTGCAGAGCTATTAA 
               ||||||||||||||||||||||||||||||| |||||||||||||||||| | |||||||
Sbjct  443163  CTCTTCGTTTTGACAATACAACAGAAATATTCTCATATTGTACGCTTGCAAAACTATTAA  443104

Query  301     CAATAAATCTCTTGATATCATTTATTTTTATTTCTGGGTTGATATCTTTTTCATATACTG 
               ||||||||||||| || |||||||||||||||||||| ||||||||||||||||||||||
Sbjct  443103  CAATAAATCTCTTTATGTCATTTATTTTTATTTCTGGATTGATATCTTTTTCATATACTG  443044

Query  361     CAAGTACAGAAATATGAATTGGTAAAGCAGTTTTACCACTATCGCCATTATCAACATCGT 
               ||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||
Sbjct  443043  CAAGTACAGAAATATGAATTGGTAAAGCAGTTTTACCACTATCGCCAGTATCAACATCGT  442984

Query  421     AACTAACATGTACTCTCGAAGAAATAATGCCATCCATAATTTTGAGAGATTGCTCTAACC 
               |||||||||||||||||||||||| ||| |||||||||||||||||||||||||||||||
Sbjct  442983  AACTAACATGTACTCTCGAAGAAACAATACCATCCATAATTTTGAGAGATTGCTCTAACC  442924

Query  481     GCTGCTCAATAGCAGAATATAGCCTTGCTTTTTCCGCTCGTGGAGATGAAAACGAGTGCA 
               ||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||
Sbjct  442923  GCTGCTCAATAGCAGAATATAGCCTTGCTTTTTCCGCTCGTGGAGATGAAA-CGAGTGCA  442865

Query  541     TCTGCAGGGAACATCTGCGATATTTGAATATCAGGCTTACCCGGTAGATTGTAGATTTTT 
               |||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||
Sbjct  442864  TCTGCAGGGAACATCTGCGATATTTGAATATCAGGCTTACCCGGGAGATTGTAGATTTTT  442805

Query  601     AGCCAATCCACCGCAGAAGCAAAATCCGTTGGTTCGACAAATATTGAAAATCCTGTTTTG 
               ||||||||||||||||||||||||||||||||||| ||| | ||||| ||||| ||||||
Sbjct  442804  AGCCAATCCACCGCAGAAGCAAAATCCGTTGGTTCAACATAGATTGAGAATCCAGTTTTG  442745

Query  661     CCTTGATCCTTCTTTTCAGCATTAATATTATGTCTTTGTAAAACAGCAAGGACATCATTA 
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  442744  CCTTGATCCTTCTTTTCAGCATTAATATTATGTCTTTGTAAAACAGCAAGGACATCATTA  442685

Query  721     GCTTGCTGTTGATCAAGATGGTTCAATAATTCCTGCTGCTTGCAGCCGCACAACAGCAGG 
               ||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||
Sbjct  442684  GCTTGCTGTTGATCAAGATGGTTCAGTAATTCCTGCTGCTTGCAGCCGCACAACAGCAGG  442625

Query  781     ATAAACAATAATA  793
               |||||||||||||
Sbjct  442624  ATAAACAATAATA  442612
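
The "more than 95%" figure can be checked directly from the aligned rows. The sketch below is a rough illustration in plain Python, not what BLAST itself runs: it counts matches, mismatches and gaps column by column. The function name `alignment_stats` is made up here, and the example rows are the block starting at Query position 601, which is the most divergent stretch in the alignment above.

```python
def alignment_stats(query_row: str, sbjct_row: str) -> tuple[int, int, int]:
    """Count (matches, mismatches, gaps) over the columns of two aligned rows."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for q, s in zip(query_row, sbjct_row):
        if q == "-" or s == "-":
            gaps += 1          # a gap in either row is not an identity
        elif q == s:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# Aligned rows copied from the BLAST output (Query 601 / Sbjct 442804).
query_row = "AGCCAATCCACCGCAGAAGCAAAATCCGTTGGTTCGACAAATATTGAAAATCCTGTTTTG"
sbjct_row = "AGCCAATCCACCGCAGAAGCAAAATCCGTTGGTTCAACATAGATTGAGAATCCAGTTTTG"

matches, mismatches, gaps = alignment_stats(query_row, sbjct_row)
print(matches, mismatches, gaps)  # 55 5 0
print(f"{100 * matches / len(query_row):.1f}% identical in this block")
```

Even this worst block is above 90% identical. Counting the match line across all rows by hand, I get 774 identical columns out of 793 (18 mismatches plus the one gap), or roughly 97.6%, which is consistent with the claim in the text. Please recount before relying on that number.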


Predicted amino acid sequence for the O104:H4 EprK protein (corrected for the extra base at the gap):

L L F I L L L C G C K Q Q E L L N H L D Q Q Q A N D V L A V L Q R H N I N A E K K D Q G K T G F S I F V E P T D F A S A V D W L K I Y N L P G K P D I Q I S Q M F P A D A L V S S P R A E K A R L Y S A I E Q R L E Q S L K I M D G I I S S R V H V S Y D V D N G D S G K T A L P I H I S V L A V Y E K D I N P E I K I N D I K R F I V N S S A S V Q Y E N I S V V L S K R R D I I E Q A P T Y E I S E P V F A Y D K A M P V S I L L A L I S V A T C W L L W K Y R A I L T N L V R L K I K
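
If you would like to check a translation like this without a web tool, the following is a minimal sketch in plain Python using the standard genetic code. It is an illustration, not a reimplementation of the ExPASy tool. One thing I inferred rather than took from the original analysis: the protein above appears to read off the reverse complement of the Query, in the third reading frame (an offset of 2 bases). The helper names are made up for this example, and the test input is just the last 73 bases of the Query (positions 721 to 793).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str, offset: int = 0) -> str:
    """Translate complete codons starting at `offset`; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Last 73 bases of the Query (rows starting at positions 721 and 781).
query_tail = (
    "GCTTGCTGTTGATCAAGATGGTTCAATAATTCCTGCTGCTTGCAGCCGCACAACAGCAGG"
    "ATAAACAATAATA"
)

print(translate(reverse_complement(query_tail), offset=2))
# LLFILLLCGCKQQELLNHLDQQQ
```

The output matches the first 23 residues of the sequence posted above. To reproduce the whole protein you would paste in the full Query and remove the extra 'A' first; otherwise everything downstream of that base is read in the wrong frame.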
