RODEO (Rapid ORF Description & Evaluation Online) is an algorithm to help biosynthetic gene cluster (BGC) analysis, with an emphasis on ribosomal natural product (RiPP) discovery.
The following sections illustrate basic features of RODEO. Details can be found in the manual. There are also a number of ways to utilize the RODEO output; for examples, see our advanced uses.
RODEO is a tool to analyze a local genomic region neighboring an input protein of interest and to predict possible encoded peptides (usually <100 aa) within that region (in our experience, a considerable majority of RiPP precursor peptides are not annotated in GenBank!).
Input to RODEO consists of an identifier (GenBank protein accession or GI) for a single protein (or a list of proteins):
The local genomic region is fetched:
Each protein within the region is compared against the Pfam pHMM database:
And potential precursor peptides are identified:
For lasso peptide clusters (our initial focus), peptides are also scored for likelihood using heuristic rules, and leader/core region cut sites are given:
Results are output in an HTML document containing the gene cluster as a SVG-format diagram, as well as a list of the proteins (with their Pfam matches) and identified peptides. RODEO will also output these data to CSV files for downstream analysis.
To run RODEO in its most straightforward form (Pfam-based analysis of a gene neighborhood consisting of 6 proteins upstream and downstream of an input protein), simply use the following command from the RODEO directory (containing rodeo_main.py)(here we use as input the YcaO-domain cyclodehydratase from thiostrepton biosynthesis in Streptomyces laurentii):
python rodeo_main.py ACN52311.1 -out thiostrepton
This will perform the analysis and save output files to a folder called thiostrepton.
RODEO includes the ability to predict precursor peptides of supported RiPP classes from gene clusters, including leader and core regions. The predicted exact mass of the resulting peptide is also given.
In this example, we analyze the lariatin gene cluster (which has a lasso cyclase protein with accession BAL72547.1) from Rhodococcus jostii.
python rodeo_main.py BAL72547.1 -out lariatin
This outputs to a folder named lariatin. Note that the core region, GSQLVYREWVGHSNVIKPGP, is given with its predicted mass.
Often it is important to analyze a large number of genes/gene clusters simultaneously rather than one-at-a-time. RODEO allows this by accepting an input file. This is simply a text file containing list of GenBank protein accession numbers, each on its own line, like so:
ACN52295.1 ACN80649.1 ABF54240.1 ABF54308.1 CAB15764.1 AGE65345.1 WP_018616399.1 SEN40259.1
Note that it is also possible to use .gb or .gbk files as input, these can be mixed in with accession keys on the input files as well.
When using a list file as input, it must have the .txt extension, this is how RODEO recognizes that the input is a list. When using GenBank files, the extension must be .gb or .gbk . If we saved the above list as input.txt, we'd run it like this if we want lasso peptide analysis:
python rodeo_main.py input.txt -out output -pt lasso
The -pt lasso section declares that we will be doing some sort of RiPP analysis (-pt) and that the type will be lasso. If we wanted another type of supported RiPP, such as thiopeptides, we would use -pt thio .
In theory, you can make the input list as long as you want. We've done 1000 or more accessions in one run. However, it's often a better idea to split your input up into smaller batches (the HTML files can get large). We prefer to do groups of ~500 or fewer accessions/run. When using the webtool, input lists are limited to 1000 accessions and jobs are stopped after 12 hours. We therefore suggest that users with large input queries download RODEO to use on their own machines.
If you are familiar with shell scripts, you can make several input lists in one folder and write a script to automate running one after another sequentially.
In GenBank, protein records with identical sequences are linked to several different corresponding nucleotide records; often this is due to the presence of multiple nearly-identical sequenced genomes of a single organism. Frequently, these show labeled as "MULTISPECIES" in GenBank. By default, RODEO only analyzes one of the instances if this is the case. If we want them all to be analyzed, we can use the --evaluate_all flag:
python rodeo_main.py WP_041189098 --evaluate_all
Analyzing all redundant instances of a protein is usually not necessary and will often substantially increase runtime. It can be helpful if the protein belongs to multiple organisms, or if some of the nucleotide records are from small contigs or are poorly annotated.
How do you even get a list of proteins to analyze with RODEO? Typically, RODEO will be used to analyze the local genomic region of closely related proteins or to profile a large protein family. The following are a few common ways to get suitable input lists:
If you have a known protein of interest, you can use a blastp search to generate a list of accession numbers in order to analyze a list of related proteins.
This is most easily accomplished by navigating to the BLAST website and selecting "Protein BLAST." Perform a BLAST search against the nr database.
Here we are using BAL72547.1, the accession number for the lasso cyclase from lariatin biosynthesis.
Select from the BLAST results any desired hits—often "all". Then select "GenPept".
On the resulting summary section, you can either change the summary format to "Accession List" (top left), or directly export the results to a text file using "Send To", "File", and "Accession List" (top right).
The resulting file is a list of accession numbers, each on its own line. This will work as RODEO input as long as the file has the extension .txt.
We can retrieve a collection of proteins from InterPro/Pfam. Start from the InterPro website and enter a Pfam identifier or InterPro ID into the search bar. Here we'll use the Pfam PF13575.
On the search results page, select "Proteins matched". There's nearly 900 in this example.
Export to a tab-delimited spreadsheet via "Export table TSV" on the upper-right of the results page.
This will give a table of proteins. However, it doesn't contain GenBank accessions or GIs. We can convert the given identifiers to their GenBank equivalents, however, using a mapping tool (specifically, the UniProt "Retrieve ID/mapping" tool). First, open the TSV table in Excel and select the identifiers of any or all proteins of interest. Here, it's the UniProtKB accessions in column A. Copy them.
Navigate to the UniProt mapping utility. Paste in the identifier list you just copied. In the "From" field, indicate that you're supplying UniProtKB AC/ID format identifiers. In the "To" field, select either "GI Number" if you want GIs, or "EMBL/GenBank/DDBJ CDS" if you want a GenBank-compatible accession.
Not all identifiers will map perfectly, but usually most will.
The resulting list can be downloaded and saved to a text-based input file, as usual. In this case, we now have just about every known PF13575 protein. They can be analyzed in RODEO.
Note also that you can supply this list into other analysis tools, such as a multiple sequence alignment server (for eventual phylogenetic tree generation and annotation) or into the EFI-EST to generate a sequence similarity network. In this way you can ensure your RODEO analysis and phylogenetic analysis have excellent overlap.
The results from multiple BLAST searches can be combined into a single input file. This can be useful if you want to use a collection of proteins as the input.
In this case, you might want to sort input by the highest similarity score to any of the BLAST input files. This can be accomplished using Excel, for instance.
Start by listing multiple proteins as BLAST queries using blastp. In this example, we'll use three representative lasso cyclase proteins from disparate taxa: BAL72547.1 (lariatin, from the Actinobacteria), ADU13886.1 (astexin, from the Proteobacteria), and WP_053324866.1 (an uncharacterized lasso peptide from the Firmicutes).
At the BLAST hit page, select "Download" and then specify the file type "Hit table (csv)". Save this file and open it in Excel.
This file format contains expectation values (E-values) in column L and bit scores in column M. Here we'll sort by bit score.
After sorting by column M, let's extract GI numbers from column A by entering the following formula in cell N1, and then copying and pasting the cell into the entirety of column M:
=MID(B1,4,FIND("|",B1,4)-4)
If any proteins were BLAST hit result to more than one input protein, we'll have duplicates in this list. Fortunately, we can remove these using the "Remove Duplicates" option in Excel. Select your data, then select "Remove Duplicates" in the "Data" tab. Make sure to check only the box for Column N (the newly-extracted GI numbers).
Copy and paste all the remaining values in column N into a blank text document—this is a list of homologues to any of the three input proteins, sorted by highest bit score to any of them.
Note: You can convert these GI numbers to accession numbers if desired using a utility such as Batch Entrez (selecting "Accession List" as a file output option). Although it's beyond the scope of the present tutorial, the accession can also be extracted from column B of your hit table CSV file using slightly more sophisticated Excel operations.
RODEO can utilize pHMMs beyond the default Pfam database. This example shows how to use add one pHMM from the TIGRFAM database.
You can create an HMM file from a multiple sequence alignment yourself using HMMER (not covered in this tutorial), or use a previously created HMM. Here we download a "YcaO domain protein" HMM from TIGRFAM. From within the hmm folder:
wget ftp://ftp.jcvi.org/pub/data/TIGRFAMs/14.0_Release/TIGRFAMs_14.0_HMM.tar.gz tar -zxvf TIGRFAMs_14.0_HMM.tar.gz TIGR03549.HMM rm TIGRFAMs_14.0_HMM.tar.gz
This should extract the HMM corresponding to TIGRFAM number 03549 leaving the following alignment file: TIGR03549.HMM.
Use the hmmpress feature of HMMER3 to convert the HMM file to the proper format.
hmmpress TIGR03549.HMM
Your folder should now contain the following 5 files:
Make sure all of the above are in your hmm folder (or whichever folder you want to put HMMs in). In this case, the five pressed files are located in a folder called tigrycao.
You can run RODEO now and specify the folder containing the above HMM files:
python rodeo_main.py AHY69522 -hmm tigrycao/*.hmm
The resulting HTML output will include the above TIGRFAM, as shown:
You can put as many processed HMMs in one directory as you want. Or, you can combine many HMMs into one file and use them as a single database.
Building an HMM from a multiple sequence alignment is covered in the HMMER User's Guide. A multiple sequence alignment can be generated from many different web-based tools, including MUSCLE and Clustal. Most of these tools can also be installed on your own workstation.
Increased understanding of protein function can be attained by scanning gene clusters with multiple pHMM databases. In addition to the default Pfam database, one extensive pHMM collection is the TIGRFAM database.
Here's how to use the entire TIGRFAM database within RODEO:
Go to the TIGRFAM FTP listing and download the archive as a compressed .gz archive. Then unpack it, and delete the original archive if you want. To do this directly from the command line (for the version 14.0 release of TIGRFAM), type:
wget ftp://ftp.jcvi.org/pub/data/TIGRFAMs/14.0_Release/TIGRFAMs_14.0_HMM.tar.gz tar -zxvf TIGRFAMs_14.0_HMM.tar.gz rm TIGRFAMs_14.0_HMM.tar.gz
This should give you a large number of files (about 4424 of them).
The previous step will give you a large number of individual HMM files that we need to merge into one. From the directory, you can use the cat command, for instance, which will give you a single large (~650 MB) combined file:
cat * > TIGRFAMs.hmm
Now we prepare these for use with HMMER3 using hmmpress:
hmmpress TIGRFAMs.hmm
This gives you several files:
After hmmpress is finished, we can remove the original TIGRFAM files to free up a little space with the following command:
rm TIGR0*
Move the remaining pressed files (.hmm/.h3m/.h3i/.h3f/.h3p) into your RODEO hmm directory or whichever directory you've decided to point RODEO to.
You can run RODEO now with this additional database using the -p flag (this assumes the files are in the tigrfam folder):
python rodeo_main.py BAL72547.1 -out lariatin_tigrfam -hmm tigrfam/*.hmm
The resulting HTML output will include the TIGRFAM results, as shown:
Note that this works with any collection of HMMER-generated HMM files; they can be combined into one or run individually. The above process saves you from having to hmmpress 4424 files one-by-one.
Thanks to Brandon J. Burkhart (University of Illinois) for pointing this out!