RODEO

RODEO (Rapid ORF Description & Evaluation Online) is an algorithm to help biosynthetic gene cluster (BGC) analysis, with an emphasis on ribosomal natural product (RiPP) discovery.

RODEO diagram

The following sections illustrate basic features of RODEO. Details can be found in the manual. There are also a number of ways to utilize the RODEO output; for examples, see our advanced uses.

First: What does RODEO do?

RODEO is a tool to analyze a local genomic region neighboring an input protein of interest and to predict possible encoded peptides (usually <100 aa) within that region (in our experience, a considerable majority of RiPP precursor peptides are not annotated in GenBank!).

Input to RODEO consists of an identifier (GenBank protein accession or GI) for a single protein (or a list of proteins):

A list of identifiers

The local genomic region is fetched:

The local genomic region.

Each protein within the region is compared against the Pfam pHMM database:

Pfam analysis of local genomic region.

And potential precursor peptides are identified:

Identified peptides.

For lasso peptide clusters (our initial focus), peptides are also scored for likelihood using heuristic rules, and leader/core region cut sites are given:

Scoring for lasso precursors.

Results are output in an HTML document containing the gene cluster as a SVG-format diagram, as well as a list of the proteins (with their Pfam matches) and identified peptides. RODEO will also output these data to CSV files for downstream analysis.

Output is HTML and CSV

Back to top

Single gene cluster analysis

To run RODEO in its most straightforward form (Pfam-based analysis of a gene neighborhood consisting of 6 proteins upstream and downstream of an input protein), simply use the following command from the RODEO directory (containing rodeo_main.py)(here we use as input the YcaO-domain cyclodehydratase from thiostrepton biosynthesis in Streptomyces laurentii):

python rodeo_main.py ACN52311.1 -out thiostrepton 

This will perform the analysis and save output files to a folder called thiostrepton.

Analysis with RiPP precursor prediction

RODEO includes the ability to predict precursor peptides of supported RiPP classes from gene clusters, including leader and core regions. The predicted exact mass of the resulting peptide is also given.

In this example, we analyze the lariatin gene cluster (which has a lasso cyclase protein with accession BAL72547.1) from Rhodococcus jostii.

 python rodeo_main.py BAL72547.1 -out lariatin

This outputs to a folder named lariatin. Note that the core region, GSQLVYREWVGHSNVIKPGP, is given with its predicted mass.

RODEOlariatin.

Back to top

Analyzing a list of genes

Often it is important to analyze a large number of genes/gene clusters simultaneously rather than one-at-a-time. RODEO allows this by accepting an input file. This is simply a text file containing list of GenBank protein accession numbers, each on its own line, like so:

ACN52295.1
ACN80649.1
ABF54240.1
ABF54308.1
CAB15764.1
AGE65345.1
WP_018616399.1
SEN40259.1

Note that it is also possible to use .gb or .gbk files as input, these can be mixed in with accession keys on the input files as well.

When using a list file as input, it must have the .txt extension, this is how RODEO recognizes that the input is a list. When using GenBank files, the extension must be .gb or .gbk . If we saved the above list as input.txt, we'd run it like this if we want lasso peptide analysis:

python rodeo_main.py input.txt -out output -pt lasso

The -pt lasso section declares that we will be doing some sort of RiPP analysis (-pt) and that the type will be lasso. If we wanted another type of supported RiPP, such as thiopeptides, we would use -pt thio .

In theory, you can make the input list as long as you want. We've done 1000 or more accessions in one run. However, it's often a better idea to split your input up into smaller batches (the HTML files can get large). We prefer to do groups of ~500 or fewer accessions/run. When using the webtool, input lists are limited to 1000 accessions and jobs are stopped after 12 hours. We therefore suggest that users with large input queries download RODEO to use on their own machines.

If you are familiar with shell scripts, you can make several input lists in one folder and write a script to automate running one after another sequentially.

Back to top

Redundant proteins

In GenBank, protein records with identical sequences are linked to several different corresponding nucleotide records; often this is due to the presence of multiple nearly-identical sequenced genomes of a single organism. Frequently, these show labeled as "MULTISPECIES" in GenBank. By default, RODEO only analyzes one of the instances if this is the case. If we want them all to be analyzed, we can use the --evaluate_all flag:

python rodeo_main.py WP_041189098 --evaluate_all

Considerations

Analyzing all redundant instances of a protein is usually not necessary and will often substantially increase runtime. It can be helpful if the protein belongs to multiple organisms, or if some of the nucleotide records are from small contigs or are poorly annotated.

Back to top

Ways to generate input

How do you even get a list of proteins to analyze with RODEO? Typically, RODEO will be used to analyze the local genomic region of closely related proteins or to profile a large protein family. The following are a few common ways to get suitable input lists:

A. From a BLAST search

If you have a known protein of interest, you can use a blastp search to generate a list of accession numbers in order to analyze a list of related proteins.

This is most easily accomplished by navigating to the BLAST website and selecting "Protein BLAST." Perform a BLAST search against the nr database.

Here we are using BAL72547.1, the accession number for the lasso cyclase from lariatin biosynthesis.

Input accession number into blastp

Select from the BLAST results any desired hits—often "all". Then select "GenPept".

The list of BLAST results.

On the resulting summary section, you can either change the summary format to "Accession List" (top left), or directly export the results to a text file using "Send To", "File", and "Accession List" (top right).

The list of accessions can be sent directly to a text file.

The resulting file is a list of accession numbers, each on its own line. This will work as RODEO input as long as the file has the extension .txt.

A text file containing a list of accession numbers is output.

Back to top

B. From Pfam/InterPro

We can retrieve a collection of proteins from InterPro/Pfam. Start from the InterPro website and enter a Pfam identifier or InterPro ID into the search bar. Here we'll use the Pfam PF13575.

Enter a Pfam or InterPro ID.

On the search results page, select "Proteins matched". There's nearly 900 in this example.

Selected proteins matched.

Export to a tab-delimited spreadsheet via "Export table TSV" on the upper-right of the results page.

From the results page, export to a TSV file.

This will give a table of proteins. However, it doesn't contain GenBank accessions or GIs. We can convert the given identifiers to their GenBank equivalents, however, using a mapping tool (specifically, the UniProt "Retrieve ID/mapping" tool). First, open the TSV table in Excel and select the identifiers of any or all proteins of interest. Here, it's the UniProtKB accessions in column A. Copy them.

The desired identifiers are in column A.

Navigate to the UniProt mapping utility. Paste in the identifier list you just copied. In the "From" field, indicate that you're supplying UniProtKB AC/ID format identifiers. In the "To" field, select either "GI Number" if you want GIs, or "EMBL/GenBank/DDBJ CDS" if you want a GenBank-compatible accession.

The mapping utility allows accession number format conversion.

Not all identifiers will map perfectly, but usually most will.

The databases have less than perfect harmony.

The resulting list can be downloaded and saved to a text-based input file, as usual. In this case, we now have just about every known PF13575 protein. They can be analyzed in RODEO.

Back to top

C. From multiple BLAST searches

The results from multiple BLAST searches can be combined into a single input file. This can be useful if you want to use a collection of proteins as the input.

In this case, you might want to sort input by the highest similarity score to any of the BLAST input files. This can be accomplished using Excel, for instance.

Start by listing multiple proteins as BLAST queries using blastp. In this example, we'll use three representative lasso cyclase proteins from disparate taxa: BAL72547.1 (lariatin, from the Actinobacteria), ADU13886.1 (astexin, from the Proteobacteria), and WP_053324866.1 (an uncharacterized lasso peptide from the Firmicutes).

Multiple proteins are input into blastp.

At the BLAST hit page, select "Download" and then specify the file type "Hit table (csv)". Save this file and open it in Excel.

A text file containing a list of accession numbers is output.

This file format contains expectation values (E-values) in column L and bit scores in column M. Here we'll sort by bit score.

A text file containing a list of accession numbers is output.

After sorting by column M, let's extract GI numbers from column A by entering the following formula in cell N1, and then copying and pasting the cell into the entirety of column M:

=MID(B1,4,FIND("|",B1,4)-4)

If any proteins were BLAST hit result to more than one input protein, we'll have duplicates in this list. Fortunately, we can remove these using the "Remove Duplicates" option in Excel. Select your data, then select "Remove Duplicates" in the "Data" tab. Make sure to check only the box for Column N (the newly-extracted GI numbers).

Remove duplicate values.

Copy and paste all the remaining values in column N into a blank text document—this is a list of homologues to any of the three input proteins, sorted by highest bit score to any of them.

Back to top

Using supplemental pHMMs

RODEO can utilize pHMMs beyond the default Pfam database. This example shows how to use add one pHMM from the TIGRFAM database.

1. Get an HMM file

You can create an HMM file from a multiple sequence alignment yourself using HMMER (not covered in this tutorial), or use a previously created HMM. Here we download a "YcaO domain protein" HMM from TIGRFAM. From within the hmm folder:

wget ftp://ftp.jcvi.org/pub/data/TIGRFAMs/14.0_Release/TIGRFAMs_14.0_HMM.tar.gz
tar -zxvf TIGRFAMs_14.0_HMM.tar.gz TIGR03549.HMM
rm TIGRFAMs_14.0_HMM.tar.gz

This should extract the HMM corresponding to TIGRFAM number 03549 leaving the following alignment file: TIGR03549.HMM.

2. Press the HMM into the required database format

Use the hmmpress feature of HMMER3 to convert the HMM file to the proper format.

hmmpress TIGR03549.HMM

Your folder should now contain the following 5 files:

  • TIGR03549.HMM
  • TIGR03549.HMM.h3m
  • TIGR03549.HMM.h3i
  • TIGR03549.HMM.h3f
  • TIGR03549.HMM.h3p

Make sure all of the above are in your hmm folder (or whichever folder you want to put HMMs in). In this case, the five pressed files are located in a folder called tigrycao.

3. Point RODEO to the HMM file containing your HMM (you can point it to multiple HMMs as well)

You can run RODEO now and specify the folder containing the above HMM files:

python rodeo_main.py AHY69522 -hmm tigrycao/*.hmm

The resulting HTML output will include the above TIGRFAM, as shown:

ycao

4. Other considerations

You can put as many processed HMMs in one directory as you want. Or, you can combine many HMMs into one file and use them as a single database.

Building an HMM from a multiple sequence alignment is covered in the HMMER User's Guide. A multiple sequence alignment can be generated from many different web-based tools, including MUSCLE and Clustal. Most of these tools can also be installed on your own workstation.

Back to top

Using the TIGRFAM database

Increased understanding of protein function can be attained by scanning gene clusters with multiple pHMM databases. In addition to the default Pfam database, one extensive pHMM collection is the TIGRFAM database.

Here's how to use the entire TIGRFAM database within RODEO:

1. Download the database

Go to the TIGRFAM FTP listing and download the archive as a compressed .gz archive. Then unpack it, and delete the original archive if you want. To do this directly from the command line (for the version 14.0 release of TIGRFAM), type:

wget ftp://ftp.jcvi.org/pub/data/TIGRFAMs/14.0_Release/TIGRFAMs_14.0_HMM.tar.gz
tar -zxvf TIGRFAMs_14.0_HMM.tar.gz
rm TIGRFAMs_14.0_HMM.tar.gz

This should give you a large number of files (about 4424 of them).

2. Combine the files and press them into the required database format

The previous step will give you a large number of individual HMM files that we need to merge into one. From the directory, you can use the cat command, for instance, which will give you a single large (~650 MB) combined file:

cat * > TIGRFAMs.hmm

Now we prepare these for use with HMMER3 using hmmpress:

hmmpress TIGRFAMs.hmm

This gives you several files:

  • TIGRFAMs.hmm (~650 MB)
  • TIGRFAMS.hmm.h3m (~120 MB)
  • TIGRFAMS.hmm.h3i (~248 kB)
  • TIGRFAMS.hmm.h3f (~275 MB)
  • TIGRFAMS.hmm.h3p (~324 MB)

After hmmpress is finished, we can remove the original TIGRFAM files to free up a little space with the following command:

rm TIGR0*

Move the remaining pressed files (.hmm/.h3m/.h3i/.h3f/.h3p) into your RODEO hmm directory or whichever directory you've decided to point RODEO to.

3. Point RODEO to the TIGRFAM files

You can run RODEO now with this additional database using the -p flag (this assumes the files are in the tigrfam folder):

python rodeo_main.py BAL72547.1 -out lariatin_tigrfam -hmm tigrfam/*.hmm

The resulting HTML output will include the TIGRFAM results, as shown:

RODEOtigrfam.

4. Other considerations

Note that this works with any collection of HMMER-generated HMM files; they can be combined into one or run individually. The above process saves you from having to hmmpress 4424 files one-by-one.

Thanks to Brandon J. Burkhart (University of Illinois) for pointing this out!

Back to top