ProGenExpress 1.0 – Tutorial

Introduction

ProGenExpress is a package for the R statistical software (http://www.R-project.org) which has been written to aid researchers view numerical data (E.g. microarray data) in the context of the bacterial genomes on which that data is based. A basic knowledge of R will be useful, although this tutorial will attempt to cater for complete beginners.

Importing Microarray (numerical) Data

Data in ProGenExpress is stored internally in a structure called a “data frame”. In R, type:

?data.frame

for more information on this structure.

The first step is to start R and load the ProGenExpress package. In R, type:

library(ProGenExpress)

ProGenExpress comes packaged with some data already in a data frame, for use with this tutorial. To load this data, in R type:

data(IFR)

?IFR

IFR[1:5,]

This is data from Eriksson et al (2003) consisting of gene expression measurements for Salmonella typhimurium at 4h, 8h and 12h following macrophage infection. The data is stored in a data frame.

To read your own data into a data frame in R, the best format to have your data in is either a tab-delimited text file, or an Excel file, with each row representing a spot or gene on your microarray, and each column representing a particular experiment or microarray. The function “read.table” reads text files into a data frame, and the function “read.xls” in the gdata package reads excel files into a data frame. Type the following for more info:

library(gdata)

?read.table

?read.xls

For example, if your data is in a tab-delimited text file, with the first row as column headings, you may want something like:

mydata <- read.table(“myfile.txt”, header=TRUE, sep=”\t”, quote=”\””)

Reading a Genome

The next step is to read into R information relating to the genome of the organism on which the (microarray) measurements have been carried out. Again, ProGenExpress stores this information in a data frame. ProGenExpress comes packaged with data for the Salmonella typhimurium genome already included, and two functions which can create a data frame from commonly available genome resources.

First, the internal data. Type:

data(STLT2)

?STLT2

STLT2[1:5,]

The first five genes of the S typhimurium genome should be displayed, giving information on the start and stop locations of each gene, strand, length, PID, gene name, gene synonym, COGs code and ID and the gene product. The data frame has this information for all 4425 known expressed genes.

The above data frame was created using a ProGenExpress function called “read.ptt”. The file that this function read is called NC_003197.ptt, which is available at ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Salmonella_typhimurium_LT2/NC_003197.ptt. Similar files for many sequenced bacterial genomes are available from this ftp site, and can be accessed by browsing ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Once the .ptt file for the genome of interest has been downloaded, then read.ptt can be used:

?read.ptt

STLT2 <- read.ptt(“NC_003197.ptt”)

STYPHI <- read.ptt(“NC_003198.ptt”)

ECOLI <- read.ptt(“NC_ 000913.ptt”)

There is a wrapper function which allows the user to download a .ptt file interactively using lists. To access this, type:

read.NCBI()

Finally, there is a function read.GenBank which attempts to read whole genome GenBank files from RefSeq into a similar data structure as produced by read.ptt. This is a very difficult task as RefSeq GenBank entries are variable in format. However, the function read.GenBank has been shown to work with NC_003197, NC_003198 and NC_000913 Refseq genomes. To access this function and its help, type:

?read.GenBank

STLT2 <- read.GenBank(“NC_003197.gbk”)

STYPHI <- read.GenBank(“NC_003198.gbk”)

ECOLI <- read.GenBank(“NC_ 000913.gbk”)

These .gbk files are also accessible from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. The main advantage of using .gbk files instead of .ptt files is that .ptt files, by definition, only deal with genes which produce proteins, i.e. they do not contain information relating to pseudogenes. In my experience, researchers who wish to view numerical data, such as microarray data, in the context of the genome wish to deal only with expressed and translated genes, and so this is not a problem. However, in an attempt for completeness, read.GenBank does try to gather information on the location of pseudogenes. Due to the complexity of the function and the variability of the GenBank format, read.GenBank is provided “as is”, and I do not intend to support it as well as the other functions.

There is nothing to stop users creating their own data frames describing their genomes. The data frame, as a minimum, must have columns Start, Stop, Strand, Gene and Synonym. Following the NCBI convention, Start is always less than Stop and Strand should be either + or -.

Linking them together

We now have two data frames, one containing microarray data, and the other containing the genome. We need to link these two together so ProGenExpress can draw the data. We wish to end up with a new data frame, with one row per gene, and columns representing gene names and locations, and the numerical data.

As many microarray experiments contain multiple spots for each gene, we need a way of condensing these multiple measurements into a single measurement. For simplicity, ProGenExpress takes the mean value.

In order to link rows from the microarray data to rows from the genome data, the two data frames must have a column in common which tells ProGenExpress which rows correspond. As a default, ProGenExpress uses the “Synonym” column where this exists, though users can define which columns are used. To produce a linked data frame, after loading the data (here we will use the provided datasets, IFR and STLT2), type:

linked <- linkem.avg(STLT2,IFR,"Synonym","Synonym",

cnames=c("M4h","M8h","M12h"))

To access help on this function, type

?linkem.avg

The first and second arguments specify the genome data and the microarray data to link together. The third and fourth arguments indicate the columns in each data set which identify equivalent rows. Finally, the cnames argument specifies the column names in IFR which contain the numerical data we are interested in.

Displaying by base number

Now we have the data in the format we need, we can display it. There are two functions which do this – plotrange and plotrange.vertical. To access help for these two, type:

?plotrange

?plotrange.vertical

There are many options to play with, but the simplest way of calling these is:

plotrange(linked,start=1,stop=50000,cnames=c("M4h","M8h","M12h"))

This shows all genes within the base range 1 – 50000, with one bar per gene per time point. The above plot ignores intergenic distance. A more representative plot of the genome can be produced by:

plotrange(linked,start=1,stop=50000,cnames=c("M4h","M8h","M12h"),distscale=50)

Here a scaling factor of 50 has been used, and the gaps between the genes on the plot are representative of intergenic distance. See:

?plotrange

for more detailed information.

ProGenExpress can also plot vertical images of the genome:

plotrange.vertical(linked,start=1,stop=40000,cnames="M4h",

display.gname="Synonm", posleft=FALSE,

distscale=100)

The call to plotrange.vertical demonstrates some of the other arguments to the two functions – now only one bar is displayed for each gene (we specified only “M4h” with the cnames argument), and the Synonym is displayed instead of the gene name.

Finally, regions of the genome may be highlighted. These regions are identified by either gene name or Synonym:

highlight <- data.frame(start="talB", stop="bcfH", name="highlight", color="grey")

plotrange(linked,start=1,stop=50000, c("M4h","M8h","M12h") ,

display.gname="Synonym", highlight=highlight,

distscale=30, genomecol="dark grey")

Displaying by gene name

Similar to the plotrange and plotrange.vertical functions, users can plot ranges of the genome identified by gene names or synonyms. The function is called plotrange.bygene, and help can be accessed by typing:

?plotrange.bygene

The function uses plotrange and plotrange.vertical internally and hence takes similar arguments and outputs similar figures:

plotrange.bygene(linked,start="thrL",stop="nhaR", cnames=c("M4h","M8h","M12h"),

distscale=50)

plotrange.bygene(linked,start="STM0018",stop="STM0040",

cnames=c("M4h","M8h","M12h"), vertical=TRUE,

match.gname="Synonym", distscale=50)

The two graphs show similar plots, both scaled to show intergenic distance. The vertical plot is selected by STM number (gene Synonym) and the horizontal plot by gene name, demonstrating the flexibility of the system. In both plots, the bcfAH operon can clearly be seen.

Coloring by COGs category

One of the benefits of using .ptt files for genome information is that we get additional information – namely, information from the COGs database, a database of clusters of orthologous proteins across species. More information on this database can be found in Tatusov et al (2003). The COGs database, amongst other things, assigns each gene one of 26 different functional catgories, and we may wish to colour our graphs by this categorisation.

The COGs database can be found at http://www.ncbi.nlm.nih.gov/COG/new/, and a list of the COGs functional categories at http://www.ncbi.nlm.nih.gov/COG/old/palox.cgi?fun=all. This data has already been included in ProGenExpress. To access it, type:

?cogstab

data(cogstab)

The dataset cogstab already has colors assigned to each category, by me, using the R function rainbow(). If you want different colors, you just need to create a vector of 26 colors and do something like:

cogstab$Color <- mycolors

Once you have the data loaded, the graphs can be colored by COGs category:

plotrange(linked,1,50000,col.cogs=cogstab,cnames=c("M4h","M8h","M12h"),

distscale=50)

The above command also plots a COGS legend:

Coloring by COGS category can be used with any of the above calls to plotrange, plotrange.bygene and plotrange.vertical.

Finding Interesting Regions

ProGenExpress provides a simple function for analysing combined genome and numerical data to search for regions of interest. These are defined as clusters of genes that are close together and that show similar patterns of behaviour in the numerical data. Clusters of genes which are close together and show similar behaviour may represent putative operons, pathogenicity islands or prophage sequences.

To use this function, type the following:

reg <- find.region(linked, cnames=c("M4h","M8h","M12h"), maxdist=150)

The “maxdist” argument specifies the maximum number of base pairs between the end of one gene and the start of the next for them to be considered “close”. Typing:

nrow(reg)

shows that “find.region” has identified 206 regions of interest. The first five of these can be shown by typing:

reg[1:5,]

This prints:

start stop

1 STM0021 STM0028.1n

2 STM0051 STM0054

3 STM0100 STM0103

4 STM0104 STM0108

5 STM0123 STM0128

To examine the first region, use another ProGenExpress utility function “show.region”:

show.region(linked,start="STM0021",stop="STM0028.1n",

cnames=c("M4h","M8h","M12h"))

This prints out the data for that region to the screen and draws a plot:

Here we can see the first region that find.region has identified is the bcf operon.

Examples

# this whole section can be cut and pasted into R

# first load the library and data

library(ProGenExpress)

data(IFR)

data(STLT2)

data(cogstab)

# link the data together

linked <- linkem.avg(STLT2,IFR,"Synonym","Synonym",cnames=c("M4h","M8h","M12h"))

# create an image of Spi-1

x11()

highlight <- data.frame(start="sitA", stop="invH", name="spi-1", color="yellow")

plotrange.bygene(linked, start="sitA", stop="invH", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, distscale=50)

# and another, this time colored by functional category

x11()

plotrange.bygene(linked, start="sitA", stop="invH", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, col.cogs=cogstab, cogs.legend=FALSE,

distscale=100)

# and plot the legend

x11()

plot.cogs.legend(cogstab)

# create an image of Spi-2

x11()

highlight <- data.frame(start="pykF", stop="ydhE", name=”spi-2”)

plotrange.bygene(linked, start="pykF", stop="ydhE", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, distscale=30)

# and another, this time colored by functional category

x11()

plotrange.bygene(linked, start="ydhE", stop="pykF", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, col.cogs=cogstab, cogs.legend=FALSE,

distscale=80)

# and plot the legend

x11()

plot.cogs.legend(cogstab)

# plot an image of the fli operon

highlight <- data.frame(start="fliE", stop="fliR")

x11()

plotrange.bygene(linked, start="fliE", stop="fliR", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, distscale=50)

# and another, this time colored by functional category

x11()

plotrange.bygene(linked, start="fliE", stop="fliR", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, col.cogs=cogstab, cogs.legend=FALSE,

distscale=50)

# and plot the legend

x11()

plot.cogs.legend(cogstab)

# plot the fli operon - vertically

x11()

plotrange.bygene(linked, start="fliE", stop="fliR", cnames=c("M4h","M8h","M12h"),

pad=5000, highlight=highlight, vertical=TRUE, distscale=50)

#ssaK, L, M, V, N, O, P, Q, R, S, T, U constitute one operon

highlight <- data.frame(start="ssaK", stop="ssaU", name=”ssa operon”)

x11()

plotrange.bygene(linked, start="ssaK", stop="ssaU",

cnames=c("M4h","M8h","M12h"), pad=5000,

highlight=highlight, match.gname="Gene",

display.gname="Gene", distscale=50)

# NB this section should only be attempted by those with a lot of RAM (256Mb+) and a good

# graphics card (if you are running windows)

# finally, plot the whole genome

# note SPI-9 starts at STM2689, but this is a pseudogene in typhimurium

# and so won't show up in the .ptt file

highlight <- data.frame(name=c("SPI-1","SPI-2","SPI-3","SPI-4","SPI-5","SPI-6","SPI-9",

"SPI-11"), start=c("STM2861","STM1379","STM3752","STM4257","STM1087","STM0266",

"STM2690","STM4488"),stop=c("STM2907","STM1422","STM3764","STM4262","STM1096",

"STM0307","STM2692","STM4498"))

# highlight looks like this:

# name start stop

#1 SPI-1 STM2861 STM2907

#2 SPI-2 STM1379 STM1422

#3 SPI-3 STM3752 STM3764

#4 SPI-4 STM4257 STM4262

#5 SPI-5 STM1087 STM1096

#6 SPI-6 STM0266 STM0307

#7 SPI-9 STM2690 STM2692

#8 SPI-11 STM4488 STM4498

# draw the whole genome horizontally to a file called myshinygenome.jpg

jpeg("myshinygenome.jpg",height=480,width=50000,quality=100)

par(xaxs="i")

plotrange(linked,start=1,stop=5000000,cnames=c("M4h","M8h","M12h"),

highlight=highlight, display.gname="Gene", gname="Synonym", distscale=50)

dev.off()

# draw the whole genome vertically to a file called myshinygenome_vertical.jpg

jpeg("myshinygenome_vertical.jpg",height=50000,width=480,quality=100)

par(yaxs="i")

plotrange.vertical(linked,start=1,stop=5000000,cnames=c("M4h","M8h","M12h"),

highlight=highlight, display.gname="Gene", gname="Synonym",

distscale=50)

dev.off()

# draw the whole genome vertically to a file called myshinygenome_vertical_cogs.jpg

# color the data by COGs category

jpeg("myshinygenome_vertical_cogs.jpg",height=50000,width=480,quality=100)

par(yaxs="i")

plotrange.vertical(linked,start=1,stop=5000000,cnames=c("M4h","M8h","M12h"),

highlight=highlight, display.gname="Gene", gname="Synonym",

col.cogs=cogstab, cogs.legend=FALSE, distscale=50)

dev.off()

# plot the COGs legend to a file called cogslegend.jpg

jpeg("cogslegend.jpg", quality=100)

plot.cogs.legend(cogstab)

dev.off()

References

Eriksson S., Lucchini S., Thompson A., Rhen M. & Hinton J. C. D.

(2003) Unravelling the biology of macrophage infection by gene

expression profiling of intracellular Salmonella enterica

Molecular Microbiology 47 (1) 103-118

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin

EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS,

Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. (2003)

The COG database: an updated version includes eukaryotes. BMC

Bioinformatics. 2003 Sep 11;4(1):41.