Introduction
ProGenExpress is a package for the R statistical
software (http://www.R-project.org)
which has been written to aid researchers view numerical data (E.g. microarray
data) in the context of the bacterial genomes on which that data is based. A basic knowledge of R will be useful,
although this tutorial will attempt to cater for complete beginners.
Importing Microarray (numerical) Data
Data in ProGenExpress is stored
internally in a structure called a “data frame”. In R, type:
?data.frame
for more information on this
structure.
The first step is to start R and load the ProGenExpress package.
In R, type:
library(ProGenExpress)
ProGenExpress comes packaged with some data
already in a data frame, for use with this tutorial. To load this data, in R type:
data(IFR)
?IFR
IFR[1:5,]
This is data from Eriksson et al (2003) consisting of gene expression
measurements for Salmonella typhimurium at 4h, 8h and
12h following macrophage infection. The
data is stored in a data frame.
To read your own data into a data frame in R, the best
format to have your data in is either a tab-delimited text file, or an Excel
file, with each row representing a spot or gene on your microarray, and each
column representing a particular experiment or microarray. The function “read.table”
reads text files into a data frame, and the function “read.xls” in the gdata package reads excel files
into a data frame. Type the following
for more info:
library(gdata)
?read.table
?read.xls
For example, if your data is in a tab-delimited text file,
with the first row as column headings, you may want something like:
mydata <- read.table(“myfile.txt”,
header=TRUE, sep=”\t”, quote=”\””)
Reading a Genome
The next step is to read into R information relating to the
genome of the organism on which the (microarray) measurements have been carried
out. Again, ProGenExpress
stores this information in a data frame.
ProGenExpress comes packaged with data for the
Salmonella typhimurium
genome already included, and two functions which can create a data frame from
commonly available genome resources.
First, the internal data.
Type:
data(STLT2)
?STLT2
STLT2[1:5,]
The first five genes of the S typhimurium genome should be displayed,
giving information on the start and stop locations of each gene, strand,
length, PID, gene name, gene synonym, COGs code and
ID and the gene product. The data frame
has this information for all 4425 known expressed genes.
The above data frame was created using a ProGenExpress
function called “read.ptt”. The file that this function read is called
NC_003197.ptt, which is available at ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Salmonella_typhimurium_LT2/NC_003197.ptt. Similar files for many sequenced bacterial
genomes are available from this ftp site, and can be accessed by browsing ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Once the .ptt file
for the genome of interest has been downloaded, then read.ptt
can be used:
?read.ptt
STLT2 <-
read.ptt(“NC_003197.ptt”)
STYPHI
<- read.ptt(“NC_003198.ptt”)
ECOLI <-
read.ptt(“NC_
000913.ptt”)
There is a wrapper function which allows the user to
download a .ptt file interactively using lists. To access this, type:
read.NCBI()
Finally, there is a function read.GenBank
which attempts to read whole genome GenBank files
from RefSeq into a similar data structure as produced
by read.ptt.
This is a very difficult task as RefSeq GenBank entries are variable in format. However, the function read.GenBank
has been shown to work with NC_003197, NC_003198 and NC_000913 Refseq genomes. To
access this function and its help, type:
?read.GenBank
STLT2 <-
read.GenBank(“NC_003197.gbk”)
STYPHI
<- read.GenBank(“NC_003198.gbk”)
ECOLI <-
read.GenBank(“NC_ 000913.gbk”)
These .gbk files are also
accessible from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. The main advantage of using .gbk files instead of .ptt files
is that .ptt files, by definition, only deal with
genes which produce proteins, i.e. they do not contain information relating to pseudogenes. In my
experience, researchers who wish to view numerical data, such as microarray
data, in the context of the genome wish to deal only with expressed and
translated genes, and so this is not a problem.
However, in an attempt for completeness, read.GenBank
does try to gather information on the location of pseudogenes. Due to the complexity of the function and the
variability of the GenBank format, read.GenBank is provided “as is”, and I do not intend to
support it as well as the other functions.
There is nothing to stop users creating their own data
frames describing their genomes. The
data frame, as a minimum, must have columns Start, Stop,
Linking them together
We now have two data frames, one containing microarray data,
and the other containing the genome. We
need to link these two together so ProGenExpress can
draw the data. We wish to end up with a
new data frame, with one row per gene, and columns representing gene names and
locations, and the numerical data.
As many microarray experiments contain multiple spots for
each gene, we need a way of condensing these multiple measurements into a single
measurement. For simplicity, ProGenExpress takes the mean value.
In order to link rows from the microarray data to rows from
the genome data, the two data frames must have a column in common which tells ProGenExpress which rows correspond. As a default, ProGenExpress
uses the “Synonym” column where this exists, though users can define which
columns are used. To produce a linked
data frame, after loading the data (here we will use the provided datasets, IFR
and STLT2), type:
linked <- linkem.avg(STLT2,IFR,"Synonym","Synonym",
cnames=c("M4h","M8h","M12h"))
To access help on this function, type
?linkem.avg
The first and second arguments specify the genome data and
the microarray data to link together.
The third and fourth arguments indicate the columns in each data set
which identify equivalent rows. Finally,
the cnames argument specifies the column names in IFR
which contain the numerical data we are interested in.
Displaying by base number
Now we have the data in the format we need, we can display
it. There are two functions which do
this – plotrange and plotrange.vertical. To access help for these two, type:
?plotrange
?plotrange.vertical
There are many options to play with, but the simplest way of
calling these is:
plotrange(linked,start=1,stop=50000,cnames=c("M4h","M8h","M12h"))
This shows all genes within the base range 1 – 50000, with
one bar per gene per time point. The
above plot ignores intergenic distance. A more representative plot of the genome can
be produced by:
plotrange(linked,start=1,stop=50000,cnames=c("M4h","M8h","M12h"),distscale=50)
Here a scaling factor of 50 has been used, and the gaps
between the genes on the plot are representative of intergenic
distance. See:
?plotrange
for more detailed information.
ProGenExpress can also plot vertical images of
the genome:
plotrange.vertical(linked,start=1,stop=40000,cnames="M4h",
display.gname="Synonm",
posleft=FALSE,
distscale=100)
The call to plotrange.vertical
demonstrates some of the other arguments to the two functions – now only one
bar is displayed for each gene (we specified only “M4h” with the cnames argument), and the Synonym is displayed instead of
the gene name.
Finally, regions of the genome may be highlighted. These regions are identified by either gene
name or Synonym:
highlight <- data.frame(start="talB", stop="bcfH",
name="highlight", color="grey")
plotrange(linked,start=1,stop=50000, c("M4h","M8h","M12h")
,
display.gname="Synonym", highlight=highlight,
distscale=30, genomecol="dark
grey")
Displaying by gene name
Similar to the plotrange and plotrange.vertical functions, users can plot ranges of the
genome identified by gene names or synonyms.
The function is called plotrange.bygene, and
help can be accessed by typing:
?plotrange.bygene
The function uses plotrange and plotrange.vertical internally and hence takes similar
arguments and outputs similar figures:
plotrange.bygene(linked,start="thrL",stop="nhaR", cnames=c("M4h","M8h","M12h"),
distscale=50)
plotrange.bygene(linked,start="STM0018",stop="STM0040",
cnames=c("M4h","M8h","M12h"),
vertical=TRUE,
match.gname="Synonym", distscale=50)
The two graphs show similar plots, both scaled to show intergenic distance.
The vertical plot is selected by STM number (gene Synonym) and the
horizontal plot by gene name, demonstrating the flexibility of the system. In both plots, the bcfAH
operon can clearly be seen.
Coloring by COGs category
One of the benefits of using .ptt
files for genome information is that we get additional information – namely,
information from the COGs database, a database of
clusters of orthologous proteins across species. More information on this database can be
found in Tatusov et
al (2003). The COGs
database, amongst other things, assigns each gene one of 26 different
functional catgories, and we may wish to colour our
graphs by this categorisation.
The COGs database can be found at http://www.ncbi.nlm.nih.gov/COG/new/,
and a list of the COGs functional categories at http://www.ncbi.nlm.nih.gov/COG/old/palox.cgi?fun=all. This data has already been included in ProGenExpress. To
access it, type:
?cogstab
data(cogstab)
The dataset cogstab already has colors assigned to each category, by me, using the R
function rainbow().
If you want different colors, you just need to
create a vector of 26 colors and do something like:
cogstab$Color <- mycolors
Once you have the data loaded, the graphs can be colored by COGs category:
plotrange(linked,1,50000,col.cogs=cogstab,cnames=c("M4h","M8h","M12h"),
distscale=50)
The above command also plots a COGS legend:
Coloring by COGS category can be used with
any of the above calls to plotrange, plotrange.bygene and plotrange.vertical.
Finding Interesting Regions
ProGenExpress provides a simple function for
analysing combined genome and numerical data to search for regions of
interest. These are defined as clusters
of genes that are close together and that show similar patterns of behaviour in
the numerical data. Clusters of genes
which are close together and show similar behaviour may represent putative operons, pathogenicity islands or
prophage sequences.
To use this function, type the following:
reg <- find.region(linked,
cnames=c("M4h","M8h","M12h"),
maxdist=150)
The “maxdist” argument specifies the
maximum number of base pairs between the end of one gene and the start of the
next for them to be considered “close”.
Typing:
nrow(reg)
shows that “find.region”
has identified 206 regions of interest.
The first five of these can be shown by typing:
reg[1:5,]
This prints:
start stop
1 STM0021 STM0028.1n
2 STM0051 STM0054
3 STM0100 STM0103
4 STM0104 STM0108
5 STM0123 STM0128
To examine the first region, use another ProGenExpress
utility function “show.region”:
show.region(linked,start="STM0021",stop="STM0028.1n",
cnames=c("M4h","M8h","M12h"))
This prints out the data for that region to the screen and
draws a plot:
Here we can see the first region that find.region
has identified is the bcf operon.
Examples
# this whole section can be cut and
pasted into R
# first load the library and data
library(ProGenExpress)
data(IFR)
data(STLT2)
data(cogstab)
# link the data together
linked <- linkem.avg(STLT2,IFR,"Synonym","Synonym",cnames=c("M4h","M8h","M12h"))
# create an image of Spi-1
x11()
highlight <- data.frame(start="sitA", stop="invH",
name="spi-1", color="yellow")
plotrange.bygene(linked, start="sitA", stop="invH",
cnames=c("M4h","M8h","M12h"),
pad=5000,
highlight=highlight, distscale=50)
# and another, this time colored by functional category
x11()
plotrange.bygene(linked, start="sitA", stop="invH",
cnames=c("M4h","M8h","M12h"),
pad=5000, highlight=highlight, col.cogs=cogstab, cogs.legend=FALSE,
distscale=100)
# and plot the legend
x11()
plot.cogs.legend(cogstab)
# create an image of Spi-2
x11()
highlight <- data.frame(start="pykF", stop="ydhE",
name=”spi-2”)
plotrange.bygene(linked, start="pykF", stop="ydhE",
cnames=c("M4h","M8h","M12h"),
pad=5000,
highlight=highlight, distscale=30)
# and another, this time colored by functional category
x11()
plotrange.bygene(linked, start="ydhE", stop="pykF",
cnames=c("M4h","M8h","M12h"),
pad=5000, highlight=highlight, col.cogs=cogstab, cogs.legend=FALSE,
distscale=80)
# and plot the legend
x11()
plot.cogs.legend(cogstab)
# plot an image of the fli operon
highlight <- data.frame(start="fliE", stop="fliR")
x11()
plotrange.bygene(linked, start="fliE", stop="fliR",
cnames=c("M4h","M8h","M12h"),
pad=5000, highlight=highlight, distscale=50)
# and another, this time colored by functional category
x11()
plotrange.bygene(linked, start="fliE", stop="fliR",
cnames=c("M4h","M8h","M12h"),
pad=5000, highlight=highlight, col.cogs=cogstab, cogs.legend=FALSE,
distscale=50)
# and plot the legend
x11()
plot.cogs.legend(cogstab)
# plot the fli operon
- vertically
x11()
plotrange.bygene(linked, start="fliE", stop="fliR",
cnames=c("M4h","M8h","M12h"),
pad=5000, highlight=highlight, vertical=TRUE, distscale=50)
#ssaK, L, M, V, N, O, P, Q, R, S,
T, U constitute one operon
highlight <- data.frame(start="ssaK", stop="ssaU",
name=”ssa operon”)
x11()
plotrange.bygene(linked, start="ssaK", stop="ssaU",
cnames=c("M4h","M8h","M12h"),
pad=5000,
highlight=highlight, match.gname="Gene",
display.gname="Gene", distscale=50)
# NB this section should only be attempted by those with a
lot of RAM (256Mb+) and a good
# graphics card (if you are running
windows)
# finally, plot the whole genome
# note SPI-9 starts at STM2689, but
this is a pseudogene in typhimurium
# and so won't show up in the .ptt file
highlight <- data.frame(name=c("SPI-1","SPI-2","SPI-3","SPI-4","SPI-5","SPI-6","SPI-9",
"SPI-11"), start=c("STM2861","STM1379","STM3752","STM4257","STM1087","STM0266",
"STM2690","STM4488"),stop=c("STM2907","STM1422","STM3764","STM4262","STM1096",
"STM0307","STM2692","STM4498"))
# highlight looks like this:
# name start stop
#1
SPI-1 STM2861 STM2907
#2
SPI-2 STM1379 STM1422
#3
SPI-3 STM3752 STM3764
#4
SPI-4 STM4257 STM4262
#5
SPI-5 STM1087 STM1096
#6
SPI-6 STM0266 STM0307
#7
SPI-9 STM2690 STM2692
#8 SPI-11 STM4488 STM4498
# draw the whole genome
horizontally to a file called myshinygenome.jpg
jpeg("myshinygenome.jpg",height=480,width=50000,quality=100)
par(xaxs="i")
plotrange(linked,start=1,stop=5000000,cnames=c("M4h","M8h","M12h"),
highlight=highlight, display.gname="Gene",
gname="Synonym", distscale=50)
dev.off()
# draw the whole genome vertically
to a file called myshinygenome_vertical.jpg
jpeg("myshinygenome_vertical.jpg",height=50000,width=480,quality=100)
par(yaxs="i")
plotrange.vertical(linked,start=1,stop=5000000,cnames=c("M4h","M8h","M12h"),
highlight=highlight, display.gname="Gene",
gname="Synonym",
distscale=50)
dev.off()
# draw the whole genome vertically
to a file called myshinygenome_vertical_cogs.jpg
# color the data by COGs category
jpeg("myshinygenome_vertical_cogs.jpg",height=50000,width=480,quality=100)
par(yaxs="i")
plotrange.vertical(linked,start=1,stop=5000000,cnames=c("M4h","M8h","M12h"),
highlight=highlight, display.gname="Gene",
gname="Synonym",
col.cogs=cogstab, cogs.legend=FALSE, distscale=50)
dev.off()
# plot the COGs legend to a file
called cogslegend.jpg
jpeg("cogslegend.jpg",
quality=100)
plot.cogs.legend(cogstab)
dev.off()
References
Eriksson S., Lucchini S., Thompson
A., Rhen M. & Hinton J. C. D.
(2003) Unravelling the biology of macrophage
infection by gene
expression profiling of intracellular Salmonella enterica
Molecular Microbiology 47 (1) 103-118
Tatusov RL,
EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS,
Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale
DA. (2003)
The COG database: an updated version
includes eukaryotes. BMC
Bioinformatics. 2003 Sep 11;4(1):41.