read.GenBank(ProGenExpress) | R Documentation |
This function attempts to read an NCBI RefSeq entry's gene features into a data frame.
Note that I do not intend to support this function, as the GenBank format is very variable.
At the time of writing, this function works on NC_003197 and NC_003198 plus a few others.
This should only be used if you are interested in genes other than those which are
transcribed AND translated (e.g. you are interested in pseudogenes).
For reading information on the genome/proteome organisation of a prokaryote I recommend
read.ptt.
read.GenBank(file = NULL)
file |
A GenBank file describing the genome. Those from RefSeq work best. |
The function looks for gene features within the feature table of a RefSeq entry. It creates two data frames and then joins them before returning the result.
The first data frame contains the location of the gene (which it gets from the gene feature), the gene name (which it gets from the /gene tag in the gene feature), the gene synonym (which it gets from the /locus_tag tag in the gene feature) and whether or not it is a pseudogene (which it gets by looking for a /pseudo tag within the gene feature). Examples which work well are:
     gene            complement(282467..284185)
                     /gene="proS"
                     /locus_tag="STM0242"
                     /db_xref="GeneID:1251760"
Examples that do not work so well are:
     gene            complement(1942582..1943223)
                     /gene="eda"
                     /locus_tag="kdgA"
                     /note="synonym: STY2091"
                     /pseudo
                     /db_xref="GeneID:1248435"
As can be seen in the second example, the synonym has been shifted into the /note tag and the /locus_tag tag has been used for an alternative gene name. There is nothing I can (or want to) do about this!
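The extraction described above can be sketched in a few lines of base R. This is a minimal sketch and not the code used by read.GenBank: the helper parse_gene_feature and the column names Start, Stop, Strand, Gene and Pseudo are assumptions made for illustration, and each feature is assumed to have been collapsed onto a single line with a simple start..stop location.

```r
# Hypothetical helper, not part of ProGenExpress: pull the fields described
# above out of the text of a single gene feature.
parse_gene_feature <- function(feature) {
  # Value of a quoted qualifier such as /gene="proS", or NA if it is absent
  tag_value <- function(tag) {
    pattern <- paste0("/", tag, "=\"([^\"]*)\"")
    hit <- regmatches(feature, regexec(pattern, feature))[[1]]
    if (length(hit) == 2) hit[2] else NA_character_
  }
  # First start..stop pair in the feature is taken as the location
  coords <- regmatches(feature, regexec("([0-9]+)\\.\\.([0-9]+)", feature))[[1]]
  data.frame(
    Start   = as.integer(coords[2]),
    Stop    = as.integer(coords[3]),
    Strand  = if (grepl("complement(", feature, fixed = TRUE)) "-" else "+",
    Gene    = tag_value("gene"),
    Synonym = tag_value("locus_tag"),               # taken from /locus_tag only
    Pseudo  = grepl("/pseudo", feature, fixed = TRUE),
    stringsAsFactors = FALSE
  )
}

# The two example features from the text, written on one line each
good <- 'gene complement(282467..284185) /gene="proS" /locus_tag="STM0242" /db_xref="GeneID:1251760"'
bad  <- 'gene complement(1942582..1943223) /gene="eda" /locus_tag="kdgA" /note="synonym: STY2091" /pseudo /db_xref="GeneID:1248435"'
genes <- rbind(parse_gene_feature(good), parse_gene_feature(bad))
print(genes)
```

Run on the two examples, the sketch gives the Synonym "STM0242" for proS but "kdgA" for eda, which is the mislabelling discussed above.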
The second data frame is a mapping from synonym to product (the protein). For this, the Synonym is taken from the /locus_tag of the CDS feature, and the protein is taken from the /product tag of the CDS feature.
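As a rough illustration of this mapping, the sketch below builds a Synonym-to-Product table from CDS feature text. It is not the code used by read.GenBank: the helper qualifier, the column name Product and the input strings (including the product name) are assumptions made for the example, and each CDS feature is assumed to sit on a single line.

```r
# Illustrative CDS features, one per line; these strings are made up for the
# example and are not copied from a RefSeq record.
cds <- c(
  'CDS complement(282467..284185) /gene="proS" /locus_tag="STM0242" /product="prolyl-tRNA synthetase"',
  'CDS 100..400 /locus_tag="XYZ0001"'
)

# Hypothetical helper: value of a quoted qualifier, or NA where it is missing
qualifier <- function(x, tag) {
  pattern <- paste0(".*/", tag, "=\"([^\"]*)\".*")
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

products <- data.frame(
  Synonym = qualifier(cds, "locus_tag"),
  Product = qualifier(cds, "product"),
  stringsAsFactors = FALSE
)
print(products)
```

A CDS feature without a /product tag ends up with NA in the Product column in this sketch; how read.GenBank itself treats such features is not stated here.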
After both data frames are complete, they are merged on Synonym and returned.
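The join itself can be pictured with base R's merge. The following is only a sketch under stated assumptions: the two small data frames and the columns other than Synonym are invented for illustration, and all.x = TRUE (which keeps genes that have no matching CDS) is my choice here, not something the function is documented to do.

```r
# Toy stand-ins for the two intermediate data frames (assumed layout)
genes <- data.frame(
  Synonym = c("STM0242", "kdgA"),
  Gene    = c("proS", "eda"),
  Pseudo  = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)
products <- data.frame(
  Synonym = "STM0242",
  Product = "prolyl-tRNA synthetase",  # illustrative product name
  stringsAsFactors = FALSE
)

# Join on the shared Synonym column; all.x = TRUE keeps every gene row and
# fills Product with NA where no CDS carries the same Synonym.
merged <- merge(genes, products, by = "Synonym", all.x = TRUE)
print(merged)
```

With an inner join (the default, all.x = FALSE) the gene without a product would be dropped from the result instead of appearing with an NA Product.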
NB this is not the most economical of functions and I am sure it could be improved. It takes some time to run and uses quite a bit of memory.
A data frame is returned - see STLT2
for the details.
This function is for demonstration purposes only. The GenBank format is so variable
that it will be almost impossible to support and maintain. Please use read.ptt instead.
Michael Watson
http://www.ncbi.nlm.nih.gov/projecte/RefSeq/
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html
# typhi (NC_003198 is the S. typhi CT18 entry)
#TYPHI <- read.GenBank("NC_003198.gbk")
#TYPHI[1:5,]
# typhimurium (NC_003197 is the S. typhimurium LT2 entry)
#TYPHIM <- read.GenBank("NC_003197.gbk")
#TYPHIM[1:5,]
# E coli
#ECOLI <- read.GenBank("NC_000913.gbk")
#ECOLI[1:5,]
# look for pseudo genes
#TYPHI[TYPHI$Pseduo == "pseudo",]