Attempt to read an NCBI RefSeq entry into a data frame

read.GenBank(ProGenExpress)

R Documentation

Description

This function attempts to read the gene features of an NCBI RefSeq entry into a data frame. Note that I do not intend to support this function, as the GenBank format is very variable. At the time of writing, it works on NC_003197 and NC_003198 plus a few others. It should only be used if you are interested in genes other than those which are transcribed AND translated (e.g. you are interested in pseudogenes). For reading information on the genome/proteome organisation of a prokaryote, I recommend read.ptt.

Usage

read.GenBank(file = NULL)

Arguments

file A GenBank file describing the genome. Those from RefSeq work best.

Details

The function looks for gene features within the feature table of a RefSeq entry. It creates two data frames and then joins them before returning the result.

The first data frame contains the location of the gene (taken from the gene feature), the gene name (taken from the /gene tag in the gene feature), the gene synonym (taken from the /locus_tag tag in the gene feature) and whether or not it is a pseudogene (determined by looking for a /pseudo tag within the gene feature). Examples which work well are:

        gene    complement(282467..284185)
                  /gene="proS"
                  /locus_tag="STM0242"
                  /db_xref="GeneID:1251760"

Examples that do not work so well are:

        gene    complement(1942582..1943223)
                  /gene="eda"
                  /locus_tag="kdgA"
                  /note="synonym: STY2091"
                  /pseudo
                  /db_xref="GeneID:1248435"

As can be seen in the second example, the synonym has been shifted into the /note tag and the /locus_tag tag has been used for an alternative gene name. There is nothing I can (or want to) do about this!
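
The difference between the two cases can be illustrated with a minimal sketch. This is not the code used by read.GenBank; parse_gene_feature is a hypothetical helper, and the column names and the logical Pseudo column are assumptions made for illustration. It simply applies the tag rules described above to the two example features:

```r
# Hypothetical helper, NOT the ProGenExpress implementation: apply the
# tag rules described above to the lines of a single gene feature.
parse_gene_feature <- function(lines) {
  lines <- trimws(lines)
  # Return the quoted value of a /tag="value" qualifier, or NA if absent
  get_tag <- function(tag) {
    pattern <- paste0("^/", tag, "=\"(.*)\"$")
    hit <- grep(pattern, lines, value = TRUE)
    if (length(hit) == 0) NA_character_ else sub(pattern, "\\1", hit[1])
  }
  data.frame(
    Location = sub("^gene[[:space:]]+", "", lines[1]),
    Gene     = get_tag("gene"),
    Synonym  = get_tag("locus_tag"),       # assumed column name
    Pseudo   = any(lines == "/pseudo"),    # assumed to be logical here
    stringsAsFactors = FALSE
  )
}

good <- c('gene    complement(282467..284185)',
          '/gene="proS"',
          '/locus_tag="STM0242"',
          '/db_xref="GeneID:1251760"')

bad <- c('gene    complement(1942582..1943223)',
         '/gene="eda"',
         '/locus_tag="kdgA"',
         '/note="synonym: STY2091"',
         '/pseudo',
         '/db_xref="GeneID:1248435"')

good_row <- parse_gene_feature(good)  # Synonym is "STM0242"
bad_row  <- parse_gene_feature(bad)   # Synonym is "kdgA", not "STY2091"
```

In the second case the Synonym column ends up holding the alternative gene name, so any later join on Synonym will only work if the CDS feature carries the same /locus_tag.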

The second data frame is a mapping from synonym to product (the protein). For this, the Synonym is taken from the /locus_tag of the CDS feature, and the protein is taken from the /product tag of the CDS feature.

After both data frames are complete, they are merged on Synonym and returned.
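
The merge step might look something like the sketch below, which uses base R's merge(). The rows are toy values, the column names other than Synonym are assumptions, and all.x = TRUE (keeping genes that have no matching CDS) is also an assumption; the documentation does not say which kind of join read.GenBank performs.

```r
# Toy gene table (first data frame); values are for illustration only
genes <- data.frame(
  Synonym = c("STM0242", "kdgA"),
  Gene    = c("proS", "eda"),
  Pseudo  = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Toy synonym-to-product mapping (second data frame); the pseudogene
# has no entry here
products <- data.frame(
  Synonym = "STM0242",
  Product = "prolyl-tRNA synthetase",
  stringsAsFactors = FALSE
)

# Join on Synonym; all.x = TRUE keeps genes without a product (assumed)
merged <- merge(genes, products, by = "Synonym", all.x = TRUE)
```

With a join of this kind, a gene that has no product simply gets NA in the Product column rather than being dropped.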

NB: this is not the most economical of functions and I am sure it could be improved. It takes some time to run and uses quite a bit of memory.

Value

A data frame is returned - see STLT2 for the details.

Note

This function is for demonstration purposes only. The GenBank format is so variable that it will be almost impossible to support and maintain. Please use read.ptt instead.

Author(s)

Michael Watson

References

http://www.ncbi.nlm.nih.gov/projecte/RefSeq/

http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html

Examples


        # typhi (NC_003198)
        #TYPHI <- read.GenBank("NC_003198.gbk")
        #TYPHI[1:5,]

        # typhimurium (NC_003197)
        #TYPHIM <- read.GenBank("NC_003197.gbk")
        #TYPHIM[1:5,]

        # E. coli
        #ECOLI <- read.GenBank("NC_000913.gbk")
        #ECOLI[1:5,]

        # look for pseudogenes
        #TYPHI[TYPHI$Pseduo == "pseudo",]

[Package ProGenExpress version 1.0 Index]