read.ptt(ProGenExpress) | R Documentation
Read an NCBI .ptt file into a data frame
read.ptt(file = NULL, idlimit = NULL, id = "Synonym")
file
The location of the file.
idlimit
IDs often carry extra information as a suffix, e.g. STM0007.1c. If you only want the "STM0007" part and not the ".1c" part, set idlimit to 7 so that only the first 7 characters of the ID are kept.
id
The column within the .ptt file to treat as the ID if idlimit is set (defaults to "Synonym").
This simply reads an NCBI .ptt file into a data frame. In all cases, the unaltered Synonym column is used for the row names. If idlimit is set then a new column is created called "oldid" which contains the unaltered column specified by the id argument. The column specified by the id argument is then truncated to idlimit characters.
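The truncation behaviour described above can be sketched in a few lines of base R. This is not the package's own code: truncate_ids is a hypothetical helper, and the toy data frame (including its Product values) is made up for illustration.

```r
# Toy stand-in for a parsed .ptt table (values are made up for illustration)
ptt <- data.frame(
  Synonym = c("STM0001", "STM0007.1c", "STM0010"),
  Product = c("protein A", "protein B", "protein C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the documented idlimit/id behaviour
truncate_ids <- function(df, idlimit = NULL, id = "Synonym") {
  # Row names always come from the unaltered Synonym column
  rownames(df) <- df$Synonym
  if (!is.null(idlimit)) {
    # Keep the original values in "oldid" before truncating
    df$oldid <- df[[id]]
    # Keep only the first idlimit characters of the ID column
    df[[id]] <- substr(df[[id]], 1, idlimit)
  }
  df
}

res <- truncate_ids(ptt, idlimit = 7)
```

With idlimit = 7, the second row's Synonym becomes "STM0007", while its oldid value and row name remain "STM0007.1c".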
Please note that NCBI .ptt files contain details of proteins, which means that genes which have been identified as pseudogenes will not be present in .ptt files. If you want to try and pick up pseudogenes, you can either append them to the results of read.ptt or use read.GenBank (though that is unsupported currently). However, if visualising microarray data, I am unconvinced of the need to visualise pseudogene expression.
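As a rough sketch of the first option, pseudogene records can be appended with rbind, provided they have the same columns as the read.ptt result. The data frames below are toy stand-ins, and the pseudogene ID "STM9999" is hypothetical.

```r
# Toy stand-in for the output of read.ptt() (made-up values)
ptt <- data.frame(
  Synonym = c("STM0001", "STM0002"),
  Product = c("protein A", "protein B"),
  row.names = c("STM0001", "STM0002"),
  stringsAsFactors = FALSE
)

# Hypothetical pseudogene record with the same columns as ptt
pseudo <- data.frame(
  Synonym = "STM9999",
  Product = "pseudogene",
  row.names = "STM9999",
  stringsAsFactors = FALSE
)

# Append the pseudogene row; row names are carried over by rbind
combined <- rbind(ptt, pseudo)
```

A real .ptt table has more columns than this toy example, so the appended rows would need a value (or NA) for each of them.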
NCBI .ptt files can be found by browsing the ftp site at ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
A data frame is returned. See STLT2 for details.
Michael Watson
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003 Sep 11;4(1):41.
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/
data(STLT2)
# produced by STLT2 <- read.ptt("NC_003197.ptt")
# view the first five proteins
STLT2[1:5,]