Fellow GCGer, Thanks for your interest in the programs I wrote to make it easier to use the GenBank BLAST database searching tool that is available via Email. The procedures are in the following order, separated by a string of 20 asterisks: BLASTMAIL.COM BLASTMAILQ.COM TOFASTA.FOR WHAT THESE SHELLS DO -------------------- Early in September 1991 GenBank made available the BLAST database searching service that can be accessed via electronic mail. There are no charges to use this service. These programs construct messages in the format expected by GenBnak and send them by VMS mail. The users do not have to be familiar with the format of the message that GenBank expects, nor with sending email across the network; however, they do need to know how to read the mail that GenBank returns. BLASTMAIL.COM initiates a BLAST search of one of the GenBank databases. It enquires about the sequence name and search region, converts the sequence to Fasta format (required by the GenBank system), constructs the message and mails it off. BLASTMAILQ.COM sends a message to the GenBank BLAST server inquiring about the state of the queue. This is used to determine where your job is waiting in the queue. TOFASTA.FOR is a program that converts a GCG sequence to the Fasta format expected by the GenBank server. It also returns some values which the BLASTMAIL.COM procedure needs. This program will have to be compiled and linked under the GCG environment. GenBank also provides a Fasta database searching program that is accessible via email. Both Fasta and BLAST search more recent databases than we keep locally, offload the processing from our VAX, and are very fast. BLAST has the advantage over Fasta that it is about 10 times faster for equal sensitity for distant matches, and searches both strands, whereas Fasta only searches one strand. BLAST should return your results in a matter of minutes. To get more information about the BLAST program, GenBank will send you a short document describing it if you send the word HELP on the first line of an email message (with no subject: line) to BLAST@GENBANK.BIO.NET. I strongly recommend that you do this. INSTALLATION ------------ I put the command procedures in a subdirectory of my personal account, away from the rest of the GCG stuff. This is so that 1) I can find them if I need to make changes, and 2) so they don't get touched by anything on the GCG update tapes (not likely, but always possible). The appropriate initializing GCG command procedure has to be edited so that the symbols BLASTMAIL and BLASTMAILQ point to the shells. You will also have to add a symbol for the program TOFASTA, such as $ TOFASTA :== $device:[directory]TOFASTA Don't forget the $ sign in front of the device name or logical. TOFASTA must be compiled and linked. This is done by initiating the GCG support environment with the command GCGSUPPORT, compiling the program (FORTRAN/EXTEND TOFASTA) and linking it (GENLINK TOFASTA). The genlink command knows where all the appropriate object libraries are kept. The two command procedures use a logical name called SEARCH_ADDRESS that is equated to the network address of the GenBank server. This WILL have to be changed to accomodate your local conditions, such as gateways, or whatever. You will have to determine what this should be set to for yourself. OPERATION --------- If the installation is done properly, you should be able to run the shells just by typing their names, as with other GCG programs. They are completely menu-driven. The best way to see how they work is to try them. BLASTMAIL allows you to specify on the command line the sequence in question. For example, to have GenBank perform a BLAST search of a sequence called nobel_sexy.seq, typing "blastmail nobel_sexy.seq" will result in the shell skipping the prompt asking for the sequence name. MODIFICATIONS, UPDATES, BUGS ---------------------------- If you find that the shells don't support all the options you need, feel free to add to them. It should be easy to find where to make changes because the programs are pretty well commented, as these things go, and besides, isn't it true that DCL is the only real self-documenting language? I only request that you leave the lines at the top of the files that reference me as the author, and the institution where I work, unless you make such hash out of them that they no longer work, in which case I would be happy to relinquish credit. Alternatively, send suggestions for enhancements to me, and I'll consider them. If you find any bugs, please let me know so I can fix them and alert the other users of these shells. Stephen Clark, Ph.D. Division of Biological Research The Ontario Cancer Institute 500 Sherbourne St Toronto, Ontario Canada M4X 1K9 clark@galen.oci.utoronto.ca (Internet) clark@utoroci (Netnorth/Bitnet) ******************** $ ! BLASTMAIL.COM $ $ ! September 28, 1991 $ $ ! Written by Steve Clark $ ! The Ontario Cancer Institute, Toronto, Canada $ $ ! Command procedure to send a sequence to GenBank to have a BLAST $ ! search performed on it. THis procedure asks all the relevant $ ! questions, constructs a text file with the sequence in native Fasta $ ! format, and mails it to GenBank. It accepts the name of the query $ ! sequence on the command line as P1. $ ! Note: the logical name SEARCH_ADDRESS is the network address for $ ! the Genbank BLAST Search service. This will have to be changed to $ ! accomodate local gateways, etc. $ $ on control_y then goto terminate $ bell[0,7] = 7 $ ws := "write sys$output" $ iq := inquire/nopunctuation $ $ ! The Internet address for sending the search file is $ ! BLAST@GENBANK.BIO.NET $ $ define/nolog search_address "smtp%""blast@genbank.bio.net"" $ $ ws "" $ ws "This procedure initiates a BLAST search for similarity between" $ ws "your query sequence and one of the databases maintained by GenBank." $ ws "The information required for executing the search is sent to" $ ws "GenBank via electronic mail and is executed by the GenBank people" $ ws "themselves. Their databases are more current than our local" $ ws "ones, and their computer is very fast. The results of the search " $ ws "will be returned to you via e-mail." $ ws "" $ ws "The GenBank BLAST server searches both strands of your query" $ ws "sequence." $ ws "" $ ws "" $ $ ! TOFASTA propts for the sequence name (if not specified on the command $ ! line) and the region to search. It does all the error checking and $ ! returns all the relevant info to this procedure via global symbols. $ $ assign/usermode tt: sys$input $ tofasta/seqinfo/noreverse 'p1' $ if(seqinfotype.EQS."NONE") then exit ! Error from within Tofasta $ $get_database: $ $ ! If the sequence is DNA, the GenBank and EMBL databases are $ ! available. IF it is protein, the user can choose between the $ ! PIR and SWISSPROT databases. $ $ if(seqinfotype.EQS."PROTEIN") $ then $ ws "" $ ws "Database to search:" $ ws "" $ ws "1) Swiss-Prot" $ ws "2) PIR (NBRF)" $ ws "" $ iq choice "Please enter your choice (* 1 *): " $ if(choice.EQS."") then choice := 1 $ database := "" $ if(choice.EQS."1") then database := "SWISS-PROT" $ if(choice.EQS."2") then database := "PIR" $ if(database.NES."") then goto summarize $ ws "''bell'Valid responses are 1 or 2." $ goto get_database $ else $ ws "" $ ws "Database to search:" $ ws "" $ ws "1) GenBank" $ ws "2) EMBL" $ ws "" $ iq choice "Please enter your choice (* 1 *): " $ if(choice.EQS."") then choice := 1 $ database := "" $ if(choice.EQS."1") then database := "GENBANK" $ if(choice.EQS."2") then database := "EMBL" $ if(database.NES."") then goto summarize $ ws "''bell'Valid responses are 1 or 2." $ goto get_database $ endif $ $summarize: $ $ ws "" $ ws "" $ ws "The following BLAST search will be executed:" $ ws "" $ ws "Query sequence: ''seqinfoiname' from ''seqinfostart' to ", - "''seqinfoend' of ''seqinfolength' (''seqinfotype')" $ ws "Database to be searched: ''database'" $ ws "" $ iq choice "Are these parameters correct (* Yes *)? " $ choice = f$extract(0, 1, choice) $ if(choice.EQS."") then goto do_it $ if(choice.EQS."Y") then goto do_it $ $ ! Something is wrong. Give the chance to correct it, or give up. $ $ask_repeat: $ $ ws "" $ ws "Do you want to" $ ws "" $ ws "1) Try again" $ ws "2) Give up" $ ws "" $ iq choice "Please enter the number of your choice (* 1 *): " $ if(choice.eqs."") then goto get_query $ if(choice.eqs."1") then goto get_query $ if(choice.eqs."2") then exit $ ws "''bell'Wasn't the question simple enough for you?" $ goto ask_repeat $ $do_it: $ $ ! Write the commands to the BLAST server to a file, append the $ ! sequence to it, then mail it to GenBank. $ $ open/write outfile blastmail.txt $ if(seqinfotype.EQS."DNA") $ then $ write outfile "BLASTPROGRAM blastn" $ write outfile "DATALIB ''database'" $ else $ write outfile "BLASTPROGRAM blastp" $ write outfile "DATALIB ''database'" $ endif $ write outfile "BEGIN" $ close outfile $ convert/append 'seqinfooname' blastmail.txt $ $ ! Mail the file away. $ $ ws "Mailing the file to GenBank..." $ $ mail blastmail.txt search_address $ deassign search_address $ $ ws "The file ''seqinfoiname' has been sent to GenBank." $ ws "The results will be mailed back to you in a few minutes." $ ws "" $ $ delete 'seqinfooname';0 $ delete blastmail.txt;* $ $ terminate: ! Jump here on ^y. Don't do any cleanup $ $ exit ******************** $ ! BLASTMAILQ.COM $ $ ! September 8, 1991 $ $ ! Written by Steve Clark $ ! The Ontario Cancer Institute, Toronto, Canada $ $ ! Command procedure to mail to the GenBank BLAST server a query about $ ! the status of the searches waiting in the queues. $ ! It works by constructing a text file containing the single word $ ! "QUEUE" and mailing it to BLAST@GENBANK.BIO.NET $ $ ! NOTE: The symbol SEARCH_ADDRESS is the network address for the $ ! GenBank search service. This will have to be changed to accomodate $ ! local gateways, etc. $ $ define/nolog search_address "smtp%""blast@genbank.bio.net"" $ $ ws := "write sys$output" $ ws "" $ ws "This procedure inquires about the status of the BLAST server" $ ws "at GenBank via electronic mail so you can determine where your" $ ws "searches are waiting in the queues." $ ws "" $ inquire/nopunctuation choice "Do you want to continue (* Yes *)? " $ if(choice.EQS."") then choice := YES $ if(f$extract(0,1,choice).NES."Y") then exit $ ws "" $ ws "Constructing message file..." $ open/write comfile search_queue.txt $ write comfile "QUEUE" $ close comfile $ ws "Mailing message file..." $ mail search_queue.txt search_address $ deassign search_address $ delete search_queue.txt;* $ ws "" $ ws "GenBank will mail back the search queue status shortly." $ exit ******************** !*** TOFASTA *********************************************************** !* !* This program converts a standard GCG file sequence to FASTA format, !* as required by the GenBank BLAST server for database searching. With !* no command line switches, the program asks for the sequence filename, !* the regions to convert, and whether or not it should be reversed. The !* output filename is root.FASEQ. !* !* The following command line switches are available: !* !* /INfile = filename Suppresses the request for the input filename !* /NOREVerse Forces the top strand to be output !* /SEQINFO Sets symbols that can be used in command shells: !* SEQINFOINAME Input sequence filename !* SEQINFOONAME Output sequence filename !* SEQINFOTYPE "PROTEIN", "DNA", or "NONE" on error !* SEQINFOSTART !* SEQINFOEND !* SEQINFOLENGTH !* SEQINFOREV !* !* Written by Steve Clark September 7, 1991 !* !************************************************************************* program tofasta implicit none integer infile, lseq, rpos, lpos, l, i integer inttostr, str_len, revseq, getstring character inname(256), outname(256), seq(100001), text(33) byte bytename(256) logical seqinfo, logstatus, reverse logical clnoarg, isprotein, dclsetsymbol, clgetoldfname c Check for the command line switch /SEQINFO to see if the symbols should c be set for using in a command shell. seqinfo = .false. if(clnoarg('SEQINFO')) then seqinfo = .true. logstatus = dclsetsymbol('seqinfotype', 'NONE') endif if(.not.seqinfo) then call writef( & '\nTOFASTA converts a GCG format sequence to the native Fasta format.\n') endif c Look for the input filename on the command line. If not found, ask for it. if(.not.clgetoldfname('INfile', 1, inname)) then call writef('\nTOFASTA of what GCG sequence? ') if(getstring(inname).eq.0) stop ' ' endif c Open the file and read in the sequence. call openfile(infile, inname, 'rdb') call readseq(infile, seq, lseq) call closef(infile) c Get the range and revere if not prevented by the command line argument. call getrange(lpos, rpos, lseq) reverse = .false. if(.not.clnoarg('NOREVERSE')) call getreverse(reverse) c We have all the info we need. Calculate the output filename. call strcopy(outname, inname) call newfiletype(outname, '.faseq') c Set the SEQINFO symbols if required. if(seqinfo) then logstatus = dclsetsymbol('seqinfoiname', inname) logstatus = dclsetsymbol('seqinfooname', outname) l = inttostr(lpos, text) logstatus = dclsetsymbol('seqinfostart', text) l = inttostr(rpos, text) logstatus = dclsetsymbol('seqinfoend', text) l = inttostr(lseq, text) logstatus = dclsetsymbol('seqinfolength', text) logstatus = dclsetsymbol('seqinforev', 'FALSE') if(reverse) logstatus = dclsetsymbol('seqinforev', 'TRUE') logstatus = dclsetsymbol('seqinfotype', 'DNA') if(isprotein(seq)) & logstatus = dclsetsymbol('seqinfotype', 'PROTEIN') endif c Open the output file and write the first line which consists of a ">" and c the sequence name. This is followed by a space and the region that was c included in the conversion. l = str_len(outname) do i=1, l bytename(i) = ichar(outname(i)) enddo open (unit=1, file=bytename, type='new', carriagecontrol='list') if(.not.reverse) then write(1,1010) (outname(i), i=1, l), lpos, rpos 1010 format('>', a1, ' From', i6, ' to', i6) else write(1,1011) (outname(i), i=1, l), lpos, rpos 1011 format('>',a1,' From',i6,' to',i6,' Reverse orientation') l = revseq(seq, lpos, rpos) endif c Now write out the sequence, 70 characters to a line with no spaces. do while (lpos.le.rpos) l = min(lpos+69, rpos) write(1,1020) (seq(i), i=lpos, l) 1020 format(70a1) lpos = l + 1 enddo close (unit=1) if(.not.seqinfo) call writef('\nSequence written to %s.\n', outname) stop ' ' end .