CAVEAT: This is a demonstration of the potential of grid computing in bioinformatics. Currently it does little that is functional; hopefully in the Spring 2002 time frame it will have some usable functions for bioinformatics. Please view it with an eye to what can be, rather than what is, of use, and don't look too closely at the bugs and limitations.
This program requires a Java runtime (java or jre), preferably version 1.3 or later. It will likely work with version 1.2, but not with version 1.1.
This Java version requirement precludes its use on Macintosh OS 9 or earlier, as Apple Computer will not make a Java version 1.2 or later available for those systems. You can use this program on MacOS X, where it works well.
The easiest way to run this program is with Java Web Start, if that is installed on your computer. Web Start is a standard part of Macintosh OS X 10.1. Web Start software is available for MS Windows and Unix systems at http://java.sun.com/products/javawebstart/. To use Web Start for launching BioGridRunner, find the Web Start ".jnlp" script here.
You will probably find source code (Java 1.2/1.3) for BioGridRunner included with the author's Java source library at ftp://iubio.bio.indiana.edu/molbio/java/source/ as iubiojava-src.zip. This is a work in progress, done on a fluctuating schedule; updates will be available as time permits.
To run the program from a command line without Web Start, use:
java -cp lib/biogridrun.jar:lib/readseq.jar:lib/xerces.jar:lib/cog912all.jar iubio.grid.app
(On MS DOS, change each / to \ and each : to ; in the Unix form above.) If the program fails to
run, check that you have the above .jar files.
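The separator change between Unix and MS DOS can be left to Java itself. The following is a minimal sketch, assuming the four jar files named in the command above sit in a lib folder. The class and method names are hypothetical and are not part of BioGridRunner.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds the -cp argument for a given platform.
class ClasspathSketch {

    // Joins libDir and each jar with fileSep, then joins the entries with pathSep.
    static String classpath(String libDir, List<String> jars, char fileSep, char pathSep) {
        return jars.stream()
                .map(jar -> libDir + fileSep + jar)
                .collect(Collectors.joining(String.valueOf(pathSep)));
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
                "biogridrun.jar", "readseq.jar", "xerces.jar", "cog912all.jar");
        // File.separatorChar and File.pathSeparatorChar take the values
        // of the platform this code runs on.
        String cp = classpath("lib", jars, File.separatorChar, File.pathSeparatorChar);
        System.out.println("java -cp " + cp + " iubio.grid.app");
    }
}
```

Run on Unix, this prints the command in the form given above; on MS Windows the same code emits backslashes and semicolons.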
This client program aims to make it easy to find information and to move it from there to here, or from there to elsewhere. Each resource has a URL, of the kind familiar from web hyperlinks, but extended to non-web GRID Internet resources. These URLs are the "name" attached to an object, whether data, software, computer disk or other resource. In this grid-runner, you can find these in directories, and move them among places using Drag'n'Drop methods to pull a URL from here to there.
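To illustrate how one URL form can name both web and non-web resources, the sketch below splits a few URLs quoted elsewhere in this document into scheme, host, port and path, using the standard java.net.URI class. ResourceNameSketch and describe are hypothetical names chosen for this example; how BioGridRunner itself handles URLs is not shown here.

```java
import java.net.URI;

// Hypothetical helper: shows the parts of a resource URL.
class ResourceNameSketch {

    // Returns a one-line summary of the URL's scheme, host, optional port and path.
    static String describe(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort(); // -1 when the URL names no port
        return uri.getScheme()
                + " host=" + uri.getHost()
                + (port >= 0 ? " port=" + port : "")
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        String[] samples = {
            "ldap://iubio.bio.indiana.edu/",
            "ftp://bio-mirror.jp.apan.net/pub/biomirror/",
            "ldap://eugenes.org:3891/"
        };
        for (String url : samples) {
            System.out.println(describe(url));
        }
    }
}
```

The same parsing works for any scheme, which is what lets a single "name" cover FTP sites, LDAP directories and web pages alike.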
Another basic part of this program is the ability to run other programs, given a description of how those programs operate. In this case we focus on command-line programs for bioinformatics, such as Clustal W (sequence alignment), the EMBOSS and GCG sequence analysis packages, and others. BioGridRunner uses descriptions of how these programs run - their input data, command-line program options, and outputs. Given this description (now in an XML format "BIX" command script), this program lets you run such bio-apps with a form or dialog to select options easily, and to select or drag'n'drop data for input into the program. You can choose to run these programs on your own computer or on ones where you have GRID credentials to run programs.
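The idea of turning a program description into a command line can be sketched as follows. This is only an illustration under stated assumptions: the BIX format is not documented here, so the description is stood in for by a plain ordered map of option names to values, and CommandLineSketch and buildArgv are hypothetical names. The options shown (-infile, -align, -outfile) follow the ClustalW command-line convention; the output file name clustalw.aln is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns selected options into an argument list.
class CommandLineSketch {

    // An option with an empty or null value becomes a bare flag ("-align");
    // any other option becomes "-name=value".
    static List<String> buildArgv(String program, Map<String, String> options) {
        List<String> argv = new ArrayList<>();
        argv.add(program);
        for (Map.Entry<String, String> option : options.entrySet()) {
            String value = option.getValue();
            if (value == null || value.isEmpty()) {
                argv.add("-" + option.getKey());
            } else {
                argv.add("-" + option.getKey() + "=" + value);
            }
        }
        return argv;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in the order they were selected.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("infile", "clustalw.pir");
        options.put("align", "");
        options.put("outfile", "clustalw.aln"); // assumed output name
        System.out.println(String.join(" ", buildArgv("/bio/mb/bin/clustalw", options)));
    }
}
```

Keeping the arguments as a list, rather than one string, avoids quoting problems when a file name contains spaces.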
Security and authenticated use of data and resources are a basic part of the GRID methods included here. Directories of data may include collaborative projects where you and others share data in a secure, authenticated way.
The program uses Drag'n'Drop methods, which will be improved so that you can move things by their URL between directories and computers and into program jobs. Note that these drag and drop methods work between this program and others on your computer. You can drag files from your computer windows (Finder, MS Explorer) into this app, or drag URLs out of this app into a web browser or another program.
The application window has a menu with File, Grid Options, Help and Windows.
The Lightweight Directory Access Protocol (LDAP) is being tested here as a primary way of organizing federations of bioinformatic data and software. The test services at ldap://iubio.bio.indiana.edu/ include directories of gigabytes of bioinformatic data from the Bio-Mirror project, software cataloged at the IUBio Archive, and genome data from the euGenes eukaryote genome service.
These services are changing, and more will be added, for testing of GRID-based data and software search and retrieval. One hope of this pilot test is to find methods whereby you can use this BioGridRunner to semi-automatically find current data of interest to you, and the software needed to analyze it, and move those data and software (with simple "Drag-n-Drop" visual methods) to the computer(s) you want to use them on. Much behind-the-scenes programming and information engineering is needed for this to work, but the LDAP, GRID and related tool sets make this all feasible now.
LDAP looks like an important method for automating the finding, searching and accessing of current data in biology. LDAP provides means for searching among many computers, including globally linked ones, in ways that cannot be achieved with current Web or other means. It has been developed over the past decade for use with directories of people, computer resources, and other information. It has a range of methods for searching, joining disparate information sources, and defining information objects and attributes, and it offers security and widespread software support.
Each directory window includes
Any directory object can be opened in a fashion consistent with its contents. An object with HTML contents will be displayed for reading; an object of biosequence contents will be displayed in a sequence viewer; an image file will be displayed graphically, and so forth. Complex objects are handled by software adaptors for specific functions, such as the BIX bioinformatics command objects which display as a command dialog allowing you to run the program. By default, any unknown object type can be copied (URL Copy) from one place to another.
Each directory, or any node within the directory, can be searched by its attributes. There are several menu commands pertaining to directory searching, which are currently in test mode, and may not function properly yet.
With LDAP directories, this search ability can span any number of remotely linked directory services. This ability allows one to find needed data or resources without knowing their location.
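A directory search of this kind can be sketched with the JNDI classes that ship with Java. This is a minimal sketch, not BioGridRunner's own code: LdapSearchSketch and its methods are hypothetical names, the result limit of 50 is arbitrary, and the sketch assumes the server permits anonymous searches from its root. Following referrals is what allows one query to continue into remotely linked directory services.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Hypothetical helper: searches an LDAP directory by attribute filter.
class LdapSearchSketch {

    // JNDI settings for an anonymous connection that follows referrals
    // to other, linked directory servers.
    static Hashtable<String, Object> environment(String ldapUrl) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, ldapUrl);
        env.put(Context.REFERRAL, "follow");
        return env;
    }

    // Search the whole subtree below the starting node, up to maxResults entries.
    static SearchControls subtreeControls(long maxResults) {
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setCountLimit(maxResults);
        return controls;
    }

    // Connects, runs the filter, and returns the names of matching entries.
    static List<String> search(String ldapUrl, String filter) throws NamingException {
        DirContext ctx = new InitialDirContext(environment(ldapUrl));
        try {
            NamingEnumeration<SearchResult> results =
                    ctx.search("", filter, subtreeControls(50));
            List<String> names = new ArrayList<>();
            while (results.hasMore()) {
                names.add(results.next().getName());
            }
            return names;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        if (args.length < 2) {
            System.out.println("usage: LdapSearchSketch <ldap-url> <filter>");
            return;
        }
        for (String name : search(args[0], args[1])) {
            System.out.println(name);
        }
    }
}
```

The network connection is made only when a URL and filter are given on the command line, for example an LDAP URL and a filter such as "(info=*align*)".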
You can type a new URL into the editable line atop the directory window to open a new item. This line also holds a pop-up list of sample directory URLs. The draggable URL label in a directory is one that you can use to drag the current URL to another window. The Open/Close button controls whether a directory is actively connected to its source - e.g., an anonymous FTP server. Examples of LDAP and FTP directories for bioinformatics include
After finding a suitable data package and site from the Bions server,
one can use the url attribute to open that data, as in this example
screen of the Bio-Mirror data service at
ftp://bio-mirror.jp.apan.net/pub/biomirror/
A test of genome data is available from ldap://eugenes.org:3891/
This service lets one search and retrieve genome features for several eukaryote
model organisms of the euGenes service.
Features can be searched by species, chromosome, feature kind, and sequence
map range.
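A search that combines species, chromosome, feature kind and map range corresponds to an LDAP "and" filter. The sketch below composes such a filter string in the standard LDAP filter syntax. The attribute names (species, chromosome, kind, start, stop) are assumptions made for illustration, since the euGenes directory schema is not given here, and the range comparison assumes the server orders those attributes numerically.

```java
// Hypothetical helper: composes an LDAP "and" filter for a genome feature search.
class FeatureFilterSketch {

    // Escapes the characters that have special meaning inside an
    // LDAP filter value: backslash, asterisk, parentheses and NUL.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char ch : value.toCharArray()) {
            switch (ch) {
                case '\\': sb.append("\\5c"); break;
                case '*':  sb.append("\\2a"); break;
                case '(':  sb.append("\\28"); break;
                case ')':  sb.append("\\29"); break;
                case '\0': sb.append("\\00"); break;
                default:   sb.append(ch);
            }
        }
        return sb.toString();
    }

    // All five conditions must hold for an entry to match.
    // The attribute names are assumed, not taken from the euGenes schema.
    static String featureFilter(String species, String chromosome, String kind,
                                long from, long to) {
        return "(&"
                + "(species=" + escape(species) + ")"
                + "(chromosome=" + escape(chromosome) + ")"
                + "(kind=" + escape(kind) + ")"
                + "(start>=" + from + ")"
                + "(stop<=" + to + ")"
                + ")";
    }

    public static void main(String[] args) {
        System.out.println(featureFilter("fly", "2L", "gene", 1000, 50000));
    }
}
```

Escaping the values keeps a stray parenthesis or asterisk in user input from changing the meaning of the filter.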
For example, in a directory window locate
popular biosequence data for BLAST searches
from a Bio-Mirror or other anonymous ftp source, such as
ftp://iubio.bio.indiana.edu/biomirror/blast/
Then select the desired data file, such as est.Z, the current EST biosequence
databank in compressed form. Open the url to this data by selecting the
menu command Directory Options/Open Node, or by double clicking its name in the
directory window, or its "url" attribute. This brings up the URL Copy window
showing source and destination addresses for this data.
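The copy from a source address to a destination can be sketched with standard Java classes. This is a minimal sketch under stated assumptions: UrlCopySketch is a hypothetical name, the destination is a local file, and the source must use a scheme the Java runtime can open by itself (typically file:, http: and ftp:). Grid transport protocols are not covered, and the file is copied as-is, so a compressed file such as est.Z stays compressed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies the data behind a URL to a local file.
class UrlCopySketch {

    // Streams the source URL into the destination file, replacing any
    // existing file, and returns the number of bytes copied.
    static long copy(String sourceUrl, Path destination) throws IOException {
        try (InputStream in = URI.create(sourceUrl).toURL().openStream()) {
            return Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: UrlCopySketch <source-url> <destination-file>");
            return;
        }
        long bytes = copy(args[0], Paths.get(args[1]));
        System.out.println("copied " + bytes + " bytes to " + args[1]);
    }
}
```

Streaming the data, rather than reading it all into memory first, matters for databank files that run to many megabytes.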
The credentials dialog allows you to create a proxy certificate for use with grid jobs, either from your locally installed certificates or by obtaining one from a MyProxy server.
For testing purposes, the MyProxy server oat.bio.indiana.edu will
provide certificate proxies for anonymous use. These proxies have
limited duration, and are available for the user named "anonymous", with
password "anonymous". Servers currently accepting this proxy include
"microbe.bio.indiana.edu".
The Install Application function of BioGridRunner will create a samples folder that includes sample data files, and also the Bioinformatics command descriptions (BIX format, to be documented, in XML format). One way to ensure that the programs installed on your computer are found easily by BioGridRunner is to edit these bix.xml command files and set the path to match that on your computer. This install path problem will be addressed in later updates. Edit these path settings to match your install path:
samples.bix.xml:
  <env>expath=/bio/mb/bin/</env>
  <env>filepath=/bio/mb/</env>
  <env>datapath=/bio/mb/data/</env>
  <env>TACGLIB=/bio/mb/data/</env>
  <env>docpath=/bio/mb/docs/</env>

emboss.bix.xml:
  <env>EMBOSSDIR=/bio/mb/emboss/</env>
  <env>empath=/bio/mb/emboss/bin/</env>
  <env>EMBOSS_ACDROOT=/bio/mb/emboss/share/EMBOSS/acd</env>
  <env>PLPLOT_LIB=/bio/mb/emboss/share/EMBOSS</env>
  <env>PATH=/bio/mb/emboss/bin/:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/local/sbin</env>

ncbitools.bix.xml:
  <env>NCBI=/bio/mb/ncbi/ncbirc</env>
  <env>ncbipath=/bio/mb/ncbi/bin/</env>
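For readers who prefer to check these settings programmatically, the sketch below reads env elements of the form name=value and shows what they would look like under a different install root. It is an illustration only: the BIX format is not yet documented, so the enclosing bix root element is an assumption, and BixEnvSketch, readEnv and relocate are hypothetical names. The sketch does not write the file back.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <env>name=value</env> settings from XML text.
class BixEnvSketch {

    // Collects every <env> element, in document order, as name -> value.
    static Map<String, String> readEnv(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("env");
            Map<String, String> env = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                int eq = text.indexOf('=');
                if (eq > 0) {
                    env.put(text.substring(0, eq), text.substring(eq + 1));
                }
            }
            return env;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse env settings", e);
        }
    }

    // Returns a copy in which every occurrence of oldRoot is replaced by newRoot.
    static Map<String, String> relocate(Map<String, String> env, String oldRoot, String newRoot) {
        Map<String, String> moved = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : env.entrySet()) {
            moved.put(e.getKey(), e.getValue().replace(oldRoot, newRoot));
        }
        return moved;
    }

    public static void main(String[] args) {
        // The <bix> root element is assumed for this example.
        String xml = "<bix><env>expath=/bio/mb/bin/</env>"
                + "<env>datapath=/bio/mb/data/</env></bix>";
        // "/opt/bio/" is an example install root, not a recommended location.
        relocate(readEnv(xml), "/bio/mb/", "/opt/bio/")
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```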
To use ClustalW, you need to locate a data file of un-aligned
sequences, such as found in the BioGridRunner samples folder.
This samples folder contains the clustalw.pir file of 5 test sequences
for ClustalW. From the directory window of samples, open the clustalw.pir
file to display its sequences, as shown below.
The steps in running Clustal W involve opening (Directory options/Open Node) the job dialog for clustalw from the Sample Apps directory of programs. Then enter or drag-and-drop the file URL for clustalw.pir data into this job dialog "Input data" field. Other options for alignment can be selected in this job dialog.
At the top of this job dialog, there are buttons to Run, Test and select the Job Host. The name "localhost" designates the computer you are working on. Alternatively, you can select a grid-aware server computer to run the job on. This also requires that you establish grid credentials for use of that computer (see above). At this writing, the MyProxy server at oat.bio.indiana.edu will provide anonymous users with credentials to run a few bioinformatics jobs on grid test computers at the Indiana University Center for Genomics and Bioinformatics. The "Test" button verifies that you are allowed to run a job on the computer, and that the program, e.g. ClustalW, is available at the path specified.
This preliminary test comes with the Sample apps configured with a system path
of '/bio/mb/bin', so that ClustalW should be located at /bio/mb/bin/clustalw. If
you have ClustalW in another folder, you can edit the environment settings for this
job, using the Job menu File/Edit Environment option. See also the Job Options/Edit
command line option for another way to set the command line for this job.
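For a job on localhost, the check that a program sits at the configured path can be sketched as below. This is not the implementation behind the Test button, which also covers remote hosts and credentials; ProgramCheckSketch and its methods are hypothetical names, and only the local file system is examined.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: checks that a program exists under the configured path.
class ProgramCheckSketch {

    // Combines the configured folder (the expath setting) with the program name.
    static Path resolve(String expath, String program) {
        return Paths.get(expath).resolve(program);
    }

    // True only if the file exists and the current user may execute it.
    static boolean isAvailable(String expath, String program) {
        return Files.isExecutable(resolve(expath, program));
    }

    public static void main(String[] args) {
        String expath = args.length > 0 ? args[0] : "/bio/mb/bin/";
        Path clustalw = resolve(expath, "clustalw");
        System.out.println(clustalw + (isAvailable(expath, "clustalw")
                ? " is available"
                : " was not found or is not executable"));
    }
}
```

Pass the folder where ClustalW is installed as the first argument to check a location other than the default.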
When the job is run, your selected data and program parameters are sent
on to the program. If the program is on a remote computer, data are transported
with grid transport protocols in a secure fashion to your remote account
work folder. Status of the job is shown at the bottom of the job dialog, and
when the job finishes, all resulting data files and job output are returned to
your computer for display or further analysis, as shown below for ClustalW.
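The local half of this cycle (start the program with its arguments and environment, wait for it, and collect its output and exit status) can be sketched with the standard ProcessBuilder class. This is a minimal sketch for the localhost case only; remote jobs and grid transport are not covered, and LocalJobSketch, Result and run are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: runs a command-line job on the local computer.
class LocalJobSketch {

    // Exit status and combined standard output/error of a finished job.
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    // Starts the program, adds the extra environment settings, reads all of
    // its output, and waits for it to finish. workDir may be null.
    static Result run(List<String> argv, Map<String, String> extraEnv, File workDir)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(argv);
        builder.environment().putAll(extraEnv);
        if (workDir != null) {
            builder.directory(workDir);
        }
        builder.redirectErrorStream(true); // merge error messages into the output
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        return new Result(process.waitFor(), output.toString());
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: LocalJobSketch <program> [arguments...]");
            return;
        }
        Result result = run(Arrays.asList(args), Collections.<String, String>emptyMap(), null);
        System.out.print(result.output);
        System.out.println("exit status: " + result.exitCode);
    }
}
```

An exit status of 0 conventionally means the program finished without error; any files it wrote remain in the work folder.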
These windows showing the data selection and job dialog for ClustalW are
shown all together below.
For a more complex example, using the EMBOSS suite of bioinformatics programs,
one can search the program
directory for commands dealing with a topic of interest, such as sequence alignments, using
a search filter "(info=*align*)". Those commands matching this search filter are
listed; you can then select one to run, or learn more about it from its help
documentation, as shown below.