BioGridRunner
Bioinformatics Data Grid application

CAVEAT

This is a work in progress; an experimental program. No warranty is made for its operations. It includes methods which view and manipulate files, on your computer and others. The author believes it is safe for testing, but cannot promise use will not damage your files or data inadvertently.

CAVEAT² This is a demonstration of the potential of grid computing in bioinformatics. Currently it does not do much that is functional; hopefully in the Spring 2002 time frame it will have some usable functions for bioinformatics. Please view it with an eye to what can be of use, rather than what is, and don't look too closely at the bugs and limitations.

Introduction

This is a distributed computing application for bioinformatics, incorporating directory services (data and software), grid computing methods (security, authentication, data transport and remote jobs), and gene sequence and genomic data processing methods.

This program requires a Java runtime (java or jre) program, preferably version 1.3 or later. It likely will work with version 1.2, but not version 1.1.

This Java version requirement precludes its use on Macintosh OS 9 or earlier, as Apple Computer will not make available a Java version 1.2+ for those systems. You can use this program on MacOS X, where it works well.

Fetching

The current home of this package is at http://iubio.bio.indiana.edu/grid/runner/

The easiest way to run this is with Java Web Start, if that is installed on your computer. This is a standard part of Macintosh OS X 10.1. Web Start software is available for MS Windows and Unix systems at http://java.sun.com/products/javawebstart/ To use Web Start for launching BioGridRunner, find the Web Start ".jnlp" script here

You will probably find source code (Java 1.2/1.3) for BioGridRunner included with the author's Java source library at ftp://iubio.bio.indiana.edu/molbio/java/source/ as iubiojava-src.zip. This is a work in progress, done on a fluctuating schedule; updates will be available as time permits.

Starting

If you use Web Start, that includes methods to check for software updates, download all the Java archive files needed, and launch the program. If you prefer not to use this method, there are command-line scripts for Unix and MS Windows in the home folder
Unix: http://iubio.bio.indiana.edu/grid/runner/biogridrun.sh
MSWin: http://iubio.bio.indiana.edu/grid/runner/biogridrun.bat
With these you also need to fetch the contents of the http://iubio.bio.indiana.edu/grid/runner/lib/ folder with its Java archives. Then you can run the program by running these scripts. For the hardy command-liner, the scripts are currently equivalent to this command


java -cp lib/biogridrun.jar:lib/readseq.jar:lib/xerces.jar:lib/cog912all.jar iubio.grid.app

(On MS Windows, change / to \ and : to ; in this command.) If the program fails to run, check that you have the above .jar files.
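
The separator swap described above can also be left to Java itself. The following is a minimal sketch, not part of BioGridRunner: the class name BioGridLaunch and its methods are hypothetical, and it is written for a current JDK rather than Java 1.3. It builds the same classpath with the platform's own separators and reports any of the four jars that are missing from lib/.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; jar names and main class are copied from the command above.
class BioGridLaunch {
    static final String[] JARS = {
        "biogridrun.jar", "readseq.jar", "xerces.jar", "cog912all.jar"
    };
    static final String MAIN_CLASS = "iubio.grid.app";

    /** Joins libDir/jar entries using the given file and path separators. */
    static String classpath(String libDir, String fileSep, String pathSep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < JARS.length; i++) {
            if (i > 0) sb.append(pathSep);
            sb.append(libDir).append(fileSep).append(JARS[i]);
        }
        return sb.toString();
    }

    /** Returns the jar names that are not present as files in libDir. */
    static List<String> missingJars(File libDir) {
        List<String> missing = new ArrayList<String>();
        for (String jar : JARS) {
            if (!new File(libDir, jar).isFile()) missing.add(jar);
        }
        return missing;
    }

    public static void main(String[] args) {
        // File.separator and File.pathSeparator are "/" and ":" on Unix, "\" and ";" on MS Windows.
        String cp = classpath("lib", File.separator, File.pathSeparator);
        System.out.println("java -cp " + cp + " " + MAIN_CLASS);
        for (String jar : missingJars(new File("lib"))) {
            System.out.println("missing: lib" + File.separator + jar);
        }
    }
}
```

Run from the folder that holds lib/, this prints the launch command for the current platform and lists any archive that still needs to be fetched.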

Functions

A central theme of this program is directories of information, bio-data and software, on your computer and on bioinformatics services around the globe.

This client program aims to make it easy to find information, and move it from there to here, or there to elsewhere. Each resource has a URL, such as we know of from web hyperlinks, but extending to non-web GRID Internet resources. These URLs are the "name" attached to an object, whether data, software, computer disk or other resource. In this grid-runner, you can find these in directories, and move them among places using Drag'n'Drop methods to pull a URL from here to there.

Another basic part of this program is the ability to run other programs, given a description of how those programs operate. In this case we focus on command-line programs for bioinformatics: such as Clustal W (sequence alignment), EMBOSS and GCG sequence analysis packages, and others. BioGridRunner uses descriptions of how these programs run - their input data, command-line program options, and outputs. Given this description (now in an XML format "BIX" command script), this program allows you to run such bio-apps with a form or dialog to select options easily, and select or drag'n'drop data for input into the program. You can select to run these programs on your own computer or ones you have GRID credentials to run programs on.

Security and authenticated data and resource use is a basic part of GRID methods included here. Directories of data may include collaborative projects where you and others share data in a secure, authenticated way.

The program uses Drag'n'Drop methods, and will improve these so you can move things by their URL between directories / computers and into program jobs. Note that these drag and drop methods work across this program and others on your computer. You can drag files from your computer windows (Finder, MS Explorer) into this app, or drag URLs out of this app into a web browser, or other.

The application window has a menu with File, Grid Options, Help and Windows.

Directories

Program operations center around directories of information, including data and program files on your computer, and on remote computers available through FTP (file transport) and GSIFTP (Grid secure file transport). Directories can be opened for these kinds of URLs:

local files and folders (file:)
LDAP information directory servers (ldap://)
anonymous file transfer servers (ftp://)
secure, authenticated grid file transfer (gsiftp://)
Bioinformatics command objects (bix://)
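
As an illustration of how a client might tell these URL kinds apart, here is a small sketch using the standard java.net.URI class. It is not BioGridRunner's own code: the class name DirectoryKinds and the method kindOf are hypothetical, and the descriptions are copied from the list above.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup from URL scheme to the directory kinds listed above.
class DirectoryKinds {
    static final Map<String, String> KINDS = new LinkedHashMap<String, String>();
    static {
        KINDS.put("file", "local files and folders");
        KINDS.put("ldap", "LDAP information directory servers");
        KINDS.put("ftp", "anonymous file transfer servers");
        KINDS.put("gsiftp", "secure, authenticated grid file transfer");
        KINDS.put("bix", "Bioinformatics command objects");
    }

    /** Returns the directory kind for a URL, or null if its scheme is not in the list. */
    static String kindOf(String url) {
        String scheme = URI.create(url).getScheme();
        return scheme == null ? null : KINDS.get(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] samples = {
            "ldap://iubio.bio.indiana.edu/",
            "ftp://bio-mirror.jp.apan.net/pub/biomirror/",
            "http://iubio.bio.indiana.edu/grid/runner/"
        };
        for (String url : samples) {
            String kind = kindOf(url);
            System.out.println(url + " : " + (kind == null ? "not a directory URL kind" : kind));
        }
    }
}
```

java.net.URI is used rather than java.net.URL because URL rejects schemes such as gsiftp and bix unless a protocol handler is installed for them.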

The lightweight directory protocol (LDAP) is being tested here as a primary way for organizing federations of bioinformatic data and software. The test services at ldap://iubio.bio.indiana.edu/ include directories of gigabytes of bioinformatic data from the Bio-Mirror project; software cataloged at the IUBio Archive; genome data from the euGenes eukaryote genome service.

These services are changing and will be extended for testing of GRID-based data and software search and retrieval. One hope of this pilot test is to find methods whereby you can use this BioGridRunner to semi-automatically find current data of interest to you, and the software needed to analyze it, and move those data and software (with simple "Drag-n-Drop" visual methods) to the computer(s) you want to use them on. There is much behind-the-scenes programming and information engineering needed for this to work, but the LDAP, GRID and related tool sets make this all feasible now.

LDAP looks like an important method for automating the finding, searching and accessing of current data in biology. LDAP provides means for searching among many computers, including globally linked ones, in ways that cannot be achieved with current Web or other means. It has been developed over the past decade for use with directories of people, computer resources, and other information. It has a range of methods for searching, joining disparate information sources, and defining information objects and attributes, and it offers security and widespread software support.

File/New Directory

This menu command allows you to open directories of information, data files, bioinformatics programs, and other networked and local resources that you want to compute with.

Each directory window includes

menu commands relevant to directories,
a URL line with

a drag-able URL label
an editable box for the current URL
an Open/Close button

a left panel listing directory items in tree form
a right panel showing item attributes

The item attribute panel will include various attribute kinds and values, including its name, its url, its class, and such things as size, content type, and others, depending on the class of the object selected in the directory window.

Any directory object can be opened in a fashion consistent with its contents. An object with HTML contents will be displayed for reading; an object of biosequence contents will be displayed in a sequence viewer; an image file will be displayed graphically, and so forth. Complex objects are handled by software adaptors for specific functions, such as the BIX bioinformatics command objects which display as a command dialog allowing you to run the program. By default, any unknown object type can be copied (URL Copy) from one place to another.

Each directory, or any node within the directory, can be searched by its attributes. There are several menu commands pertaining to directory searching, which are currently in test mode, and may not function properly yet.

With LDAP directories, this search ability can span any number of remotely linked directory services. This ability allows one to find needed data or resources without knowing their location.

You can type a new URL into the editable line atop the directory window for a new item to open. This also holds a pop-up list of sample directory URLs. The drag-able URL label in a directory is one that you can use to drag the current URL to another window. The Open/Close button controls whether a directory is actively connected to its source - e.g., an anonymous FTP server. Examples of LDAP and FTP directories for bioinformatics include

Bioinformatics name service test

This test service at ldap://iubio.bio.indiana.edu:3891/o=Bions is an example collection of bioinformatics data packages and sites for those packages.

After finding a suitable data package and site from the Bions server, one can use the url attribute to open that data, as in this example screen of the Bio-Mirror data service at ftp://bio-mirror.jp.apan.net/pub/biomirror/

A test of genome data is available from ldap://eugenes.org:3891/ This service lets one search and retrieve genome features for several eukaryote model organisms of the euGenes service. Features can be searched by species, chromosome, feature kind, and sequence map range.

URL Copy

This is a simple way to copy any directory or network object from one location to another, whether on your computer or other Internet computers. The directory / network protocols currently supported include

local files and folders (file:)
anonymous file transfer servers (ftp://)
secure, authenticated grid file transfer (gsiftp://)

Copy support for Web (http://) and directory (ldap://) URLs is planned. The method uses "third party transfers" so that if you copy between two remote computers, the data efficiently moves between those two without touching your computer.
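
The rule in the preceding paragraph can be written out as a small decision function. This is a sketch under stated assumptions, not the program's actual logic: the class UrlCopyPlan and the labels it returns are hypothetical, the gsiftp destination path in main is made up for illustration, and the sketch ignores whether a given pair of servers really permits a third party transfer.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical planner: decides how a URL Copy between two URLs would be routed.
class UrlCopyPlan {
    // Schemes listed above as currently supported for copying.
    static final Set<String> COPY_SCHEMES =
        new HashSet<String>(Arrays.asList("file", "ftp", "gsiftp"));

    static String scheme(String url) {
        String s = URI.create(url).getScheme();
        return s == null ? "" : s.toLowerCase(Locale.ROOT);
    }

    /** Returns "unsupported", "third-party", "local" or "one-end-local". */
    static String plan(String source, String dest) {
        String s = scheme(source);
        String d = scheme(dest);
        if (!COPY_SCHEMES.contains(s) || !COPY_SCHEMES.contains(d)) return "unsupported";
        boolean srcRemote = !s.equals("file");
        boolean dstRemote = !d.equals("file");
        if (srcRemote && dstRemote) return "third-party"; // data never touches this computer
        if (!srcRemote && !dstRemote) return "local";
        return "one-end-local";                           // plain download or upload
    }

    public static void main(String[] args) {
        String src = "ftp://iubio.bio.indiana.edu/biomirror/blast/est.Z";
        // The destination path below is an assumption for illustration only.
        System.out.println(plan(src, "gsiftp://microbe.bio.indiana.edu/tmp/est.Z"));
        System.out.println(plan(src, "file:/tmp/est.Z"));
    }
}
```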

For example, in a directory window locate a popular set of biosequence data for BLAST searches from a Bio-Mirror or other anonymous ftp source, such as ftp://iubio.bio.indiana.edu/biomirror/blast/

Then select the desired data file, such as est.Z, the current EST biosequence databank in compressed form. Open the URL to this data by selecting the menu command Directory Options/Open node, or by double-clicking its name in the directory window or its "url" attribute. This brings up the URL Copy window showing source and destination addresses for this data.

Grid options/Grid Credentials

To use data and programs on remote computers, you first need secure, authenticated access to them. This application uses the Globus GRID package for that purpose. You need to install grid credentials, in the form of digital certificates signed by a Certificate Authority, on the computers you will use for secure access.

The credentials dialog allows you to create a proxy certificate for use with grid jobs, either from your locally installed certificates or obtaining one from a MyProxy server.

For testing purposes, the MyProxy server oat.bio.indiana.edu will provide certificate proxies for anonymous uses. These proxies have limited duration, and are available for user named "anonymous", with password "anonymous". Servers currently accepting this proxy include "microbe.bio.indiana.edu".

Grid options/Grid Job

Included is a basic grid job dialog from the Globus COG toolkit. This dialog allows you to run command-line jobs on any grid compute server that accepts your credentials.

Bioinformatics program dialogs

Installing bioinformatics apps

A necessary prerequisite to running ClustalW or other bioinformatics programs on your computer is installing these. For this preliminary test, you can find these programs for installing at
ftp://iubio.bio.indiana.edu/biogrid/bioapps/
bin/ -- compiled binaries
src/ -- source code

The Install Application function of BioGridRunner will create a samples folder that includes sample data files, and also the Bioinformatics command descriptions (BIX format, to be documented, in XML format). One way to ensure that the programs installed on your computer are found easily by BioGridRunner is to edit these bix.xml command files and set the path to match that on your computer. This install path problem will be addressed in later updates. Edit these path settings to match your install path:

samples.bix.xml:
<env>expath=/bio/mb/bin/</env>
<env>filepath=/bio/mb/</env>
<env>datapath=/bio/mb/data/</env>
<env>TACGLIB=/bio/mb/data/</env>
<env>docpath=/bio/mb/docs/</env>

emboss.bix.xml: 
<env>EMBOSSDIR=/bio/mb/emboss/</env>
<env>empath=/bio/mb/emboss/bin/</env>
<env>EMBOSS_ACDROOT=/bio/mb/emboss/share/EMBOSS/acd</env>
<env>PLPLOT_LIB=/bio/mb/emboss/share/EMBOSS</env>
<env>PATH=/bio/mb/emboss/bin/:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/local/sbin</env>

ncbitools.bix.xml:
<env>NCBI=/bio/mb/ncbi/ncbirc</env>
<env>ncbipath=/bio/mb/ncbi/bin/</env>
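
Editing these settings by hand is simple enough, but the change is mechanical: every sample path starts with /bio/mb/. The following sketch shows one way to rewrite that prefix. It assumes only what the listing above shows, namely lines of the form <env>NAME=path</env>; the BIX format is not yet documented, so real files may need more care. The class BixEnvPaths, the method relocate and the new root /opt/bio/ are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites the install root at the start of <env>NAME=... values.
class BixEnvPaths {
    /**
     * Replaces oldRoot with newRoot where it directly follows "<env>NAME=".
     * Later entries in a multi-path value such as PATH are left untouched.
     */
    static String relocate(String text, String oldRoot, String newRoot) {
        Pattern p = Pattern.compile("(<env>[^=<]+=)" + Pattern.quote(oldRoot));
        return p.matcher(text).replaceAll("$1" + Matcher.quoteReplacement(newRoot));
    }

    public static void main(String[] args) {
        String sample =
            "<env>expath=/bio/mb/bin/</env>\n" +
            "<env>EMBOSSDIR=/bio/mb/emboss/</env>\n" +
            "<env>PATH=/bio/mb/emboss/bin/:/usr/bin:/bin</env>\n";
        // "/opt/bio/" stands in for wherever the programs are installed on your computer.
        System.out.print(relocate(sample, "/bio/mb/", "/opt/bio/"));
    }
}
```

To apply it to a real file, read the contents of samples.bix.xml (or the emboss and ncbitools files) into a string, pass it through relocate, and write the result back, keeping a copy of the original.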

Sample bioinformatics apps

Select the GridRunner menu of Bio Apps/Samples. This opens a command directory, where you can browse or search for programs of interest. For example, the Clustal multiple alignment program is shown, in the Sequence Alignment category.

To use ClustalW, you need to locate a data file of un-aligned sequences, such as found in the BioGridRunner samples folder. This samples folder contains the clustalw.pir file of 5 test sequences for ClustalW. From the directory window of samples, open the clustalw.pir file to display its sequences, as shown below.

The steps in running Clustal W involve opening (Directory options/Open Node) the job dialog for clustalw from the Sample Apps directory of programs. Then enter or drag-and-drop the file URL for clustalw.pir data into this job dialog "Input data" field. Other options for alignment can be selected in this job dialog.

At the top of this job dialog, there are buttons to Run, Test and select the Job Host. The name "localhost" designates the computer you are working on. Alternatively, you can select a grid-aware server computer to run the job on. This also requires that you establish grid credentials for use of that computer (see above). At this writing, the MyProxy server at oat.bio.indiana.edu will provide anonymous users with credentials to run a few bioinformatics jobs on grid test computers at the Indiana University Center for Genomics and Bioinformatics. The "Test" button verifies that you are allowed to run a job on the computer, and that the program, e.g. ClustalW, is available at the path specified.

This preliminary test comes with the Sample apps configured with a system path of '/bio/mb/bin', so that ClustalW should be located as /bio/mb/bin/clustalw. If you locate ClustalW in another folder, you can edit the environment settings for this job, using the Job menu File/Edit Environment option. See also the Job Options/Edit command line option for another method to set the command line for this job.
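
If a local job fails and you are unsure whether the path setting points at a real program, a check along these lines may help. It is only a local sketch of the kind of lookup described here, not the code behind the "Test" button: the class ProgramLocator and its methods are hypothetical, and on MS Windows the program file would normally carry an .exe suffix.

```java
import java.io.File;

// Hypothetical check: is the program found, and executable, under the configured path?
class ProgramLocator {
    /** Joins a program folder, such as the sample expath value, with a program name. */
    static File locate(String expath, String program) {
        return new File(expath, program);
    }

    /** True if the file exists, is a regular file, and may be executed by this user. */
    static boolean isRunnable(File f) {
        return f.isFile() && f.canExecute();
    }

    public static void main(String[] args) {
        // Defaults come from the sample configuration: /bio/mb/bin/ and clustalw.
        String expath = args.length > 0 ? args[0] : "/bio/mb/bin/";
        String program = args.length > 1 ? args[1] : "clustalw";
        File f = locate(expath, program);
        System.out.println(f.getPath() + (isRunnable(f) ? " is runnable" : " was not found or is not executable"));
    }
}
```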

When the job is run, your selected data and program parameters are sent on to the program. If the program is on a remote computer, data are transported with grid transport protocols in a secure fashion to your remote account work folder. Status of the job is shown at the bottom of the job dialog, and when the job finishes, all resulting data files and job output are returned to your computer for display or further analysis, as shown below for ClustalW.

These windows showing the data selection and job dialog for ClustalW are shown all together below.

For a more complex example, using the EMBOSS suite of bioinformatics programs, one can search the program directory for commands dealing with a topic of interest, such as sequence alignments, using a search filter "(info=*align*)". Those commands matching this search filter are listed, and can be selected then to run or learn more about from their help documentation, as shown below.
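
The filter "(info=*align*)" uses LDAP search-filter syntax. For readers who want to try such a search outside BioGridRunner, here is a minimal sketch using the standard JNDI classes in the JDK. It assumes a program directory that is served over LDAP and carries an info attribute, which this document does not confirm for every directory. The class BixSearch and its methods are hypothetical, and the server URL is taken from the command line rather than guessed.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Hypothetical search helper built on JNDI's LDAP provider.
class BixSearch {
    /** Escapes characters that are special inside an LDAP filter value (RFC 4515). */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\5c"); break;
                case '*':  sb.append("\\2a"); break;
                case '(':  sb.append("\\28"); break;
                case ')':  sb.append("\\29"); break;
                case '\0': sb.append("\\00"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a substring filter such as (info=*align*). */
    static String containsFilter(String attribute, String text) {
        return "(" + attribute + "=*" + escape(text) + "*)";
    }

    /** Searches the whole subtree under the URL's base and prints the matching entry names. */
    static void search(String ldapUrl, String filter) throws NamingException {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, ldapUrl);
        DirContext ctx = new InitialDirContext(env);
        try {
            SearchControls controls = new SearchControls();
            controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
            NamingEnumeration<SearchResult> results = ctx.search("", filter, controls);
            while (results.hasMore()) {
                System.out.println(results.next().getNameInNamespace());
            }
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        String filter = containsFilter("info", "align");
        System.out.println(filter);
        if (args.length > 0) search(args[0], filter); // pass an ldap:// URL to run the search
    }
}
```

Without arguments the program only prints the filter; given an ldap:// URL it connects anonymously and lists matching entries, so the result depends entirely on what that server holds.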

Author

Don Gilbert
Center for Genomics and Bioinformatics
Biology Department, Indiana University
Bloomington, Indiana, 47405 USA
[email protected]
