WHAT IS A DATABASE ANYWAY ?

     DATABASE management systems have been around since the early seventies. 
Only recently, however, have they started to become commonly available for 
microcomputers and personal computers.

     In this article, I shall set out to review what database management 
systems (DBMS) are, why they came into existence, their advantages over more 
conventional languages, and how they work.              

     Much of the material in this article will directly relate to data-bases 
II, which is the most popular micro DBMS, although much of the reasoning 
applies equally well to FMS-80, CONDOR 20, MDBS, INFOSTAR and other systems.

     In order to understand the operation of database systems, it is necessary 
first of all to be fimilar with the underlying concepts of computer data files 
and methods of orginising data in those files.

     Data is simply raw information, It may consist of numbers, alphanumeric 
strings, codes or anything else that can be represented on some form of 
storage medium, be it tape, disk or magnetic bubbles.

     It is important to be aware that data has no meaning: reading a sector of 
a floppy disk might reveal the data "DHS,23.56,7691,47,105.54,4002.50". What 
does this mean? As it stands, nothing. It is only when the information is 
placed in context that it begins to make sense. 

     For example, if we know that the sector is part of a file maintained by 
an invoice system, we might be able to deduce that the customer with the code 
DHS has a current balance of $23.56 before buying 47 of product code 7692, and 
so on.

     Notice that even this information is not especially significant. A 
manager using the computer system might ask "What is the trading history of 
DHS? What is their credit standing?" And the raw data is of no help in this 
regard. 

     When data is placed in context, then it becomes information - it is 
informative, it tells us something. Raw data is meaningless.

     Files are simply collections of raw data, written onto a storage medium 
by some sort of program. In general, the program knows what a particular datum 
represents by virtue of its position in the file. The file does not contain 
information, like "The money owed to us by XYZ Company is $598.40". It does 
not even contain helpful reminders like "Balance = $598.40". It simply 
contains a sequence of raw data, like 341.87,610.00,598.40.

     Each section of the related information in the file is termed a record. 
Take a simple inventory file, for example. The data it would need to store 
might be product code, description, quantity in stock, back order quantity and 
re-order level. Let's first examine how this file might be created using 
Microsoft BASIC-80 : 



100 REM INVENTORY FILE CREATE FILE
110 REM
120 REM VARIABLE DEFINITION
130 REM I$ = ID CODE                    D$ = DESCRIPTION
140 REM S$ = STOCK ON HAND              B$ = BACK ORDER QUANTITY
150 REM R$ = REORDER QUANTITY
160 REM OPEN FILE
170 OPEN "R",1,"INVENT.DAT"
180 FIELD#1,5 AS I$, 30 AS D$, 2 AS S$,2 AS B$,2 AS  R$
190 INPUT "ID CODE";I1$
200 IF I$ = "END" THEN 270 ELSE LET I$ =I1$
210 INPUT "DESCRIPTION";D1$:LET D$ = D1$
220 INPUT "STOCK ON HAND";S#:LET S$=MKI$(S$))
230 INPUT "BACK ORDER QTY";B#:LET B$=MKI$(B#)
240 INPUT "RE-ORDER QUANTITY";R#:LET R#=MKI$(R#)
250 PUT 1
260 GOTO 190
270 PRINT "ENTRY COMPLETED"
280 CLOSE:END

          There are several points to note about this program. First is the 
definition of the variables; these definitions only apply within this program. 
To illustrate that point, here's a program which will print the contents of 
the same file, only it uses totally different variable names.

100 REM INVENTORY DUMP PROGRAM
110 REM
120 REM VARIABLE DEFINITIONS
130 REM A$ = ID CODE                    B$ = DESCRIPTION
140 REM C$ = STOCK ON HAND              D$ = BACK ORDER QUANTITY
150 REM E$ = RE-ORDER QUANTITY
160 REM OPEN FILE
170 OPEN "R",1,"INVENT.DAT"
180 FIELD#1,5 A$, 30 AS B$, 2 AS C$, 2 AS D$, 2 AS E$
190 FOR I = 1 TO LOF(1)
200   GET 1,I
210   F# = CVI(C$)
220   G# = CVI(D$)
230   H# = CVI(E$)
240   PRINT A$;"  ";B$;F#;"  ";G#;"  "H#
250 NEXT I
260 PRINT "END OF FILE"
270 CLOSE: END

     In this case, rather than trying to make  variable names in any way 
symbolic, I simply used letters from A to I. If you try this pair of programs, 
you'll see that this makes absolutely no difference - but it does make it 
difficult for a programmer, working on these programs, to remember what 
variable name refers to what item of data.

     
     Next, notice this line in both programs:
180 FIELD#1,5 A$, 30 AS B$, 2 AS C$, 2 AS D$, 2 AS E$

     
     This line tells the program how long each item of data is: 5 bytes for 
the product ID code, 30 characters for the description, 2 bytes for each of 
the stock on hand, back order and re-order quantities.

     If the two programs had different FIELD statements, the result would be a 
disaster, as they would progressively get out of step and read the wrong data. 
It is the programmers responsibility to ensure that programs agree on the data 
layout of shared files. This information about file layout is not in the files 
themselves, but in the programs that use the files -- a key point.

     Each file consists of multiple records, and each record contains multiple 
fields, as defined by the FIELD statement. A record contains a a set of 
related data pertaining to one stock item, one transaction or whatever. Each 
field is one item of data within that record - an ID code, quantity, balance 
or other data.

          FILE ORGINISATION.

     All files are alike as far as these basic concepts go. However, as our 
needs become more realistic, it emerges that there are several ways of 
orginising files. The two basic types of file orginisation are sequential and 
direct.
     
     In sequential access, records are accessed in order, from the beginning 
of the file to the end. In order to make this form of access more acceptable, 
records are usually maintained in alphabetic or numeric order on a key such as 
surname or product ID. Because records follow in sequence, they can sometimes 
vary in length, so that each record starts where the previous one ends, 
typically erasing delimiters such as commas or carriage returns to mark the 
boundaries between each field of the records.

     In direct access files, the system is able to move directly to and read 
or write any record in the file. Because the computer's operating system must 
be able to calculate the position of the nth record, all records MUST be the 
same fixed length. Thus, if records are I bytes long, the nth record is 
positioned    I*(n-1) bytes from the beginning of the file. Direct access 
files may or may not be maintained in alphabetical order.

     Here's a very simple example. Consider the problem of maintaining a set 
of debtor's accounts for a small company. Suppose we have 500 customers and 
that the most active of these places orders daily, while the quietist orders 
monthly. We need to be able to look up any customer's balance on demand. 

Here's one conceivable file orginisation using direct file access:

Customer Name                           25 bytes alphanumeric
Street Address                          25   "        "
Town                                    20   "        "
Postcode                                4    "        "
Order 1 description                     30   "        "
Order 1 amount                          4    "    floating point
Order 2 description                     30   "    alphanumeric
Order 2 amount                          4    "    floating point
Order 3 description                     30   "    alphanumeric
Order 3 amount                          4    "    floating point       cont..>
     .
Order 28 description                    30   "    alphanumeric
Order 28 amount                         4    "    floating point
-------------------------------------------------------------------
Total record length                     1026 bytes
times 500 accounts                      501 Kbytes

     That's bigger than most 13cm floppy disks. But more importantly, well 
over half the space in the file is wasted. The vast majority of customers 
might place only 4 orders a month, while we have reserved places for 28. That 
means in many records 816 bytes were wasted. There is a better way!!!

     And here it is. Instead, we have only one file containing the name and 
address information for each customer, plus a pointer value, which we'll 
explain in a moment. Then we create a second file, which contains all the 
order information for all the customers. Each record contains the order 
description, and amount, plus another pointer. How do we tell which orders are 
for which customers? That's where the pointers come in.

     Now we have two files, orginised like this:

Customer_File.

Customer Name                           25 bytes alphanumeric
Street Address                          25   "        "
Town                                    20   "        "
Postcode                                4    "        "
Pointer                                 2    "   binary word
----------------------------------------------------------------
Total Record Length                     76 bytes
times 500                               38 Kbytes

Order_File.

Order 1 description                     30   "        "
Order 1 amount                          4    "    floating point
Pointer                                 2    "    binary word
----------------------------------------------------------------
Total Record Length                     36 bytes
Times 5000 orders/month                180 Kbytes


     The total of the two is only 218 Kilobytes. Now look at Figure 1, which 
shows the way the pointers work. The pointer for the first customer's record 
is the number of his first order record in the order file. It's pointer, in 
turn, is set to the number of his second order, which in this case is the 
third record in the order file. It, in turn, chains to the fifth record of the 
file.

     A similar approach shows shows that the second customer has placed two 
orders, located in records 2 and 6 of the order file. Customer number three 
has also placed two orders.



     Now, while, at a stroke, we have saved 300 kilobytes of file space, we do 
have to pay a price. Firstly, although we can locate all orders placed by a  
given customer, we cannot find a customer who placed a particular order.

     Similarly, the pointers form a rather tenuous thread linking various 
parts of the file system altogether. If any of the pointers is accidently 
damaged, a part of the field is lost to us, and with this simple file 
structure, there is no way of restoring it short of recreating the file from 
scratch.

     In practice, a commercial system would get around both these objections 
by providing links in both directions, allowing tracing of a particular order 
to a customer, as well as providing redundant links which will enable 
reconstruction of any broken pointers.


     A modification of this technique allows each record to contain multiple 
pointers in what is called a tree structure. By using trees, it is possible to 
artificially overlay, say, alphabetic ordering on an otherwise random file.

AT_LAST,_THE_DATABASE.

     Until the early seventies. most programs were written using these 
techniques. However,there are a number of problems associated with this way of 
doing things.
     
     First, data associated with a particular amount is scattered across 
several files, and often cannot be accessed without knowing the particular key 
(in the example above, the customer name). It is difficult to work out what 
data, where, relates to what else, and it is difficult to create a set of 
files and logical file structure which meets realistic needs and yet is simple 
enough to understand.

     Second, the easiest way to meet access requirements is often just to 
store the same data repeatedly in several files. The result is a lot of 
expensive disk storage.

     Third, the data formats used by different programming languages and 
utilities are often imcompatible, and as we have seen above, it is up to the 
programmer to remember what data goes where and in what format.

     Fourth, these kinds of large systems in the real world have a habit of 
growing like Topsy. As they do so, it becomes more and more difficult for the 
programmer to remember all the variable names and data formats and other 
peculiarities such as file relationships. And if a file structure is changed, 
then all lines, in all programs which use the file must also be changed - 
often a mammoth task.

     Finally, data in the files is  completely raw and has no meaning; we 
cannot tell what each field represents - even if we can identify where a field 
begins and ends. The database management system provider a solution to the 
problem.



     A DBMS completely manages the storage of data, with no need for the user 
to know how it orginises it. The user needs only to know how many bytes are 
devoted to a particular field (although he can originally specify this) or 
where it is located in a record, He need only know that a field called name 
contains the customer name.

     This property is called data independence. If a user decides that he 
would prefer data presented in a different format, then a simple command to 
the DBMS will result in the data being reformatted. This is possible thorough 
a unique facility which is not provided in conventional program languages - 
that is, information about the record layout, including field names, types and 
sizes, is embedded in the database, and is available to all programs using it.

     This information may be stored in a separate file, as in FMS-80 or 
InfoStar, or may be stored in the first few hundred bytes of the database 
itself, as dBaseII does. The result is that the user (and his programs) need 
know nothing of the actual database layout, and changes to the database 
orginisation will not effect the running of the programs.

     Two distinct classes of database systems are available: those which 
simply provide a sophisticated file access method such as TOTAL on mainframes 
or MBDS on micros; and those which also incorporate a special language which 
may be used to manipulate the database, query it and provide reports - such as 
FMS-80, Condor 20 and dBaseII.

     Within each of these groups, there is a further subdivision into two 
major database models : heirachical and relational. The heirarchical model 
corresponds basically to the simple example provided in figure 1, where each 
order is part of a list which belongs to a master record.



______________________________________________________________________________
|  NAME | STREET | TOWN | P/C |POINT| NAME | STREET | TOWN | P/C |POINT| 
NAME>>
------------------------------------------------------------------------------
    ______<______________<_______|                                  |
    |                   ________________<______________<____________|
    |                   |
    |                   |                  
__<_________________________<______>
___ 
|___________________|__________________|___________________________________
|DESC||QTY|POINT|DESC||QTY|POINT|DESC||QTY|POINT|DESC||QTY|POINT|DESC||QTY|POI
------------------------------------------------------------------------------
             |_>_________>____|___|          |                     |
                              |              |____>___________>____|
     Record 1.                |                                     
                              |_________>______________________>_to next desc>
                Record 2.
                                 Record 3.




     It suffers from the same mentioned above : that all orders for a 
particular customer are easy to find, but that all customers who have 
purchased a particular product are difficult to discover. Note, though, that 
if orders were chosen to be the master file which links to the customer file, 
the reverse would be true.

     This basic limitation means that this model is not good enough for 
representing one-to-many or many-to-many relationships. However, for many-to-
one relationships (like many orders to one customer), it is obviously well 
suited.

     The relational model, despite the name, has nothing to do with the 
relationships between data items. In fact, the term relation is another 
mathematical name for a table. Consider the example below as a table :

 Cust No   Name          Street              Town        Postcode
+-----------------------------------------------------------------+
| C1     | J. Brown     | 10 Albert St      | Mossman     | 2088   |
+-----------------------------------------------------------------+
| C2     | H. Smith     | 114 Kirk Rd       | Bankstown  | 2200   |
+-----------------------------------------------------------------+
| C3     | L. Wentworth | 67 Boston St      | Sydney     | 2000   |
+-----------------------------------------------------------------+
|        |              |                   |            |        |
Cust No   Description    Amount
+---------------------------------+
| C1     | 20 Widgets   | $85.00  |
+---------------------------------+
| C2     | 45 Grommets  | $1.35   |
+---------------------------------+
| C1     | 30 Widgets   | $130.00 |
+---------------------------------+
|        |              |         |
+---------------------------------+
|        |              |         |


     By creating two relations, we can view the columns as fields, and the 
rows as records. The records are accessed by key; in this case customer 
number. The relational model is purely an abstract mathematical concept, and 
in practice, relational databases have to rely upon pointers and indexes in 
order to physically construct the database. However, it would be possible to 
build a true relational database using a contents addressable memory, which, 
although it is horrendously expensive, is an approach being investigated by 
ICL for its next generation of mainframe computers.
     
     In the meantime, bear in mind that there is no such thing as a true 
relational database running on microcomputers - at best, a system will hide 
the pointers which make it a heirarchical database, but usually macro 
databases simply provide data independence, leaving it to the user to 
implement pointers as he would in any language, like the BASIC example above.




Commercial DBMSs

     In this article, the discussion is confined to microcomputer software 
products, so I look elsewhere for coverage of mainframe products such as TOTAL 
or IBM's IMS and DL/1.

     Condor Series 20 is a relational database which is available in three 
levels. The full system offers a wide range of file management and transaction 
processing capabilities. Like most macro DBMSs, it includes a report 
generating package.


     FMS-80, from Systems Plus, would have to be regarded as a relational 
database system in comparison with other systems which make that claim, but 
it's name actually stands for File Management System, which is a much more 
realistic nomenclature. Primarily menu-driven, it incorporates a data 
manipulation language called EFM, which can be used to generate extremely 
sophisticated applications indeed.

     dBaseII is the most popular microcomputer DBMS. Based on a relational 
model, it features a range of transaction processing and reporting 
capabilities, and incorporates its own, nameless language. While data-bases 
incorporates a number of sophisticated concepts, it falls down in its actual 
implementation, particularly in regard to documentation, which simply omits 
many features of the system. Indeed, for a long time, I was convinced that I 
must be missing something about the package. My self confidence was recovered 
as people started pointing out the deficiencies of the documentation!

     MBDS is actually a superset of the heirarchical model - termed a network 
database - and is primarily an access method for use from within other 
programs. It is possibly the most sophisticated of the packages mentioned 
here, and is certainly the most complex, if the documentation is any 
indication. It is also the most expensive DBMS package available for micros.

     TIM III, from Innovative Software, runs on both IBM PC and CP/M-based 
machines. It is a menu driven system with extensive help screens, and is one 
of the simplest systems to use.

     Many other database systems are available, both on macros and mainframes. 
However the basic principles remain the same, and they all share two common 
concepts : The information about file structure is known to all programs that 
use the database, and the system provides a report generator program which can 
be used with virtually no programming knowledge.



                               END      PREFACE
               
.