WHAT IS A DATABASE ANYWAY ? DATABASE management systems have been around since the early seventies. Only recently, however, have they started to become commonly available for microcomputers and personal computers. In this article, I shall set out to review what database management systems (DBMS) are, why they came into existence, their advantages over more conventional languages, and how they work. Much of the material in this article will directly relate to data-bases II, which is the most popular micro DBMS, although much of the reasoning applies equally well to FMS-80, CONDOR 20, MDBS, INFOSTAR and other systems. In order to understand the operation of database systems, it is necessary first of all to be fimilar with the underlying concepts of computer data files and methods of orginising data in those files. Data is simply raw information, It may consist of numbers, alphanumeric strings, codes or anything else that can be represented on some form of storage medium, be it tape, disk or magnetic bubbles. It is important to be aware that data has no meaning: reading a sector of a floppy disk might reveal the data "DHS,23.56,7691,47,105.54,4002.50". What does this mean? As it stands, nothing. It is only when the information is placed in context that it begins to make sense. For example, if we know that the sector is part of a file maintained by an invoice system, we might be able to deduce that the customer with the code DHS has a current balance of $23.56 before buying 47 of product code 7692, and so on. Notice that even this information is not especially significant. A manager using the computer system might ask "What is the trading history of DHS? What is their credit standing?" And the raw data is of no help in this regard. When data is placed in context, then it becomes information - it is informative, it tells us something. Raw data is meaningless. Files are simply collections of raw data, written onto a storage medium by some sort of program. In general, the program knows what a particular datum represents by virtue of its position in the file. The file does not contain information, like "The money owed to us by XYZ Company is $598.40". It does not even contain helpful reminders like "Balance = $598.40". It simply contains a sequence of raw data, like 341.87,610.00,598.40. Each section of the related information in the file is termed a record. Take a simple inventory file, for example. The data it would need to store might be product code, description, quantity in stock, back order quantity and re-order level. Let's first examine how this file might be created using Microsoft BASIC-80 : 100 REM INVENTORY FILE CREATE FILE 110 REM 120 REM VARIABLE DEFINITION 130 REM I$ = ID CODE D$ = DESCRIPTION 140 REM S$ = STOCK ON HAND B$ = BACK ORDER QUANTITY 150 REM R$ = REORDER QUANTITY 160 REM OPEN FILE 170 OPEN "R",1,"INVENT.DAT" 180 FIELD#1,5 AS I$, 30 AS D$, 2 AS S$,2 AS B$,2 AS R$ 190 INPUT "ID CODE";I1$ 200 IF I$ = "END" THEN 270 ELSE LET I$ =I1$ 210 INPUT "DESCRIPTION";D1$:LET D$ = D1$ 220 INPUT "STOCK ON HAND";S#:LET S$=MKI$(S$)) 230 INPUT "BACK ORDER QTY";B#:LET B$=MKI$(B#) 240 INPUT "RE-ORDER QUANTITY";R#:LET R#=MKI$(R#) 250 PUT 1 260 GOTO 190 270 PRINT "ENTRY COMPLETED" 280 CLOSE:END There are several points to note about this program. First is the definition of the variables; these definitions only apply within this program. To illustrate that point, here's a program which will print the contents of the same file, only it uses totally different variable names. 100 REM INVENTORY DUMP PROGRAM 110 REM 120 REM VARIABLE DEFINITIONS 130 REM A$ = ID CODE B$ = DESCRIPTION 140 REM C$ = STOCK ON HAND D$ = BACK ORDER QUANTITY 150 REM E$ = RE-ORDER QUANTITY 160 REM OPEN FILE 170 OPEN "R",1,"INVENT.DAT" 180 FIELD#1,5 A$, 30 AS B$, 2 AS C$, 2 AS D$, 2 AS E$ 190 FOR I = 1 TO LOF(1) 200 GET 1,I 210 F# = CVI(C$) 220 G# = CVI(D$) 230 H# = CVI(E$) 240 PRINT A$;" ";B$;F#;" ";G#;" "H# 250 NEXT I 260 PRINT "END OF FILE" 270 CLOSE: END In this case, rather than trying to make variable names in any way symbolic, I simply used letters from A to I. If you try this pair of programs, you'll see that this makes absolutely no difference - but it does make it difficult for a programmer, working on these programs, to remember what variable name refers to what item of data. Next, notice this line in both programs: 180 FIELD#1,5 A$, 30 AS B$, 2 AS C$, 2 AS D$, 2 AS E$ This line tells the program how long each item of data is: 5 bytes for the product ID code, 30 characters for the description, 2 bytes for each of the stock on hand, back order and re-order quantities. If the two programs had different FIELD statements, the result would be a disaster, as they would progressively get out of step and read the wrong data. It is the programmers responsibility to ensure that programs agree on the data layout of shared files. This information about file layout is not in the files themselves, but in the programs that use the files -- a key point. Each file consists of multiple records, and each record contains multiple fields, as defined by the FIELD statement. A record contains a a set of related data pertaining to one stock item, one transaction or whatever. Each field is one item of data within that record - an ID code, quantity, balance or other data. FILE ORGINISATION. All files are alike as far as these basic concepts go. However, as our needs become more realistic, it emerges that there are several ways of orginising files. The two basic types of file orginisation are sequential and direct. In sequential access, records are accessed in order, from the beginning of the file to the end. In order to make this form of access more acceptable, records are usually maintained in alphabetic or numeric order on a key such as surname or product ID. Because records follow in sequence, they can sometimes vary in length, so that each record starts where the previous one ends, typically erasing delimiters such as commas or carriage returns to mark the boundaries between each field of the records. In direct access files, the system is able to move directly to and read or write any record in the file. Because the computer's operating system must be able to calculate the position of the nth record, all records MUST be the same fixed length. Thus, if records are I bytes long, the nth record is positioned I*(n-1) bytes from the beginning of the file. Direct access files may or may not be maintained in alphabetical order. Here's a very simple example. Consider the problem of maintaining a set of debtor's accounts for a small company. Suppose we have 500 customers and that the most active of these places orders daily, while the quietist orders monthly. We need to be able to look up any customer's balance on demand. Here's one conceivable file orginisation using direct file access: Customer Name 25 bytes alphanumeric Street Address 25 " " Town 20 " " Postcode 4 " " Order 1 description 30 " " Order 1 amount 4 " floating point Order 2 description 30 " alphanumeric Order 2 amount 4 " floating point Order 3 description 30 " alphanumeric Order 3 amount 4 " floating point cont..> . Order 28 description 30 " alphanumeric Order 28 amount 4 " floating point ------------------------------------------------------------------- Total record length 1026 bytes times 500 accounts 501 Kbytes That's bigger than most 13cm floppy disks. But more importantly, well over half the space in the file is wasted. The vast majority of customers might place only 4 orders a month, while we have reserved places for 28. That means in many records 816 bytes were wasted. There is a better way!!! And here it is. Instead, we have only one file containing the name and address information for each customer, plus a pointer value, which we'll explain in a moment. Then we create a second file, which contains all the order information for all the customers. Each record contains the order description, and amount, plus another pointer. How do we tell which orders are for which customers? That's where the pointers come in. Now we have two files, orginised like this: Customer_File. Customer Name 25 bytes alphanumeric Street Address 25 " " Town 20 " " Postcode 4 " " Pointer 2 " binary word ---------------------------------------------------------------- Total Record Length 76 bytes times 500 38 Kbytes Order_File. Order 1 description 30 " " Order 1 amount 4 " floating point Pointer 2 " binary word ---------------------------------------------------------------- Total Record Length 36 bytes Times 5000 orders/month 180 Kbytes The total of the two is only 218 Kilobytes. Now look at Figure 1, which shows the way the pointers work. The pointer for the first customer's record is the number of his first order record in the order file. It's pointer, in turn, is set to the number of his second order, which in this case is the third record in the order file. It, in turn, chains to the fifth record of the file. A similar approach shows shows that the second customer has placed two orders, located in records 2 and 6 of the order file. Customer number three has also placed two orders. Now, while, at a stroke, we have saved 300 kilobytes of file space, we do have to pay a price. Firstly, although we can locate all orders placed by a given customer, we cannot find a customer who placed a particular order. Similarly, the pointers form a rather tenuous thread linking various parts of the file system altogether. If any of the pointers is accidently damaged, a part of the field is lost to us, and with this simple file structure, there is no way of restoring it short of recreating the file from scratch. In practice, a commercial system would get around both these objections by providing links in both directions, allowing tracing of a particular order to a customer, as well as providing redundant links which will enable reconstruction of any broken pointers. A modification of this technique allows each record to contain multiple pointers in what is called a tree structure. By using trees, it is possible to artificially overlay, say, alphabetic ordering on an otherwise random file. AT_LAST,_THE_DATABASE. Until the early seventies. most programs were written using these techniques. However,there are a number of problems associated with this way of doing things. First, data associated with a particular amount is scattered across several files, and often cannot be accessed without knowing the particular key (in the example above, the customer name). It is difficult to work out what data, where, relates to what else, and it is difficult to create a set of files and logical file structure which meets realistic needs and yet is simple enough to understand. Second, the easiest way to meet access requirements is often just to store the same data repeatedly in several files. The result is a lot of expensive disk storage. Third, the data formats used by different programming languages and utilities are often imcompatible, and as we have seen above, it is up to the programmer to remember what data goes where and in what format. Fourth, these kinds of large systems in the real world have a habit of growing like Topsy. As they do so, it becomes more and more difficult for the programmer to remember all the variable names and data formats and other peculiarities such as file relationships. And if a file structure is changed, then all lines, in all programs which use the file must also be changed - often a mammoth task. Finally, data in the files is completely raw and has no meaning; we cannot tell what each field represents - even if we can identify where a field begins and ends. The database management system provider a solution to the problem. A DBMS completely manages the storage of data, with no need for the user to know how it orginises it. The user needs only to know how many bytes are devoted to a particular field (although he can originally specify this) or where it is located in a record, He need only know that a field called name contains the customer name. This property is called data independence. If a user decides that he would prefer data presented in a different format, then a simple command to the DBMS will result in the data being reformatted. This is possible thorough a unique facility which is not provided in conventional program languages - that is, information about the record layout, including field names, types and sizes, is embedded in the database, and is available to all programs using it. This information may be stored in a separate file, as in FMS-80 or InfoStar, or may be stored in the first few hundred bytes of the database itself, as dBaseII does. The result is that the user (and his programs) need know nothing of the actual database layout, and changes to the database orginisation will not effect the running of the programs. Two distinct classes of database systems are available: those which simply provide a sophisticated file access method such as TOTAL on mainframes or MBDS on micros; and those which also incorporate a special language which may be used to manipulate the database, query it and provide reports - such as FMS-80, Condor 20 and dBaseII. Within each of these groups, there is a further subdivision into two major database models : heirachical and relational. The heirarchical model corresponds basically to the simple example provided in figure 1, where each order is part of a list which belongs to a master record. ______________________________________________________________________________ | NAME | STREET | TOWN | P/C |POINT| NAME | STREET | TOWN | P/C |POINT| NAME>> ------------------------------------------------------------------------------ ______<______________<_______| | | ________________<______________<____________| | | | | __<_________________________<______> ___ |___________________|__________________|___________________________________ |DESC||QTY|POINT|DESC||QTY|POINT|DESC||QTY|POINT|DESC||QTY|POINT|DESC||QTY|POI ------------------------------------------------------------------------------ |_>_________>____|___| | | | |____>___________>____| Record 1. | |_________>______________________>_to next desc> Record 2. Record 3. It suffers from the same mentioned above : that all orders for a particular customer are easy to find, but that all customers who have purchased a particular product are difficult to discover. Note, though, that if orders were chosen to be the master file which links to the customer file, the reverse would be true. This basic limitation means that this model is not good enough for representing one-to-many or many-to-many relationships. However, for many-to- one relationships (like many orders to one customer), it is obviously well suited. The relational model, despite the name, has nothing to do with the relationships between data items. In fact, the term relation is another mathematical name for a table. Consider the example below as a table : Cust No Name Street Town Postcode +-----------------------------------------------------------------+ | C1 | J. Brown | 10 Albert St | Mossman | 2088 | +-----------------------------------------------------------------+ | C2 | H. Smith | 114 Kirk Rd | Bankstown | 2200 | +-----------------------------------------------------------------+ | C3 | L. Wentworth | 67 Boston St | Sydney | 2000 | +-----------------------------------------------------------------+ | | | | | | Cust No Description Amount +---------------------------------+ | C1 | 20 Widgets | $85.00 | +---------------------------------+ | C2 | 45 Grommets | $1.35 | +---------------------------------+ | C1 | 30 Widgets | $130.00 | +---------------------------------+ | | | | +---------------------------------+ | | | | By creating two relations, we can view the columns as fields, and the rows as records. The records are accessed by key; in this case customer number. The relational model is purely an abstract mathematical concept, and in practice, relational databases have to rely upon pointers and indexes in order to physically construct the database. However, it would be possible to build a true relational database using a contents addressable memory, which, although it is horrendously expensive, is an approach being investigated by ICL for its next generation of mainframe computers. In the meantime, bear in mind that there is no such thing as a true relational database running on microcomputers - at best, a system will hide the pointers which make it a heirarchical database, but usually macro databases simply provide data independence, leaving it to the user to implement pointers as he would in any language, like the BASIC example above. Commercial DBMSs In this article, the discussion is confined to microcomputer software products, so I look elsewhere for coverage of mainframe products such as TOTAL or IBM's IMS and DL/1. Condor Series 20 is a relational database which is available in three levels. The full system offers a wide range of file management and transaction processing capabilities. Like most macro DBMSs, it includes a report generating package. FMS-80, from Systems Plus, would have to be regarded as a relational database system in comparison with other systems which make that claim, but it's name actually stands for File Management System, which is a much more realistic nomenclature. Primarily menu-driven, it incorporates a data manipulation language called EFM, which can be used to generate extremely sophisticated applications indeed. dBaseII is the most popular microcomputer DBMS. Based on a relational model, it features a range of transaction processing and reporting capabilities, and incorporates its own, nameless language. While data-bases incorporates a number of sophisticated concepts, it falls down in its actual implementation, particularly in regard to documentation, which simply omits many features of the system. Indeed, for a long time, I was convinced that I must be missing something about the package. My self confidence was recovered as people started pointing out the deficiencies of the documentation! MBDS is actually a superset of the heirarchical model - termed a network database - and is primarily an access method for use from within other programs. It is possibly the most sophisticated of the packages mentioned here, and is certainly the most complex, if the documentation is any indication. It is also the most expensive DBMS package available for micros. TIM III, from Innovative Software, runs on both IBM PC and CP/M-based machines. It is a menu driven system with extensive help screens, and is one of the simplest systems to use. Many other database systems are available, both on macros and mainframes. However the basic principles remain the same, and they all share two common concepts : The information about file structure is known to all programs that use the database, and the system provides a report generator program which can be used with virtually no programming knowledge. END PREFACE .