                    -----------------------------
                    INSIDE TURBO PASCAL 5.5 UNITS
                    -----------------------------

                         by William L. Peavy
                          -----------------
                      Revised: August 11, 1990


ABSTRACT

This document provides a revised report on research into the structure
and content of Unit (.TPU) files produced by Turbo Pascal (version 5.5)
from Borland International.  No assurances are possible regarding when
(if ever) further updates will be available, so the material is
released to the Turbo Pascal user community in its admittedly
incomplete state, since very little of consequence really remains to be
done.

COMMENTS

Comments and feedback are welcome -- especially new contributions.  I
can be reached via the following services:

     CompuServe         (70042,2310)
     HalPC Telecom-1    (William;Peavy)
     HalPC Telecom-2    (Wm;Peavy)


                          Table Of Contents

Introduction
1.  Gross File Structure
    1.1  User Units
2.  Locators
    2.1  Local Links
    2.2  Global Links
    2.3  Table Offsets
3.  Unit Header
    3.1  Description
    3.2  File Size
4.  Symbol Dictionaries
    4.1  Organization
    4.2  Interface Dictionary
    4.3  DEBUG Dictionary
    4.4  Dictionary Elements
         4.4.1  Hash Tables
                4.4.1.1  Size
                4.4.1.2  Scope
                4.4.1.3  Special Cases
         4.4.2  Dictionary Headers
         4.4.3  Dictionary Stubs
                4.4.3.1  Label Declaratives ("O")
                4.4.3.2  Un-Typed Constants ("P")
                4.4.3.3  Named Types ("Q")
                4.4.3.4  Variables, Fields, Typed Cons ("R")
                4.4.3.5  Subprograms & Methods ("S")
                4.4.3.6  Turbo Std Procedures ("T")
                4.4.3.7  Turbo Std Functions ("U")
                4.4.3.8  Turbo Std "NEW" Routine ("V")
                4.4.3.9  Turbo Std Port Arrays ("W")
                4.4.3.10 Turbo Std External Variables ("X")
                4.4.3.11 Units ("Y")
         4.4.4  Type Descriptors
                4.4.4.1  Scope
                4.4.4.2  Prefix Part
                4.4.4.3  Suffix Parts
                         4.4.4.3.1  Un-Typed
                         4.4.4.3.2  Structured Types
                                    4.4.4.3.2.1  ARRAY Types
                                    4.4.4.3.2.2  RECORD Types
                                    4.4.4.3.2.3  OBJECT Types
                                    4.4.4.3.2.4  FILE (non-TEXT) Types
                                    4.4.4.3.2.5  TEXT File Types
                                    4.4.4.3.2.6  SET Types
                                    4.4.4.3.2.7  POINTER Types
                                    4.4.4.3.2.8  STRING Types
                         4.4.4.3.3  Floating-Point Types
                         4.4.4.3.4  Ordinal Types
                                    4.4.4.3.4.1  "Integers"
                                    4.4.4.3.4.2  BOOLEANs
                                    4.4.4.3.4.3  CHARs
                                    4.4.4.3.4.4  ENUMERATions
                         4.4.4.3.5  SUBPROGRAM Types
5.  Maps and Lists
    5.1  PROC Map
    5.2  CSeg Map
    5.3  Typed CONST DSeg Map
    5.4  Global VAR DSeg Map
    5.5  Donor Unit List
    5.6  Source File List
    5.7  DEBUG Trace Table
6.  Code, Data, Relocation Info
    6.1  Object CSegs
    6.2  CONST DSegs
    6.3  Relocation Data Table
7.  Supplied Program
    7.1  TPUNEW
    7.2  TPURPT1
    7.3  TPUAMS1
    7.4  TPUUNA1
    7.5  Modifications
    7.6  Notes on Program Logic
         7.6.1  Formatting the Dictionary
         7.6.2  The Disassembler
8.  Unit Libraries
    8.1  Library Structure
    8.2  The TPUMOVER Utility
9.  Application Notes
10. Acknowledgements
11. References


INTRODUCTION

This document is the outcome of an inquiry conducted into the structure
and content of Borland Turbo Pascal (Version 5.5) Unit files.  The
original purpose of the inquiry was to provide a body of theory
enabling Cross-Reference programs to resolve references to symbols
defined in .TPU files where qualification was not explicitly provided.
As is so often the case, one thing led to another and the scope of the
inquiry was expanded dramatically.

While this document should not be regarded as definitive, the author
feels that the entire Turbo Pascal User community might gain from the
information extracted from these files at the cost of so much time and
effort.

The material contained herein represents the findings and
interpretations of the author.  A great deal of guess-work was required
and no assurances are given as to the accuracy of either the findings
of fact or the inferences contained herein, which are the sole
work-product of the author.  In particular, the author had access only
to materials or information that any normal Borland customer has access
to.  Further, no Borland source code was available, as the Library
Routine source is not licensed to the author.  In short, there was
nothing irregular about how these findings were achieved.

The material contained herein is placed in the public domain free of
copyright for use of the general public at its own risk.  The author
assumes no liability for any damages arising from the use of this
material by others.  If you make use of this information and you get
burned, TOUGH!  The author accepts no obligation to correct any such
errors as may exist in the supplied programs or in the findings of fact
or opinion contained herein.

On the other hand, this is not a "complete" work in that a great many
questions remain open, especially as regards fine details.  (The author
is not a practitioner of Intel 80xxx Assembly Language and several open
questions might best be addressed by persons competent in this area.)
The author welcomes the input of interested readers who might be able
to "flesh-out" some of these open questions with "hard" answers.


1. GROSS FILE STRUCTURE

A Turbo Pascal Unit file (Version 5.5 only) consists of an array of
bytes whose length is some exact multiple of sixteen (16).  "Signature"
information allows the compiler to verify that the .TPU file was
compiled with the correct compiler version and to verify that the file
is of the correct size.
The fine structure of the file will be addressed in later sections at
ever increasing levels of detail.

Graphically, the file may be regarded as having the following general
layout:

     +-------------------+
     |    Unit Header    |  Main Index to Unit File
     +-------------------+
     |   Dictionaries:   |
     |   a) Interface    |
     |   b) Debugger   * |  For Local Symbol Access
     +-------------------+
     |     PROC Map      |
     +-------------------+
     |     CSeg Map    * |  May be Empty
     +-------------------+
     | CONST DSeg Map  * |  May be Empty
     +-------------------+
     |  VAR DSeg Map   * |  May be Empty
     +-------------------+
     |   Donor Units   * |  May be Empty
     +-------------------+
     |   Source Files    |
     +-------------------+
     |   Trace Table   * |  May be Empty
     +-------------------+
     | CODE Segment(s) * |  May be Empty
     +-------------------+
     | DATA Segment(s) * |  May be Empty
     +-------------------+
     |    RELO Data    * |  May be Empty
     +-------------------+

1.1 USER UNITS

Units prepared by the compiler available to ordinary users have a very
straightforward appearance and content.  There may even be a little
"wasted" space that might be removed if the compiler were just a little
cleverer.  The SYSTEM.TPU file is quite another thing however.

The SYSTEM.TPU file (found in TURBO.TPL) is extraordinary in that great
pains seem to have been taken to compact it.  Further, it contains a
great many types of entries that just don't seem to be achievable by
ordinary users and I suspect that much (if not all) of it was
"hand-coded" in Assembler Language.

In the following sections, the details of these optimizations will be
explained in the context of the structural element then under
discussion.


2. LOCATORS

The data in these files needs structure and organization to support
efficient access by the various programs such as the compiler, the
linker and the debugger.  This organization is built on a solid
foundation of locators employed in the unit's data structures.

2.1 LOCAL LINKS

Local Links (LL's) are items of type WORD (2 bytes) which contain an
offset which is relative to the origin of the unit file itself.  This
implies that a unit must be somewhat less than 64K bytes in size.  If
the .TPU file is loaded into the heap, then LL's can be used to locate
any byte in the segment beginning with the load point of the file.

2.2 GLOBAL LINKS

Global Links (LG's) are used to locate type descriptors which may
reside in other Units (i.e., units external to the present unit).  LG's
are structured items consisting of two (2) words.  The first of these
is an LL that is relative to the origin of the (possibly) external
unit.  The second word is an LL which locates the stub of the unit
entry in the current unit dictionary for the (possibly) external unit.
This dictionary entry provides the name of the unit that contains the
item the LG points to.  This provides a handy mechanism for locating
type descriptors which are defined in other separately compiled units.

2.3 TABLE OFFSETS

Finally, various data-structures within a .TPU file are organized as
arrays of fixed-length records or as lists of variable-length records.
Efficient access to such records is achieved by means of offsets rather
than subscripts (an addressing technique denied Pascal).  These offsets
are relative to the origin of the array or list being referenced rather
than the origin of the unit.
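
As a rough illustration of the three locator kinds just described, the
following Python sketch reads them from a unit image held in memory.
It assumes that WORDs are stored low byte first, as is usual on Intel
processors.  The function names (read_ll, read_lg, table_entry) and the
sample bytes are invented for this example; they do not come from any
Borland tool or from the supplied programs.

```python
import struct

def read_ll(unit, offset):
    """Read a Local Link: one WORD holding an offset from the unit origin."""
    return struct.unpack_from("<H", unit, offset)[0]

def read_lg(unit, offset):
    """Read a Global Link: two WORDs.

    The first is an LL relative to the origin of the (possibly external)
    unit; the second is an LL to that unit's stub in the current dictionary.
    """
    target_ll, unit_stub_ll = struct.unpack_from("<HH", unit, offset)
    return target_ll, unit_stub_ll

def table_entry(unit, table_ll, entry_offset, size):
    """Slice one record out of a table, using an offset from the table origin."""
    start = table_ll + entry_offset
    return unit[start:start + size]

# Made-up 16-byte image, for demonstration only (not a real unit).
image = bytes([0x08, 0x00, 0x34, 0x12, 0x0C, 0x00, 0x00, 0x00,
               0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x00, 0x00])

print(hex(read_ll(image, 0)))                             # LL stored at offset 0
print(read_lg(image, 2))                                  # LG stored at offset 2
print(table_entry(image, read_ll(image, 0), 2, 2).hex())  # record at table+2
```

Note that resolving an LG fully would also require loading the unit
named by the stub it points to; the sketch stops at splitting the LG
into its two words.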


3. UNIT HEADER

The Unit Header comprises the first 64 bytes of the .TPU file.  It
contains LL's that effectively locate all other sections of the .TPU
file plus statistics that enable a little cross-checking to be
performed.  Some parts of the Unit Header appear to be reserved for
future use since no unit examined by this author has ever contained
non-zero data in these apparently reserved fields.

3.1 DESCRIPTION

The Unit Header provides a high-level locator table whereby each major
structure in the unit file can be addressed.  The following provides a
Pascal-like explanation of the layout of the header followed by further
narrative discussion of the contents of the individual fields in the
Unit Header.

Type
  HdrAry = Array[0..3] of Char;
  LL     = Word;

  UnitHeader = Record
    FilHd : HdrAry; { +00 : = 'TPU6'                         }
    Fillr : HdrAry; { +04 : = $00000000                      }
    UDirE : LL;     { +08 : to Dictionary Head-This Unit     }
    UGHsh : LL;     { +0A : to Interface Hash Header         }
    UHPrc : LL;     { +0C : to PROC Map                      }
    UHCsg : LL;     { +0E : to CSeg Map                      }
    UHDsT : LL;     { +10 : to DSeg Map-Typed CONST's        }
    UHDsV : LL;     { +12 : to DSeg Map-GLOBAL Variables     }
    URULt : LL;     { +14 : to Donor Unit List               }
    USRCF : LL;     { +16 : to Source file List              }
    UDBTS : LL;     { +18 : to Debug Trace Step Controls     }
    UndNC : LL;     { +1A : to end non-code part of Unit     }
    ULCod : Word;   { +1C : Size of Code                     }
    ULTCon: Word;   { +1E : Size of Typed Constant Data      }
    ULPtch: Word;   { +20 : Size of Relo Patch List          }
    Unknx : Word;   { +22 : Number of Virtual Objects???     }
    ULVars: Word;   { +24 : Size of GLOBAL VAR Data          }
    UHash2: LL;     { +26 : to Debug Hash Header             }
    UOvrly: Word;   { +28 : Number of Procs to Overlay??     }
    UVTPad: Array[0..10] of Word;
                    { +2A : Reserved for Future Expansion?   }
  End; { UnitHeader }

FilHd   contains the characters "TPU6" in that order.
        This is clear evidence that this unit was compiled by Turbo
        Pascal Version 5.5.

Fillr   is apparently reserved and contains binary zeros.

UDirE   contains an LL (WORD) which points to the Dictionary Header in
        which the name of this unit is found.

UGHsh   contains an LL (WORD) which points to a Hash table that is the
        root of the Interface Dictionary tree.

UHPrc   contains an LL (WORD) which points to the PROC Map for this
        unit.  The PROC Map contains an entry for each Procedure or
        Function declared in the unit (except for INLINE types), plus
        an entry for the Unit Initialization section.  The length of
        the PROC Map (in bytes) is determined by subtracting this LL
        (at 000C) from the LL at offset 000E.

UHCsg   contains an LL (WORD) which points to the CSeg (CODE Segment)
        Map for this unit.  The CSeg Map contains an entry for each
        CODE Segment produced by the compiler plus an entry for each
        of the CODE Segments included via the {$L filename.OBJ}
        compiler directive.  The length of this Map (in bytes) is
        obtained by subtracting this LL (at 000E) from the word at
        0010.  The result may be zero in which case the CSeg Map is
        empty.

UHDsT   contains an LL (WORD) which points to the DSeg (DATA Segment)
        Map that maps the initializing data for Typed CONST items plus
        templates for VMT's (Virtual Method Tables) that are
        associated with OBJECTS which employ Virtual Methods.  The
        length of this Map (in bytes) is obtained by subtracting this
        LL (at 0010) from the word at 0012.  The result may be zero in
        which case this DSeg Map is empty.

UHDsV   contains an LL (WORD) which points to the DSeg (DATA Segment)
        Map that contains the specifications for DSeg storage required
        by VARiables whose scope is GLOBAL.  The length of this Map
        (in bytes) is obtained by subtracting this LL (at 0012) from
        the word at 0014.
        The result may be zero in which case this DSeg Map is empty.

URULt   contains an LL (WORD) which points to a table of units which
        contribute either CODE or DATA Segments to the .EXE file for a
        program using this Unit.  This is called the "Donor Unit
        Table".  The length of this table (in bytes) is obtained by
        subtracting this LL (at 0014) from the word at 0016.  The
        result may be zero in which case this table is empty.

USRCF   contains an LL (WORD) which points to a list of "source"
        files.  These are the files whose CODE or DATA Segments are
        included in this Unit by the compiler.  Examples are the
        Pascal Source for the Unit itself, plus the .OBJ files
        included via the {$L filename.OBJ} compiler directive.  The
        length of this table (in bytes) is obtained by subtracting
        this LL (at 0016) from the word at 0018.  The result may be
        zero in which case this table is empty.

UDBTS   contains an LL (WORD) which points to a Trace Table used by
        the DEBUGGER for "stepping" through a Function or Procedure
        contained in this Unit.  The length of this table (in bytes)
        is obtained by subtracting this LL (at 0018) from the word at
        001A.  The result may be zero in which case this table is
        empty.

UndNC   contains an LL (WORD) which points to the first free byte
        which follows the Trace Table (if any).  It serves as a
        delimiter for determining the size of the Trace Table.  This
        LL (when rounded up to the next integral multiple of 16)
        serves to locate the start of the code/data segments.

ULCod   is a WORD that contains the total byte count of all CODE
        Segments compiled into this Unit.

ULTCon  is a WORD that contains the total byte count of all Typed
        CONST and VMT DATA Segments compiled into this unit.

ULPtch  is a WORD that contains the total byte count of the Relocation
        Data Table for this unit.

Unknx   is a WORD whose usage is poorly understood.  It appears always
        to be zero except when the Unit contains OBJECTs which employ
        Virtual Methods.

ULVars  is a WORD that contains the total byte count of all GLOBAL VAR
        DATA Segments compiled into this unit.

UHash2  contains an LL (WORD) which points to a Hash Table which is
        the root of the DEBUGGER Dictionary.  If Local Symbols were
        generated by the compiler (directive {$L+}) then ALL symbols
        declared in the unit can be accessed from this Hash Table.  In
        the SYSTEM.TPU file, there is no such Dictionary and the LL
        stored here points to the INTERFACE Dictionary.  This is an
        example of Hash Table "Folding" to save space which has been
        observed only in SYSTEM.TPU.

UOvrly  is a WORD whose usage is poorly understood.  This word is
        usually zero unless the Unit was compiled with the Overlay
        Directive {$O+}.

UVTPad  begins a series of eleven (11) words that are apparently
        reserved for future use.  Nothing but zeros has ever been seen
        here by this author.

3.2 FILE SIZE

An independent check on the size of the .TPU file is available using
information contained in the Unit Header.  This is also important for
.TPL (Unit Library) organization.

To compute the file size, refer to the four (4) words at offsets 001A,
001C, 001E and 0020.  Round the contents of each of these words up to
the lowest multiple of 16 that is greater than or equal to the content
of that word (for example, 0123h rounds up to 0130h, while 0120h is
left unchanged).  Then form the sum of the rounded words.  This is the
.TPU file size in bytes.


4. SYMBOL DICTIONARIES

This area contains all available documentation of declared symbols and
procedure blocks defined within the unit.  Depending on compiler
options in effect when the unit was compiled, this section will contain
at a minimum, the INTERFACE declarations, and at a maximum, ALL
declarations.
The information stored in the dictionary is highly dependent on the
context of the symbol declared.  We defer further explanation to the
appropriate section which follows.

4.1 ORGANIZATION

The dictionary is organized with a Hash Table as its root.  The hash
table is used to provide rapid access to arbitrary symbols.  Since
Turbo Pascal compiles very rapidly, I presume the hash function to be
worthwhile to say the least.

The dictionary itself may be thought of as an n-way tree.  Each subtree
has its roots in a hash table.  There may be a great many hash tables
in a given unit and their number depends on unit complexity as well as
the options chosen when the unit was compiled.  Use of the {$L+}
directive produces the densest trees.

The hash tables are explained in detail a few sections further on.

Hash tables point to Dictionary Headers.  When two or more symbols
produce the same hash function result, a collision is said to occur.
Collisions are resolved by the time-honored method of chaining together
the Dictionary Headers of those symbols having the same hash function
result.  Dictionary supersetting is accomplished using these chains.

4.2 INTERFACE DICTIONARY

The INTERFACE dictionary contains all symbols and the necessary
explanatory data for the INTERFACE section of a Unit.  Symbols get
added to the Unit using increasing storage addresses until the
IMPLEMENTATION section is encountered.

4.3 DEBUG DICTIONARY

The DEBUG dictionary (if present) is a superset of the INTERFACE
dictionary.  It is used by the Turbo Debugger to support its many
features when tracing through a unit.  If present, this dictionary is
rooted in its own hash table.  The hash table is effectively
initialized when the IMPLEMENTATION keyword is processed by the
compiler.
This takes the form (initially) of an unmodified copy of the INTERFACE
hash table, to which symbols are added in the usual fashion.  Thus, the
hash chains constructed or extended at this time lead naturally to the
INTERFACE chains and this is how the superset is effectively
implemented.

4.4 DICTIONARY ELEMENTS

The dictionary contains four major elements.  These are: hash tables,
Dictionary Headers, Dictionary Stubs and Type Descriptors.  The
distinction between Dictionary Headers and Stubs is essentially
arbitrary and is made in this document to assist in exposition.  They
might just as easily be regarded as a single element (such as a symbol
entry).

4.4.1 HASH TABLES

As has been intimated, Hash Tables are the glue that binds the
dictionary entries together and gives the dictionary its "shape".  They
effectively implement the scope rules of the language and speed access
to essential information.

Each Hash table begins with a 2-byte size descriptor.  This descriptor
contains the number of bytes in the table proper (less 2).  Thus, taken
as an offset from the first bucket, the descriptor points directly to
the last bucket in the hash table.  For a hash table of 128 bytes, the
size descriptor contains 126.  The first bucket in the table
immediately follows the size descriptor.

4.4.1.1 SIZE

So far, three different hash table sizes have been observed.  The
INTERFACE and DEBUG hash tables are usually 128 bytes (64 entries) in
size plus 2 bytes of size description, but the SYSTEM.TPU unit is a
special case, containing only 16 entries.  Hash tables which anchor
subtrees whose scope is relatively local usually contain four (4)
entries (8 bytes).
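
The size-descriptor arithmetic described so far can be sketched in
Python as follows.  This is only an illustration under stated
assumptions: WORDs are taken to be stored low byte first, the helper
name read_hash_table is hypothetical, and the sample bytes are
fabricated rather than taken from a real unit.

```python
import struct

def read_hash_table(unit, table_ll):
    """Return the bucket values (LLs, or zero when empty) of one hash table.

    The table starts with a WORD size descriptor holding the byte size of
    the table proper less 2, so a table of n slots stores 2 * (n - 1).
    """
    (descriptor,) = struct.unpack_from("<H", unit, table_ll)
    slot_count = descriptor // 2 + 1
    # The first bucket immediately follows the size descriptor.
    return list(struct.unpack_from("<%dH" % slot_count, unit, table_ll + 2))

# Fabricated four-slot table: descriptor 8 - 2 = 6, then four buckets.
sample = struct.pack("<5H", 0x0006, 0x0000, 0x0120, 0x0000, 0x0098)
buckets = read_hash_table(sample, 0)
print(len(buckets), [hex(b) for b in buckets])
```

A non-zero bucket is the LL of the first Dictionary Header on that
bucket's chain; following the chain itself is left out of the sketch.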

Graphically, a Hash Table with four slots has the following layout:

     +--------------------+
     |       0006h        |  Size Descriptor
     +====================+
     |       slot 0       |  an LL or zero
     +--------------------+
     |       slot 1       |  an LL or zero
     +--------------------+
     |       slot 2       |  an LL or zero
     +--------------------+
     |       slot 3       |  an LL or zero
     +--------------------+

It should be noted that the Size Descriptor furnishes an upper bound
for the hash function itself.  Thus, it seems possible that a single
hash function is used for all hash tables and that its result is ANDed
with the Size Descriptor to get the final result.  Because the sizes
are chosen as they are (powers of 2) this is feasible.

Note that in the above example, 6 = 2 * (n - 1) where n = 4 {slot
count}.  All of the hash tables observed so far have this property.
What you get is a really efficient MOD function.  Suppose that the hash
of a given symbol is 13 and the proper slot must be located for a hash
table of four entries.  If we let "h" be the raw result of 13, then our
final hash is

     (h SHL 1) AND ((4-1) SHL 1)   or   (13 SHL 1) AND 6 = 2

that is, a byte offset of 2 from the first bucket, which selects slot 1
(and indeed 13 MOD 4 = 1).

One final note on this subject.  Given these properties, "Folding" of
sparse hash tables is a rather trivial exercise so long as the new hash
table also contains a number of slots that is a power of 2.  This point
is intriguing when one recalls that the SYSTEM.TPU hash table has only
16 slots rather than the usual 64.

4.4.1.2 SCOPE

The INTERFACE and DEBUG dictionary hash tables are Global in Scope even
though the symbols accessed directly via the DEBUG hash table may be
private.  On the other hand, other hash tables are purely local in
scope.  For example, the fields declared within a record are reached
via a small local hash table, as are the parameters and local variables
declared within procedures and functions.
Even OBJECTS use this technique to provide access to Methods and Object
Fields.  Access to such local scope fields/methods requires use of
qualified names which ensures conformity to Pascal scope rules.  The
method is truly simple and elegant.

4.4.1.3 SPECIAL CASES

The SYSTEM.TPU Unit is a special case.  Its INTERFACE and DEBUG hash
tables have apparently been "hand-tuned" for small size.  Each contains
only sixteen (16) entries.  In addition, the DEBUG hash table is empty
since there is no local symbol generation in this unit.  Therefore, the
DEBUG hash table does not exist as a separate entity, its function
being served by the INTERFACE hash table.  The pointer to the DEBUG
hash table (in the Unit Header) has the same value as the pointer to
the INTERFACE hash table (SYSTEM unit ONLY).

4.4.2 DICTIONARY HEADERS

This is the structure that anchors all information known by the
compiler about any symbol.  The format is as follows:

+00: An LL which points to the next (previous) symbol in the same
     scope which had the same hash function value.

+02: A character that defines the category the symbol belongs to and
     defines the format of the Dictionary Stub which follows the
     Dictionary Header.

+03: A String (in the Pascal sense) of variable size that contains the
     text of the symbol (in UPPER-CASE letters only).  The SizeOf
     function is not defined for these strings since they are
     truncated to match the symbol size.  The "value" of the SizeOf
     function can be determined by adding 1 to the first byte in the
     string.  Thus, Ord(Symbol[0])+1 is the expression that defines
     the Size of the symbol string.  Turbo Pascal defines a symbol as
     a string of relatively arbitrary size, the most significant 63
     characters of which will be stored in the dictionary.
     Thus, we conclude that the maximum size of such a string is 64
     bytes.

4.4.3 DICTIONARY STUBS

Dictionary Stubs immediately follow their respective headers and their
format is determined by the category character in the Dictionary
Header.  The function of the stub is to organize the information
appropriate to the symbol and provide a means of accessing additional
information such as type descriptors, constant values, parameter lists
and nested scopes.

The format of each Stub is presented in the following sub-sections.

4.4.3.1 LABEL DECLARATIVES ("O")

This Stub consists of a WORD whose function is (as yet) unknown.

4.4.3.2 UN-TYPED CONSTANTS ("P")

This Stub consists of two (2) fields:

+00: An LG which points to a Type Descriptor (usually in SYSTEM.TPU).
     This establishes the minimum storage requirement for the
     constant.  The rules vary with the type, but the size of the
     constant data field (which follows) is defined using the Type
     Descriptor(s).

+04: The value of the constant.  For ordinal types, this value is
     stored as a LONGINT (size=4 bytes).  For Floating-Point types,
     the size is implicit in the type itself.  For String types, the
     size is determined from the length of the string which is stored
     in the initial byte of the constant.

4.4.3.3 NAMED TYPES ("Q")

This Stub consists of an LG (4-bytes) that points to the Type
Descriptor for this symbol.

4.4.3.4 VARIABLES, FIELDS, TYPED CONS ("R")

This Stub contains information required to allocate and describe these
types of entities.
The format and content are as follows:

+00: A one-byte flag that precisely identifies the class of the item
     being described.  The known values and their proper
     interpretations are as follows:

       0 -> Global Variables Allocated in DS;
       1 -> Typed Constants Allocated in DS;
       2 -> LOCAL Variables & VALUE Parameters on STACK;
       6 -> ADDRESS Parameters allocated on STACK;
       8 -> Fields suballocated in RECORDS and OBJECTS, plus METHODS
            declared for OBJECTS.

+01: A WORD containing the allocation offset in bytes;

+03: A WORD whose content depends on the one-byte flag that this stub
     begins with.  The context-dependent values observed thus far are:

       If the flag is 0, 2 or 6, then this word is an LL that locates
       the containing scope or zero if none;

       If the flag is 8, then this word is an LL that locates the
       Dictionary Header for the next field or method defined within
       the Record or Object;

       If the flag is 1, then this word is an offset within the CONST
       DSeg Map that locates the text of the Typed Constant Data.

+05: An LG that locates the proper Type Descriptor for this symbol.

4.4.3.5 SUBPROGRAMS & METHODS ("S")

Subprograms, especially since Object Methods are supported, have a
rather involved stub.  Its format is as follows:

+00: A byte that contains bit-switches.  These bit switches have a
     great deal to do with the size of this stub and with the proper
     interpretation of what follows.
     The observed values of the bit-switches are as follows:

       xxxxxxx1 -> Symbol declared in INTERFACE;
       xxxxxx1x -> Symbol is an INLINE Declarative;
       xxxx1x0x -> Symbol has EXTERNAL attribute;
       x001xxxx -> Symbol is an ordinary Object Method;
       x011xxxx -> Symbol is a CONSTRUCTOR Method;
       x101xxxx -> Symbol is a DESTRUCTOR Method;

+01: A Word whose interpretation depends on whether we have an INLINE
     Declarative Subprogram or not.  If this is an INLINE Declarative
     Subprogram, then this word contains the byte-count of the INLINE
     code text at the end of this stub.  Otherwise, this word is the
     offset within the PROC Map that locates the object code for this
     Subprogram.

+03: A Word that contains an LL which locates the containing scope in
     the dictionary, or zero if none.

+05: A Word that contains an LL which locates the local Hash Table for
     this scope.  A local hash table provides access to all formal
     parameters of the Subprogram as well as all Symbols whose
     declarations are local to the scope of this Subprogram.

+07: A Word that is zero unless the symbol is a Virtual Method.  In
     that case, the content is the offset within the VMT for the
     owning object that defines where the FAR POINTER to this Virtual
     Method is stored.

+09: A Word that is zero unless the symbol is a Method.  In that case,
     the content is an LL which locates the next METHOD for this
     Object.

+0B: A complete Type-Descriptor for this Subprogram.  The length is
     variable and depends upon the number of Formal Parameters
     declared in the header.  A complete description of this subfield
     is found in a later section (4.4.4.3.5).

+??: If this Symbol represents an INLINE Declarative Subprogram, then
     the object-code text begins here.  The byte-count of the text
     occurs at offset 0001h in this stub.
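
The fixed part of the "S" stub (offsets +00 through +0A) lends itself
to a short decoding sketch.  The Python below is an illustration only:
it assumes WORDs are stored low byte first, the names
decode_subprogram_stub and METHOD_KINDS are invented for this example,
and only the bit patterns listed above are interpreted.  The
variable-length Type-Descriptor at +0B is not parsed.

```python
import struct

# Bits 4..6 of the flag byte, per the observed patterns x001/x011/x101.
METHOD_KINDS = {0b001: "method", 0b011: "constructor", 0b101: "destructor"}

def decode_subprogram_stub(stub):
    """Decode the fixed part (offsets +00..+0A) of an "S" Dictionary Stub."""
    flags, word1, scope_ll, hash_ll, vmt_offset, next_method_ll = \
        struct.unpack_from("<B5H", stub, 0)
    inline = bool(flags & 0x02)
    return {
        "interface": bool(flags & 0x01),
        "inline": inline,
        # Pattern xxxx1x0x: bit 3 set while the INLINE bit is clear.
        "external": bool(flags & 0x08) and not inline,
        # None when the symbol is not a Method of any listed kind.
        "method_kind": METHOD_KINDS.get((flags >> 4) & 0x07),
        # +01 is a byte count for INLINE code, otherwise a PROC Map offset.
        "inline_size": word1 if inline else None,
        "proc_map_offset": None if inline else word1,
        "scope_ll": scope_ll,
        "local_hash_ll": hash_ll,
        "vmt_offset": vmt_offset,
        "next_method_ll": next_method_ll,
        "type_descriptor_at": 0x0B,  # variable-length part starts here
    }

# Fabricated stub: an INTERFACE-declared CONSTRUCTOR (flag byte 31h).
stub = struct.pack("<B5H", 0x31, 0x0010, 0x0200, 0x0340, 0x0000, 0x0400)
info = decode_subprogram_stub(stub)
print(info["method_kind"], hex(info["proc_map_offset"]))
```

Flag values outside the observed patterns are simply reported as "not
a method"; the meaning of any other bits is not known.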
4.4.3.6 TURBO STD PROCEDURES ("T")

This Stub consists of two bytes, the first of which is unique for each procedure and increments by 4. I have found nothing in the SYSTEM unit (which is where this entry appears) that this seems directly related to. The second byte is always zero.

4.4.3.7 TURBO STD FUNCTIONS ("U")

This Stub consists of two bytes, the first of which is unique for each function and increments by 4. I have found nothing in the SYSTEM unit (which is where this entry appears) that this seems directly related to. I wouldn't be surprised if this byte were an index into a TURBO compiler table that points to specialized parse tables/action routines for handling these functions and their non-standard parameter lists. The second byte seems to be a flag having the values $00, $40 and $C0. I strongly suspect that the flag $C0 marks exactly those functions which may be evaluated at compile-time. The meaning behind the other values is not known to me.

4.4.3.8 TURBO STD "NEW" ROUTINE ("V")

This Stub consists of a WORD whose function is (as yet) unknown. This is the only Standard Turbo routine that can behave as a procedure as well as a function (returning a pointer value).

4.4.3.9 TURBO STD PORT ARRAYS ("W")

This Stub consists of a byte whose value is 0 for byte arrays, and 1 for word arrays.

4.4.3.10 TURBO STD EXTERNAL VARIABLES ("X")

This Stub consists of an LG (4 bytes) that points to the Type Descriptor for this symbol.
4.4.3.11 UNITS ("Y")

Unit Stubs have the following content:

  +00: A Word apparently reserved for use by the Compiler or Linker.
  +02: A Word that seems to contain some kind of "signature" used to detect inconsistent Unit Versions. This author suspects that this consists of some kind of sum-check or hash total but has not yet identified the algorithm which computes the value stored in this word.
  +04: A Word that contains an LL which locates the Successor Unit in the "Uses" list. In fact, the "Uses" lists of both the INTERFACE and IMPLEMENTATION sections of the Unit are merged by this Word into a single list. A value of zero is used to indicate no successor.
  +06: A Word that contains an LL which locates the Predecessor Unit in the "Uses" list. For the SYSTEM unit entry, this value is always zero to indicate no predecessor. For the Unit being compiled, this LL locates the final Unit in the combined "Uses" list.

In effect, the two LL's at offsets 0004 and 0006 organize the units into both forward and backward linked chains. The entry for the unit being compiled is effectively the head of both the forward and the backward chains. The final unit in the merged "Uses" list is the tail of the forward chain, and the SYSTEM unit is the tail of the backward chain.

4.4.4 TYPE DESCRIPTORS

Type Descriptors store much of the semantic information that applies to the symbols declared in the unit. Implementation details can be managed using high-level abstractions and these abstractions can be shared.

4.4.4.1 SCOPE

Type Descriptor sharing can occur across the boundaries which are implicit in unit modules. Thus, a type defined in one unit may be "imported" by some other module. Also, the pre-defined Pascal Types (plus the Turbo Pascal extensions) are defined in the SYSTEM.TPU unit and there needs to be a means of "importing" such Type Descriptors during compilation. This is precisely the objective of the LG locator which was described in section 2.2 (above).

Type Descriptors are NEVER copied between units. The binding always occurs by reference at compile time and this helps support the technique of modifying a unit and compiling it to a .TPU file, then re-compiling all units/programs that "USE" it.

Type Descriptors have many roles so their format varies. We have divided these structures into two parts: the PREFIX Part, which is always present and whose format is fairly constant, and the SUFFIX Part, whose content and format depend on the attributes that are part of the type definition.

4.4.4.2 PREFIX PART

The Prefix Part of every Type Descriptor consists of four (4) bytes. The usage is consistent for all types observed by this author and the format is as follows:

  +00: A Byte that identifies the format of the Suffix part. This is essentially based on several high-level categories which the Suffix Parts support directly.
       The observed set of values is as follows:
         00h -> an un-typed entity;
         01h -> an ARRAY type;
         02h -> a RECORD type;
         03h -> an OBJECT type;
         04h -> a FILE type (other than TEXT);
         05h -> a TEXT File type;
         06h -> a SUBPROGRAM type;
         07h -> a SET type;
         08h -> a POINTER type;
         09h -> a STRING type;
         0Ah -> an 8087 Floating-Point type;
         0Bh -> a REAL type;
         0Ch -> a Fixed-Point ordinal type;
         0Dh -> a BOOLEAN type;
         0Eh -> a CHAR type;
         0Fh -> an Enumerated ordinal type.
  +01: A Byte used as a modifier. Since the above scheme is too general for machine-dependent details such as storage width and sign control, this modifier byte supplies additional data as required. The author has identified several cases in which this information is vital but has not spent very much time on the subject. The chief areas of importance seem to be in the 8087 Floating-Point types, and the Fixed-Point ordinal types. The semantics seem to be as follows:
         0A 00 -> The type "SINGLE"
         0A 02 -> The type "EXTENDED"
         0A 04 -> The type "DOUBLE"
         0A 06 -> The type "COMP"
         0C 00 -> an un-named BYTE integer
         0C 01 -> The type "SHORTINT"
         0C 02 -> The type "BYTE"
         0C 04 -> an un-named WORD integer
         0C 05 -> The type "INTEGER"
         0C 06 -> The type "WORD"
         0C 0C -> an un-named double-word integer
         0C 0D -> The type "LONGINT"
       One important feature of the above semantics is the fact that an un-typed CONST declaration refers to the above two bytes to determine the storage space needed in the dictionary for the data value of the constant. This can be a little involved however as the constant may contain its own length descriptor (as in the case of a character string) in which case it may be sufficient to identify the high-level type category without any modifier byte.
  +02: A Word that contains the number of bytes of storage that are required to contain an object/entity of this type. For types that represent variable-length objects/entities such as strings, this word may define the value returned by the SIZEOF function as applied to the type.

4.4.4.3 SUFFIX PARTS

Suffix Parts further refine the implementation details of the type and also provide subrange constraints where appropriate. In some cases the Suffix part is empty since all semantic data for the type is contained in the Prefix part.

4.4.4.3.1 UN-TYPED

This Suffix Part is empty. Nothing is known about an un-typed entity.

4.4.4.3.2 STRUCTURED TYPES

The structured types represent aggregates of lower-level types. We include ARRAY, RECORD, OBJECT, FILE, TEXT, SET, POINTER and STRING types in this category.

4.4.4.3.2.1 ARRAY TYPES

The Suffix Part of the ARRAY type is so constructed as to be able to support recursive or nested definition of arrays. The suffix format is as follows:

  +00: An LG that locates the Type Descriptor for the "base-type" of the array. This is the type of the entity being arrayed and may itself be an array.
  +04: An LG that locates the Type Descriptor for the array bounds which is a constrained ordinal type or subrange.

4.4.4.3.2.2 RECORD TYPES

RECORD types have nested scopes. The Suffix part provides a base structure by which to locate the fields local to the scope of the Record type itself. The format is as follows:

  +00: A Word containing an LL which locates the local Hash Table that provides access to the fields in the nested scope.
  +02: A Word containing an LL which locates the Dictionary Header of the initial field in the nested scope. This supports a "left-to-right" traversal of the fields in a record.
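The four-byte Prefix Part and the ARRAY and RECORD suffixes described so far lend themselves to a small decoder. The Python sketch below is illustrative only: the function names, table names and dictionary keys are assumptions of this example, little-endian byte order is assumed, and each LG is returned as four raw bytes because its internal layout is defined in section 2.2 rather than here.

```python
import struct

# Category byte at +00 of the Prefix Part, as listed in section 4.4.4.2.
CATEGORIES = {
    0x00: "untyped", 0x01: "ARRAY", 0x02: "RECORD", 0x03: "OBJECT",
    0x04: "FILE", 0x05: "TEXT", 0x06: "SUBPROGRAM", 0x07: "SET",
    0x08: "POINTER", 0x09: "STRING", 0x0A: "8087 float", 0x0B: "REAL",
    0x0C: "fixed-point ordinal", 0x0D: "BOOLEAN", 0x0E: "CHAR",
    0x0F: "enumeration",
}

# (category, modifier) pairs that correspond to named types.
NAMED_TYPES = {
    (0x0A, 0x00): "SINGLE", (0x0A, 0x02): "EXTENDED",
    (0x0A, 0x04): "DOUBLE", (0x0A, 0x06): "COMP",
    (0x0C, 0x01): "SHORTINT", (0x0C, 0x02): "BYTE",
    (0x0C, 0x05): "INTEGER", (0x0C, 0x06): "WORD",
    (0x0C, 0x0D): "LONGINT",
}

def decode_prefix(td: bytes) -> dict:
    """Decode the Prefix Part: category byte, modifier byte, size word."""
    category, modifier, size = struct.unpack_from("<BBH", td, 0)
    return {
        "category": CATEGORIES.get(category, "unknown"),
        "name": NAMED_TYPES.get((category, modifier)),  # None if un-named
        "size": size,
    }

def decode_array_suffix(td: bytes) -> dict:
    """ARRAY suffix: two LGs directly after the four-byte prefix."""
    return {"base_type_lg": td[4:8], "bounds_lg": td[8:12]}

def decode_record_suffix(td: bytes) -> dict:
    """RECORD suffix: two LL words directly after the four-byte prefix."""
    hash_ll, first_field_ll = struct.unpack_from("<HH", td, 4)
    return {"hash_table_ll": hash_ll, "first_field_ll": first_field_ll}
```

Keeping the prefix decoder separate from the suffix decoders mirrors the split made in the text: the prefix is always present, and the category byte it yields selects which suffix reader applies.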
4.4.4.3.2.3 OBJECT TYPES

OBJECT types also have nested scopes. The Suffix part provides a base structure by which to locate the fields and METHODS local to the scope of the OBJECT type itself. In addition, inheritance and VMT particulars are stored. The format is as follows:

  +00: A Word containing an LL which locates the local Hash Table that provides access to the fields and METHODS local to the nested scope.
  +02: A Word containing an LL which locates the Dictionary Header of the initial field or METHOD in the nested scope. This supports a "left-to-right" traversal of the fields and METHODS in an OBJECT.
  +04: An LG which locates the Type Descriptor of the Parent Object. This field is zero if there is no such Parent.
  +08: A Word which contains the size in bytes of the VMT for this Object. This field is zero if the object employs no Virtual Methods.
  +0A: A Word which contains the offset within the CONST DSeg Map that locates the VMT skeleton or template segment. This field equals FFFFh if the object employs no Virtual Methods.
  +0C: A Word which contains the offset within an Object instance where the NEAR POINTER to the VMT for the object is stored (within the DATA SEGMENT). This field equals FFFFh if the object employs no Virtual Methods.
  +0E: A Word which contains an LL which locates the Dictionary Header for the name of the OBJECT itself.

4.4.4.3.2.4 FILE (NON-TEXT) TYPES

This Suffix consists of an LG that locates the Type Descriptor of the base type of the file. Note that the Type Descriptor may be that of an un-typed entity (for un-typed files).

4.4.4.3.2.5 TEXT FILE TYPES

This Suffix consists of an LG that locates the Type Descriptor of the base type of the file -- in this case SYSTEM.CHAR.
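The OBJECT suffix of section 4.4.4.3.2.3 uses two different "absent" markers (zero and FFFFh), which is easy to get wrong when reading a dump by hand. The Python sketch below makes the distinction explicit. The function name and keys are invented for this example, the input is the suffix alone (the bytes after the four-byte prefix), little-endian order is assumed, and the Parent LG is kept as raw bytes.

```python
import struct

NO_VMT = 0xFFFF  # marker used at +0A and +0C when there are no Virtual Methods

def decode_object_suffix(suffix: bytes) -> dict:
    """Decode the 16-byte OBJECT suffix (hypothetical helper)."""
    (hash_ll, first_ll, parent_lg, vmt_size,
     vmt_map_offset, vmt_ptr_offset, name_ll) = struct.unpack_from(
        "<HH4sHHHH", suffix, 0)
    return {
        "hash_table_ll": hash_ll,
        "first_member_ll": first_ll,
        # An all-zero LG means the object has no Parent.
        "parent_lg": None if parent_lg == b"\x00\x00\x00\x00" else parent_lg,
        "vmt_size": vmt_size,
        "has_virtuals": vmt_size != 0,
        "vmt_map_offset": None if vmt_map_offset == NO_VMT else vmt_map_offset,
        "vmt_ptr_offset": None if vmt_ptr_offset == NO_VMT else vmt_ptr_offset,
        "name_ll": name_ll,
    }
```

The FILE and TEXT suffixes need no helper of their own in this scheme, since each is a single four-byte LG.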
4.4.4.3.2.6 SET TYPES

This Suffix consists of an LG that locates the base-type of the set itself. Pascal limits such entities to simple ordinals whose cardinality is limited to 256.

4.4.4.3.2.7 POINTER TYPES

This Suffix consists of an LG that locates the base-type of the entity pointed at.

4.4.4.3.2.8 STRING TYPES

This is a special case of an ARRAY type. The format is as follows:

  +00: An LG to the Type Descriptor SYSTEM.CHAR which is the base type of all Turbo Pascal Strings.
  +04: An LG to the Type Descriptor for the array bounds constraints for the string.

4.4.4.3.3 FLOATING-POINT TYPES

The Suffix part for all Floating-Point types is EMPTY. All data needed to specify these approximate number types is contained in the Prefix part. The Types included in this class are SINGLE, DOUBLE, EXTENDED, COMP and REAL.

4.4.4.3.4 ORDINAL TYPES

The Ordinal Types consist of the various "integer" types plus the BOOLEAN, CHAR and Enumerated types.

4.4.4.3.4.1 "INTEGERS"

These types include BYTE, SHORTINT, WORD, INTEGER and LONGINT. Their Suffix parts are identical in format:

  +00: A double-word containing the LOWER bound of the subrange constraint on the type;
  +04: A double-word containing the UPPER bound of the subrange constraint on the type;
  +08: An LG that locates the Type Descriptor of the largest upward compatible type. This is the Type Descriptor that is used to control the width of an un-typed constant in the dictionary stub. For the "integer" types, this is an LG to SYSTEM.LONGINT.
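The twelve-byte suffix shared by the "integer" types can be read as two bounds plus an LG. The Python sketch below is a guess at a convenient reader, not a statement of how Turbo Pascal treats the fields: it assumes the bounds are signed 32-bit little-endian values (which is what SHORTINT and LONGINT ranges would require), and the function name is invented for this example.

```python
import struct

def decode_ordinal_suffix(suffix: bytes) -> dict:
    """Read LOWER bound, UPPER bound and the trailing LG of an ordinal suffix."""
    # Assumption: both double-words are signed, so that -128..127 is representable.
    lower, upper = struct.unpack_from("<ll", suffix, 0)
    return {
        "lower": lower,
        "upper": upper,
        "cardinality": upper - lower + 1,
        "compatible_type_lg": suffix[8:12],  # raw LG bytes, see section 2.2
    }
```

For example, a SHORTINT suffix would be expected to carry the bounds -128 and 127, giving a cardinality of 256.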

4.4.4.3.4.2 BOOLEANS

This type Suffix has the following format:

  +00: A double-word containing the LOWER bound of the subrange constraint on the type;
  +04: A double-word containing the UPPER bound of the subrange constraint on the type;
  +08: An LG that locates the Type Descriptor SYSTEM.BOOLEAN. There is no "upward compatible" type.

4.4.4.3.4.3 CHARS

This type Suffix has the following format:

  +00: A double-word containing the LOWER bound of the subrange constraint on the type;
  +04: A double-word containing the UPPER bound of the subrange constraint on the type;
  +08: An LG that locates the Type Descriptor SYSTEM.CHAR. There is no "upward compatible" type.

4.4.4.3.4.4 ENUMERATIONS

This type Suffix is unusual and has the following format:

  +00: A double-word containing the LOWER bound of the subrange constraint on the type;
  +04: A double-word containing the UPPER bound of the subrange constraint on the type;
  +08: An LG that locates the Prefix of the current Type Descriptor. There is no upward compatible type.

What follows is a full-fledged SET Type Descriptor whose base type is the Type Descriptor of the Enumerated Type itself. The author has not yet discovered the reason for this.

4.4.4.3.5 SUBPROGRAM TYPES

The length of this Suffix is variable. The format is as follows:

  +00: An LG that locates the Type Descriptor of the FUNCTION result returned by the Subprogram. This field is zero if the Subprogram is a PROCEDURE.
  +04: A Word that contains the number of Formal Parameters in the Function/Procedure header. If non-zero, then this word is followed by the parameter list itself as a simple array of parameter descriptors.
       The format of a parameter descriptor is as follows:
         +00: An LG that locates the Type Descriptor of the corresponding parameter;
         +04: A Byte that identifies the parameter passing mechanism used for this entry as follows:
                02h -> VALUE of parameter is passed on STACK,
                06h -> ADDRESS of parameter is passed on STACK.

5. MAPS AND LISTS

The "MAPS and LISTS" are not part of the symbol dictionary. Rather, these structures provide access to the Code and Data Segments produced by the compiler or included via the {$L name.OBJ} directive. The format and purpose (as understood by this author) of each of these tables is explained in the following sections.

5.1 PROC MAP

The PROC Map provides a means of associating the various Function and Procedure declarations with the Code Segments. There is some evidence that the Compiler produces CODE (and DATA) Segments for EACH of the Subprograms defined in the Unit as well as for the un-named Unit Initialization code block. There is also evidence that EXTERNAL PROCs must be assembled separately in order to exploit fully the Turbo "Smart Linker" since Turbo Pascal places some significant restrictions on EXTERNAL routines in the area of Segment Names and Types. Specifically, only code segments named "CODE" and data segments named "DATA" will be used by the "Smart Linker" as sources of code and data for inclusion in a Turbo Pascal .EXE file.

The first entry in the PROC Map is reserved for the Unit Initialization block. If there is no Unit Initialization block, this entry will be filled with $FF. In addition, each and every PROC in the Unit has an entry in this table. If an EXTERNAL routine is included, then ALL PUBLIC PROC definitions in that routine must be declared in the Unit Source Code with the EXTERNAL attribute.

The size of the PROC Map Table (in Bytes) is implied in the Unit Header by the LL's that occur at offsets +0C and +0E. The Format of a single PROC Map Entry is as follows:

  +00: A Word that contains an offset within the CSeg Map. This is used to locate the code segment containing the PROC.
  +02: A Word that contains an offset within the CODE Segment that defines the PROC entry point relative to the load point of the referenced CODE Segment.

5.2 CSEG MAP

The CSeg Map provides a convenient descriptor table for each CODE Segment present in the Unit and serves to relate these segments with the Segment Relocation Data and the Segment Trace Table. It seems reasonable to infer that the "Smart Linker" is able to include/exclude code/data at the SEGMENT level only. The CSeg Map is an array of fixed-length records whose format is as follows:

  +00: A Word apparently reserved for use by TURBO.
  +02: A Word that contains the Segment Length (in bytes).
  +04: A Word that contains the Length of the Relocation Data Table for this Code Segment (in bytes).
  +06: A Word that contains the offset of the Trace Table Entry for this Segment (if it was compiled with DEBUG Support). If there is no Trace Table for this segment, then this Word contains FFFFh.

5.3 TYPED CONST DSEG MAP

The CONST DSeg Map provides a convenient descriptor table for each DATA Segment present in the Unit which was spawned by the presence of Typed Constants or VMT's in the Pascal Code. It serves to relate these segments with the Segment Relocation Data and with the Code Segments that refer to these DATA elements. The CONST DSeg Map is an array of fixed-length records whose format is as follows:

  +00: A Word apparently reserved for use by TURBO.
  +02: A Word that contains the Segment Length (in bytes).
  +04: A Word that contains the Length of the Relocation Data Table for this DATA Segment (in bytes).
  +06: A Word that contains an LL which locates the OBJECT that owns this VMT skeleton or zero if the segment is not a VMT skeleton.

It is possible to determine the containing scope for a Typed Constant declaration but -- unless it is for a VMT -- the job is a bit tedious. Essentially, one has to search the Symbol Dictionary for a declaration whose offset points to a given entry and the complete path to that symbol must be recorded. Our program doesn't do this but it can be done if the required dictionary entries are present.

5.4 GLOBAL VAR DSEG MAP

The VAR DSeg Map provides a convenient descriptor table for each DATA Segment present in the Unit. One entry exists for each CODE segment which refers to GLOBAL VAR's allocated in the DATA Segment. These references may be seen in the Relocation Data Table. Each EXTERNAL CSeg having a segment named DATA also spawns an entry in this table. Only the Code Segments that meet these criteria cause entries to be generated in the VAR DSeg Map. The VAR DSeg Map is an array of fixed-length records whose format is as follows:

  +00: A Word apparently reserved for use by TURBO.
  +02: A Word that contains the Segment Length (in bytes). This may be zero, especially if the EXTERNAL routine contains a DATA segment whose sole purpose is to declare one or more EXTRN symbols that are defined in some DATA segment external to the Assembly.
  +04: A Word apparently reserved for use by TURBO.
  +06: A Word apparently reserved for use by TURBO.

To determine the identity of the CSeg that owns some particular entry in this table, examine the Relocation Data for ALL CSegs. Each CSeg which makes reference to a DATA segment has an entry in this table.
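Since the CSeg Map and the CONST DSeg Map are both arrays of four-word records, one pass with a fixed record format is enough to list them. The Python sketch below is a minimal illustration under stated assumptions: the function names and keys are invented, the caller is assumed to have already sliced the raw map out of the unit (the bounds come from the Unit Header, which is not decoded here), and little-endian byte order is assumed.

```python
import struct

NONE_MARK = 0xFFFF  # "no Trace Table" marker in the CSeg Map

def parse_cseg_map(table: bytes) -> list:
    """Split a raw CSeg Map into its 8-byte records."""
    entries = []
    # iter_unpack raises struct.error if the table is not a multiple of 8 bytes.
    for reserved, seg_len, reloc_len, trace_off in struct.iter_unpack("<4H", table):
        entries.append({
            "length": seg_len,
            "reloc_length": reloc_len,
            "trace_offset": None if trace_off == NONE_MARK else trace_off,
        })
    return entries

def parse_const_dseg_map(table: bytes) -> list:
    """Split a raw CONST DSeg Map into its 8-byte records."""
    entries = []
    for reserved, seg_len, reloc_len, owner_ll in struct.iter_unpack("<4H", table):
        entries.append({
            "length": seg_len,
            "reloc_length": reloc_len,
            # A zero LL means the segment is not a VMT skeleton.
            "vmt_owner_ll": owner_ll or None,
        })
    return entries
```

The reserved word at +00 is read but discarded in both readers; a tool that wants to study it could simply add it to each dictionary.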

5.5 DONOR UNIT LIST

This list contains an entry for each Unit (taken from the "USES" list) which MAY contribute either CODE or DATA to the executable file. Not all units do make such a contribution as some exist merely to define a collection of Types, etc. A Unit gets into this list if there exists a single Relocation Data Entry that references CODE or DATA in that Unit. The list consists of elements whose SIZE is variable and whose format is as follows:

  +00: A WORD apparently reserved for use by TURBO.
  +02: A variable-length String containing the unit name.

5.6 SOURCE FILE LIST

This list contains an entry for each "source" file used to compile the Unit. This includes the Primary Pascal file, files containing Pascal code included by means of the {$I filename.xxx} compiler directive, and .OBJ files included by the {$L filename.OBJ} compiler directive. The order of entries in this list is critical since it maps the CODE segments stored in the unit. The order of the entries is as follows:

  1) The Primary Pascal file;
  2) All Included Pascal files;
  3) All Included .OBJ files.

Mapping of CSegs to files is done as follows:

  a) Each .OBJ file contributes a SINGLE Code Segment (if any). Note that this author has not observed an .OBJ module that contains only a DATA Segment (but that seems a distinct possibility).
  b) The Primary Pascal file (augmented by all included Pascal Files) contributes zero or more CODE Segments.

Therefore, there are at least as many CSeg entries as .OBJ files. If more, then the excess entries (those at the front of the list) belong to the Pascal files that make up the Pascal source for the unit.
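The counting argument above can be written down in a few lines. This Python sketch is only an illustration of that argument: the function name and the "<Pascal source>" label are invented, and it assumes the simple case in which every .OBJ file contributes exactly one Code Segment and the trailing CSegs follow the order of the .OBJ entries. An .OBJ file with no code segment would break the assumption.

```python
def assign_csegs(cseg_count: int, obj_files: list) -> list:
    """Return one owner label per CSeg Map entry, in map order."""
    if cseg_count < len(obj_files):
        raise ValueError("fewer CSegs than .OBJ files: "
                         "the one-segment-per-.OBJ assumption does not hold")
    # The excess entries at the front belong to the Pascal source files.
    pascal_count = cseg_count - len(obj_files)
    return ["<Pascal source>"] * pascal_count + list(obj_files)
```

Telling which Pascal file (primary or included) produced a given leading CSeg needs more than counting; the Debug Trace Table, when present, records that link.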

The format of an entry in this list is as follows:

  +00: A flag byte that indicates the type of file represented;
         04h -> the Primary Pascal Source File,
         03h -> an Included Pascal Source File,
         05h -> an .OBJ file that contains a CODE segment.
  +01: A Word apparently reserved for use by the Compiler/Linker.
  +03: A Word that is zero for .OBJ files and which contains the file directory time-stamp for Pascal Files.
  +05: A Word that is zero for .OBJ files and which contains the file directory date-stamp for Pascal Files.
  +07: A variable-sized string containing the filename and extension of the file used during compilation.

5.7 DEBUG TRACE TABLE

If Debug support was selected at compile time, then all Pascal code which supports Debugging produces an entry in this table. The table entries themselves are variable in size and have the following format:

  +00: A Word which contains an LL that locates the Dictionary Header of the Symbol (a PROC name) this entry represents.
  +02: A Word which contains the offset (within the Source File List) of the entry that names the file that generated the CSeg being traced. This allows the file included by means of the {$I filename} directive to be identified for DEBUG purposes, as well as code produced from the Primary File.
  +04: A Word containing the number of bytes of data that precede the BEGIN statement code in the segment. For Pascal PROCS these bytes consist of literal constants, un-typed constants, and other data such as range-checking limits, etc.
  +06: A Word containing the Line Number of the BEGIN statement for the PROC.
  +08: A Word containing the number of lines of Source Code to Trace in this Segment.
  +0A: An array of bytes whose size is at least the number of source code lines in the PROC.
       Each byte contains the number of bytes of object code in the corresponding source line. This appears to be an array of SHORTINT, since if a "line" contains more than 127 bytes, then a single byte of $80 precedes the actual byte count as a sort of "escape" and the next byte records a count of up to 255 bytes for the line. This situation has not yet been fully explored. We do not yet know what happens in the event a line is credited with spawning more than 255 bytes of code.

6. CODE, DATA, RELOCATION INFO

This area begins at the start of the next free PARAGRAPH. This means that its offset from the beginning of the Unit ALWAYS ends in the digit zero (in hexadecimal). This area contains the CODE segments, CONST DATA segments, and the Relocation Data required for linking.

6.1 OBJECT CSEGS

Each CODE segment included in the unit appears here as specified by the CSeg Map Table. Depending on usage, these segments may appear in the executable file. There are no filler bytes between segments.

6.2 CONST DSEGS

This section begins at the start of the first free PARAGRAPH following the end of the Object CSegs. This means that its offset from the beginning of the Unit ALWAYS ends in the digit zero (in hexadecimal). A DATA segment fragment appears here for each CSeg that declares a typed constant, and for each OBJECT which employs Virtual Methods. There are no filler bytes between segments.

If local symbols were generated, there is always enough information to allow documenting the scope of the declaration as well as interpreting the data in the display since the needed type declarations would also be available. Our program doesn't go to this extreme however.
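The two layout rules just given (areas start on a 16-byte paragraph, and segments inside an area are packed with no filler) are enough to compute where each segment sits in the file. The Python sketch below illustrates the arithmetic; the function names are invented, and the segment lengths are assumed to come from the +02 words of the CSeg Map and the CONST DSeg Map.

```python
def align16(offset: int) -> int:
    """Round an offset up to the next paragraph (16-byte) boundary."""
    return (offset + 15) & ~15

def layout_segments(area_start: int, cseg_lengths: list, dseg_lengths: list):
    """Return (cseg_offsets, dseg_offsets) as offsets from the start of the Unit."""
    offset = align16(area_start)
    cseg_offsets = []
    for length in cseg_lengths:      # CSegs are packed back to back
        cseg_offsets.append(offset)
        offset += length
    offset = align16(offset)         # CONST DSegs start on a fresh paragraph
    dseg_offsets = []
    for length in dseg_lengths:      # DSegs are packed back to back as well
        dseg_offsets.append(offset)
        offset += length
    return cseg_offsets, dseg_offsets
```

With a dictionary area ending at 0123h, code segments of 40h and 11h bytes and one constant segment of 8 bytes, the code segments land at 0130h and 0170h and the constant segment at 0190h.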
6.3 RELOCATION DATA TABLE

This table begins at the start of the first free PARAGRAPH following the end of the CONST DSegs. This means that its offset from the beginning of the Unit ALWAYS ends in the digit zero. There are two sections in this table: one for code, and one for data. Both sections are aligned on paragraph boundaries. This may result in a "slack" entry between the code and data sub-sections, but this entry is included in the byte tally for the section stored in the Unit Header Table at ULPtch (offset +20).

The table begins with entries for the CSeg Map and ends with entries for the CONST DSeg Map. The appropriate Map entry specifies the number of bytes of Relocation Data for the corresponding segment. This number may be zero in which case there is no Relocation Data for the given segment.

The Table consists of an array of eight (8) byte entries whose format is as follows:

  +00: A Byte containing the offset within the Donor Unit List of the Unit name that this entry refers to. This can be the compiled Unit or some previously compiled external unit.
  +01: A Byte that defines the type of reference being made and implies the size of the pointer needed (WORD or DWORD). The known and/or observed values are as follows:
         00h -> a WORD refers to a PROC Map.
         10h -> a WORD refers to a PROC Map.
         20h -> a WORD refers to a PROC Map.
         30h -> a DWORD pointer refers to a PROC Map.
         50h -> a WORD refers to a CSeg Map.
         60h -> a WORD refers to an unknown Map.
         70h -> a DWORD pointer refers to a CSeg Map.
         90h -> a WORD refers to a VAR DSeg Map.
         A0h -> a WORD refers to a DSeg Map for SEG address.
         D0h -> a WORD refers to a CONST DSeg Map.
  +02: A Word containing the offset within the Map table referenced according to the above code scheme.
  +04: A Word containing an offset within the target segment which will be added to the effective address. For example, a reference to the VAR DSeg Map will require a final offset to locate the item (variable) within the DATA SEGMENT being referenced here. This may also be needed for references to LITERAL DATA embedded in a CODE SEGMENT.
  +06: A Word containing the offset within the CODE or DATA segment owning this entry that contains the area to be patched with the value of the final effective address.

For some truly wild guessing about the flag byte above, the following pattern seems to be emerging. Look at bits 7-4 of this byte. It appears that the type of Map reference may be coded into bits 7-6 and that the size or type of reference may be coded into bits 5-4. Note that bits 7-6 are "00" for PROC Map items, "01" for CSeg Map items, "10" for Global DSeg Map items, and "11" for Const DSeg Map items. Note also that all FAR (DWORD) pointer references show bits 5-4 as "11", that a SEGMENT Register value appears as "10", and that WORD values otherwise appear as "01" or "00". Further, no type 00h item has been seen which has a non-zero effective address adjustment. This all seems to suggest the following code structure:

  7654 3210  (bits 3-0 don't seem to be used)
  00-- ----  Locate item via a PROC Map,
  01-- ----  Locate item via a CSeg Map,
  10-- ----  Locate item via a Global DSeg Map,
  11-- ----  Locate item via a Const DSeg Map,
  --00 ----  WORD offset has NO effective address adjustment,
  --01 ----  WORD offset HAS an effective address adjustment,
  --10 ----  WORD is content of a SEGMENT Register such as DS or CS.
  --11 ----  DWORD (FAR) pointer is supplied with possible effective address adjustment.

The evidence in support of this conjecture is both slim and vast. It all depends on how much data one looks at. I have looked at a lot of data from the Borland supplied units and I haven't found anything to refute the above. Accordingly, the supplied program interprets this flag byte according to this scheme.

7. SUPPLIED PROGRAM

In order that the above information be made constructively useful, the author has designed a program that automates the process of discovery. It is not a "handsome" program and it is not a work of art. It does give useful results provided your PC has enough available memory.

It should be obvious that the program was not designed "top-down". Rather, it just evolved as each new discovery was made. Later on, it seemed reasonable to try to document some of the relations between the various lists and tables and the program tries to make some of these relations clear, albeit with varying degrees of success.

7.1 TPUNEW

This is the main program. It will ask for the name of the unit to be documented. Reply with the unit name only. The program will append the ".TPU" extension and will search for the proper file. The program will then ask if Dis-Assembly is desired and will require a "y" or "n" answer.

The current directory will be searched first, followed by all directories in the current PATH. The program will NOT search a ".TPL" (Turbo Pascal Library) file.

If the desired unit is found, the program will write a report to the current directory named "unitname.lst" which contains its analysis. The format of the report is such that it may be copied to a printer if that printer supports TTY control codes with form-feeds.
Be judicious in doing this, however, since there can be a lot of
information.  The Turbo SYSTEM.TPU unit file produces almost ninety
(90) pages without the disassembly option.  When disassembly is
requested for the SYSTEM unit, the size of the output file exceeds
700K bytes.

7.2 TPURPT1

This is a Unit that contains the text-file output primitives required
by the main program.  It's not very pretty but it does work.

7.3 TPUAMS1

This Unit contains all Type Definitions, Structures, and "Canned"
Functions and Procedures required by the main program.  All
structures documented in this report are also documented in TPUAMS1
by means of the TYPE mechanism.  Some of the structures are difficult
if not impossible to handle using ISO Pascal but Turbo Pascal
provides the means for getting the job done.

7.4 TPUUNA1

This unit is a rudimentary disassembler.  The output will not
assemble and may look strange to a real assembler programmer since
this author is not so qualified.  However, the basis for support of
80286, 80386 etc. processors is present as well as coprocessor
support.  Of perhaps the greatest interest is that it does appear to
decode the emulated coprocessor instructions that are implemented via
INT 34h-3Dh.

Be warned, however.  The output is not guaranteed since this was
coded by myself and I am perhaps the rankest amateur that ever
approached this quite awful assembler language.  For convenience, the
operand coding mimics TASM "Ideal" mode.  As is usual with programs
of this type, error-recovery is minimal and no context checking is
performed.  If the operation code is found to be valid, then a valid
instruction is assumed -- even if invalid operands are present.
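
As a rough illustration of the emulator decoding mentioned in this
section, the sketch below maps an emulator interrupt number back to
the coprocessor operation it stands in for.  It is written in Python
for brevity and is not taken from TPUUNA1.  It assumes the commonly
documented convention that INT 34h through INT 3Bh correspond to the
ESC opcodes D8h through DFh, that INT 3Ch carries a segment override,
and that INT 3Dh is a stand-alone FWAIT.  The function names and the
wording of the descriptions are invented for the example.

```python
def emulated_escape(int_no):
    """Describe what emulator interrupt `int_no` replaces (None if n/a)."""
    if 0x34 <= int_no <= 0x3B:
        # Assumed convention: INT 34h..3Bh <-> ESC opcodes D8h..DFh.
        return "ESC opcode %02Xh" % (0xD8 + (int_no - 0x34))
    if int_no == 0x3C:
        return "ESC opcode with segment override"
    if int_no == 0x3D:
        return "stand-alone FWAIT"
    return None  # not an emulator interrupt


def scan(code):
    """Yield (offset, description) for each CD 34..3D pair in `code`.

    This is a naive byte scan with no context checking, so a CDh byte
    that is really part of an operand can be misread.
    """
    i = 0
    while i + 1 < len(code):
        if code[i] == 0xCD:              # INT imm8
            what = emulated_escape(code[i + 1])
            if what is not None:
                yield i, what
                i += 2
                continue
        i += 1


if __name__ == "__main__":
    sample = bytes([0x90, 0xCD, 0x35, 0x06, 0xCD, 0x3D])
    for offset, what in scan(sample):
        print("%04X  %s" % (offset, what))
```

A real disassembler would go on to decode the bytes that follow each
interrupt as the operand of the emulated instruction; the sketch stops
at identifying the interrupt.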
The only positives that apply to this program are that it doesn't
slow the CPU down (although a lot more output is produced), and it
does let one "tune" code for compactness by letting one view the
results of the coding directly.  Also, incomplete instructions are
handled as data rather than overrunning into the next proc.

7.5 MODIFICATIONS

It was intended from the beginning that this program should be able
to be enhanced to permit external units to be referenced during the
analysis of any given unit, even if they were library components.
The author hopes that users so inclined will find the code pliable
enough to engineer such enhancements.  No small amount of care was
expended to make pointer references flexible enough so that more than
one unit could be addressed at one time.  However, none of the
references to external units are resolved by the program as it now
stands.

This program was NOT intended as a pilot for some future product.  It
WAS intended as a rather "ersatz" tool for myself.

7.6 NOTES ON PROGRAM LOGIC

The following sections discuss a few of the methods employed by the
supplied program.

7.6.1 FORMATTING THE DICTIONARY

Printing the unit dictionary area in a way that exposes its
underlying semantics is no small task.  The unit dictionary area
itself is a rather amorphous-looking mass of data composed of hash
tables, dictionary headers and stubs, type descriptors, etc.  In
order to present all this information in a meaningful way, we have to
reveal its structure and this cannot be done by means of a sequential
"browse" technique.  Rather, we have to visit all nodes in the
dictionary area so that each may be formatted in a way that exposes
its function and meaning.
This is made necessary by the fact that items are added to the
dictionary as encountered and no convenient ordering of entry types
exists.  What we have here is the problem of finding a minimal
"cover" for the dictionary area that properly exposes its content and
structure.

To do this, we construct (in the heap) a stack and a queue, both of
which are initially empty.  The entries we put in the stack identify
the class of entry (Hash Table, Dictionary Header, Type Descriptor or
In-Line Code group), the location of the structure, and the location
of its immediate "owner" or "parent" dictionary entry (which allows
some limited information about scope to be printed).

To the empty stack, we add an entry for the unit name dictionary
entry, the INTERFACE hash table, and the DEBUG hash table.  All these
are located via direct pointers (LL's) in the Unit Header Table.  We
then pop one entry off the stack and begin our analysis.

   a) If the entry we popped off the stack is not present in the
      queue, we add it and call a routine that can interpret (aka,
      "cover") the entry as a Dictionary Header, Hash Table, or Type
      Descriptor.  (This may lead to additional entries being added
      to the stack such as nested-scope hash tables, Dictionary
      Headers, Type Descriptors or In-Line Code group entries.)

   b) While the stack is not empty, we pop another entry and repeat
      step "a" (above) until no more entries are available.

The result is a queue containing one entry for each structure in the
unit dictionary area that is identifiable via traversal.  (In
practice, the method we use is a non-recursive traversal of an n-way
tree.  Since pending entries are held on a stack, the order in which
nodes are visited resembles a "depth-first" more than a
"breadth-first" traversal.)  Each entry in the queue contains the
information described above and the queue itself thus forms a set of
descriptors that drive the process of formatting the dictionary area
for display.
The process may be likened to "painting by the numbers" or to finding
a way to lay tile on a flat surface using tiles of four different
irregular shapes until the floor is exactly covered.

There is one significant limitation that needs to be pointed out.  It
is not always possible to determine the "parent" or "owner" of a node
with certainty.  The following discussion illustrates the problem of
finding the "real" parent of a Type Descriptor.

Almost every "type" in Pascal is actually derived from the basic
types that are (in Turbo Pascal) defined in the SYSTEM.TPU unit --
e.g. "INTEGER", "BYTE", etc.  In addition, several of the Type
Descriptors in the SYSTEM unit are referenced by more than one
Dictionary Entry.  Thus, we find that a "many-to-one" relationship
may exist between Dictionary Entries and Type Descriptors.  How does
one find out which is the entry that actually gave rise to the Type
Descriptor?

The Dictionary Area of a unit has some special properties, one of
which is the fact that the Dictionary Entries for named Types are
often located quite near their primary type descriptors.  The
Dictionary Area seems to be treated as an upward-growing heap with
the various structures being added by Turbo as needed.  This makes it
likely that the Type "Q" header which gives rise to a type descriptor
occurs earlier in the Dictionary Area than any other header which
refers to the same descriptor.  We take advantage of this property to
allocate "ownership" but it may not be "fool-proof".  Some type
descriptors are spawned by other type descriptors, especially for
structured types.  We don't attempt to allocate "ownership" to these
"lower-level" descriptors.
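
The stack-and-queue procedure described in this section can be
sketched as follows.  This is only an illustration in Python, not the
code of the supplied program.  The `Entry` fields, the class names
used in the demonstration and the `children_of` callback (standing in
for the routines that interpret each kind of structure) are all
assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical descriptor: class of structure, its offset in the
# dictionary area, and the offset of its presumed owner (None if unknown).
Entry = namedtuple("Entry", "kind offset owner")


def cover_dictionary(roots, children_of):
    """Visit every structure reachable from `roots` exactly once.

    `roots` stands for the entries located via the Unit Header (unit
    name entry, INTERFACE hash table, DEBUG hash table).  `children_of`
    is an assumed callback returning the entries that a given entry
    refers to.  Returns the queue of entries in the order visited.
    """
    stack = list(roots)
    queue = []      # one entry per structure, in order of discovery
    seen = set()    # (kind, offset) pairs already placed in the queue
    while stack:
        entry = stack.pop()
        key = (entry.kind, entry.offset)
        if key in seen:
            continue            # already covered (shared descriptor)
        seen.add(key)
        queue.append(entry)
        stack.extend(children_of(entry))
    return queue


if __name__ == "__main__":
    # Toy dictionary area; the offsets and links are invented.
    links = {
        0x10: [("HASH", 0x40)],
        0x40: [("HEADER", 0x80), ("HEADER", 0xA0)],
        0x80: [("TYPE", 0xC0)],
        0xA0: [("TYPE", 0xC0)],     # two headers share one descriptor
        0xC0: [],
    }

    def children_of(entry):
        return [Entry(kind, offset, entry.offset)
                for kind, offset in links[entry.offset]]

    for e in cover_dictionary([Entry("HEADER", 0x10, None)], children_of):
        print(e.kind, hex(e.offset), e.owner)
```

In this sketch a shared Type Descriptor is attributed to whichever
header happens to reach it first, which is exactly the "ownership"
uncertainty discussed above.  A fuller version could apply the
heuristic of preferring the header with the lowest offset.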
7.6.2 THE DISASSEMBLER

To start with, I apologize up front for mistakes which are bound to
be present in this routine.  I am not a MASM or TASM programmer and I
will not pretend otherwise.  This being the case, the formatting I
have chosen for the operands may be erroneous or misleading and might
(if submitted to one of the "real" assemblers) produce object code
quite different from what is expected.  I hope not, but I have to
admit it's possible.

My intention in adding this unit was to make tuning of object code
possible.  With practice and some effort, one can observe the effect
on the object module caused by specific Pascal coding.  Thus, where
compactness is an issue of paramount importance, TPUUNA1 can be of
help.  In some cases, a simple re-arrangement of the local variable
declarations in a procedure can have a significant effect on the size
of the code if it means the difference between 1 and 2-byte
displacements for each instruction that references a specific local
variable.  Potential applications along these lines seem almost
unlimited.

I adopted an operand format not unlike that of TASM "Ideal" mode
since it was more convenient to do so and looked more readable to me.
I relied on several reference books for guidance in decoding the
entire mess and I found that there were several flaws (read ERRORS)
in some of them which made the job that much more difficult.  I then
compounded my problems by attempting to handle 80286 and 80386
specific code even though Turbo Pascal does not generate code
specific to these processors.  I simply felt that the effort involved
in writing any sort of disassembly program for Turbo Pascal units was
an effort best experienced not more than once.
With all this self-flagellation out of my system once and for all, I
will try to show the basic strategy of the program and to explain the
limitations and some of the discoveries I made.

The routine is intended to be idiotically simple -- i.e., no smarter
than the DEBUG command in principle.  The basic idea is: pass some
text to the routine and get back ONE line derived from some prefix of
that text.  Repeat as necessary until all text is gone.  Thus, there
is no attempt to check the context of the text being processed.
Also, some configurations of the "modR/M" byte may be invalid for
selected instructions.  I don't try to screen these out since the
intent was to look at the presumably correct code produced by TURBO
Pascal -- not devious assembly language.  Also, this program regards
WAIT operations as "stand-alone" -- i.e., it doesn't check to see if
a coprocessor operation follows for which the WAIT might be regarded
as a prefix.

One area of real difficulty was figuring out the Floating-Point
emulations used by Turbo Pascal that are implemented by means of
interrupts $34 through $3D.  I don't know if I got it right, but the
results seem reasonable and consistent.  In the listing, the
Interrupt is produced on one line, followed by its parameters on the
next line.  The parameter line is given the op-code "EMU_xxxx" where
"xxxx" is the coprocessor op-code I felt was being emulated.
Interrupt $3C was a real puzzler but after seeing a lot of code in
context, I think that the segment override is communicated to the
emulator by means of the first byte after the $3C.

Normally, in a non-emulator environment, all coprocessor operations
(ignoring any WAIT prefixes) begin with $D8-$DF.  What Borland (and
maybe Microsoft) seem to have done here is to change the $D8-$DF so
that bits 7 and 6 of this byte are replaced with the one's complement
of the 2-bit segment register number found in various 8086
instructions.
This seems to be how an override for the DS register is passed to the
emulator.  I don't KNOW this to be the correct interpretation, but
the code I have examined in context seems to work under this scheme,
so TPUUNA uses it to interpret the operand accordingly.

For 80x86 machines, the problem was somewhat simpler.  TPUUNA takes a
quick look at the first byte of the text.  Almost any byte is valid
as the initial byte of an instruction, but some instructions require
more than one byte to hold the complete operation code.  Thus, step 1
classifies bytes in several ways that lead to efficient recognition
of valid operation codes.

Once the instruction has been identified in this way, it is more or
less easy to link to supplemental information that provides operand
editing guidance, etc.

The tables that embody the recognition scheme were constructed using
PARADOX 3.0 (another fine Borland product) and suitably coded queries
were used to generate the actual Turbo Pascal code for compilation.

For those that are interested, TPUUNA supports the address-size and
operand-size prefixes of the 80386 as well as 32-bit operands and
addresses but remember that Turbo Pascal doesn't generate these.
Provision is made for a trivial change that allows segments which
default to 32-bit mode to be handled as well.

There is a simple mode variable that gets passed to TPUUNA by its
caller which specifies the most-capable processor whose code is to be
handled.  Codes are provided for the 8086 (8088 is the same), 80186
(same as 80286 except no protected mode instructions), 80286 (80186
plus protected mode operation), and 80386.

No such specifier is provided for coprocessor support.  What is there
is what I think an 80387 supports.
I don't think that this is really a problem if you don't try to use
TPUUNA for anything but Turbo Pascal code.

Error recovery is predictably simple.  The initial text byte is
output as the operand of a DB pseudo-op and provision is made to
resume work at the next byte of text.

I hope this program is found to be useful in spite of the errors it
must surely contain.  I have yet to make much sense of the rules for
MASM or TASM operand coding and I found very little of value in many
of the so-called "texts" on the subject.  I found myself in the
position of that legendary American watching a Cricket match in
England for the first time ("You mean it has RULES?").

8. UNIT LIBRARIES

This author has examined .TPL files in passing and concludes that
their structure is trivial in the extreme.  The following notes
should be of some help.

8.1 LIBRARY STRUCTURE

A Turbo Pascal Library (.TPL) file appears to be a simple
concatenation of Turbo Pascal Unit (.TPU) files.  Since the length of
a Unit may be determined from the Unit Header (see section 3.2), it
is simple to see that one may "browse" through a .TPL file looking
for an external unit such as SYSTEM.TPU.  If this seems to be too
much effort, then there is always the TPUMOVER Utility program
supplied by Borland.

8.2 THE TPUMOVER UTILITY

Quite simply, this Utility allows one to extract units from .TPL
files in order to subject them to the analysis performed by TPUNEW
(see section 7.1).  Read your Turbo Pascal User's Guide for
instructions on the operation and use of this utility.

9. APPLICATION NOTES

One of the more obvious applications of this information would seem
to be in the area of a Cross-Reference Generator.  There is a very
fine example of such a program in the public domain called "PXL",
written by Mr. R. N. Wisan.  This program has been around since the
days of Turbo Pascal Version 1.  The program has been continually
enhanced by the author in the way of features and for support of the
newer Turbo Pascal versions.  It does not however solve the problem
of telling one which unit contains the definition of a given symbol.

In fairness to "PXL" however, this is no small problem since the
format of .TPU files keeps changing (Turbo 5.5 Units are not
object-code compatible with Turbo 5.0 Units, and so on...) and Mr.
Wisan probably has more than enough other projects to keep himself
occupied.

However, for the user who is willing to work a little (maybe a lot?),
this document would seem to provide the information needed to add
such a function to his own pet cross-reference generator.

10. ACKNOWLEDGEMENTS

This project would have been totally infeasible without the aid of
some very fine tools.  As it was, several hundred man hours have been
expended on it and as you can see, there are a few unresolved issues
that have been (graciously) left for others to address.  The tools
used by this author consisted of:

   1) Turbo Pascal 5.5 Professional by Borland International
   2) Microsoft WORD (version 5.0)
   3) LIST (version 6.4a) by Vernon D. Buerg
   4) the DEBUG utility in MS-DOS Version 3.3
   5) PARADOX 3.0 by Borland International
   6) QUATTRO PRO by Borland International
   7) TURBO ASSEMBLER 1.1 by Borland International

(PARADOX and QUATTRO PRO were used for data collection and analysis
in the course of coding the recognizer tables for the disassembler
unit.)

The references listed were of great value in this project.
[Intel85] was a valuable source of information about coprocessor
instructions as well as offering hints about the differences between
the 8086/8088 and the 80286.  The Borland TASM manuals ([Bor88a] and
[Bor88b]) offered further info on the 80186.  [Nelson] provided
presentations of well-organized data directed at the problem of
disassembly but the tables were flawed by a number of errors which
crept into my databases and which caused much of the extra debugging
effort.  [Intel89] offered valuable insights on the 80386 addressing
schemes as well as the 32-bit data extensions.  Finally, [Brown]
provided valuable clues on the Floating-Point emulators used by
Borland (and Microsoft?).  As you can see, the amount of hard
information available to me on this project was quite limited since I
am unaware of any other existing body of literature on this subject.

That's it, folks.  Does anyone wonder why it took several hundred man
hours to get to this point?  It took a lot of hard (and at times
tedious) work coupled with a great many lucky guesses to achieve what
you see here.

11. REFERENCES

[Bor88a],  TURBO ASSEMBLER REFERENCE GUIDE, Borland International,
           1988.

[Bor88b],  TURBO ASSEMBLER USER'S GUIDE, Borland International, 1988.

[Bor88c],  TURBO PASCAL REFERENCE GUIDE Version 5.0, Borland
           International, 1988.

[Bor88d],  TURBO PASCAL USER'S GUIDE Version 5.0, Borland
           International, 1988.

[Bor89],   TURBO PASCAL 5.5 OBJECT-ORIENTED PROGRAMMING GUIDE, Borland
           International, 1989.

[Brown],   INTER489.ARC, Ralf Brown, 1989.

[Intel85], iAPX 286 PROGRAMMER'S REFERENCE MANUAL INCLUDING THE iAPX
           286 NUMERIC SUPPLEMENT, Intel Corporation, 1985, (order
           number 210498-003).

[Intel89], 386 SX MICROPROCESSOR PROGRAMMER'S REFERENCE MANUAL, Intel
           Corporation, 1989, (order number 240331-001).

[Nelson],  THE 80386 BOOK: ASSEMBLY LANGUAGE PROGRAMMER'S GUIDE FOR
           THE 80386, Ross P. Nelson, Microsoft Press, 1988.

[Scanlon], 80286 ASSEMBLY LANGUAGE ON MS-DOS COMPUTERS, Leo J.
           Scanlon, Brady, 1986.