A Theory Of Documents: How SGML Can Change The Way We Look At Information

NOTE: This article is the document that accompanied a talk given at SGML ’95 by R. Alexander Milowski

R. Alexander Milowski, President and Principal Researcher, has been heavily involved in the development of XML and is a participant in the XML working group. He has been considered a leader in the SGML community for several years, has presented papers at international SGML conferences, and is the President of the regional “Midwest SGML Forum”. He is an author of a book on DSSSL, a new ISO standard for stylesheets and transformations for SGML. In addition, he is the principal architect of various HTML, XML, SGML, and DSSSL tools.

We are reposting this great presentation in case people would like to revisit this. All credits go to R. Alexander Milowski.




Traditional information systems have usually treated document production as a process of extraction. That is, one runs a “report query” and the information is extracted, processed, transformed, and then presented for viewing or printing. Such a process makes document generation tedious, and the resulting documents can easily fall out of synchronization with the information that produced them.

In the following passages, a document-centric framework is outlined in which SGML application systems and SGML information encapsulate all facets of the production and processing of information. These systems use SGML as the native interchange format and as the medium for semantics in both presentation and manipulation.

Dynamic Entities And Storage Managers

Databases exist to manage large amounts of information in an efficient manner. The key idea is not the “efficient” part but the “managing” part of the database. That is, databases allow us to manipulate large amounts of information without worrying about the validity of the information: views can be built that represent the subset of the information a user is interested in, and that view can be kept current.

In the SGML world, the incorporation of such databased information usually meant creating extraction processes that would extract the information to the file system and create large entities that could be incorporated into a document.

Since the amount of information could be quite large, this extraction poses two significant problems:

  • The information can no longer be kept up-to-date by the database without re-extracting the information.
  • The information can no longer be manipulated without changing the export process.

In addition, if we look at the use of this type of information from a document perspective, we may see many views of the same information. For example, we may have a parts database that one catalog lists in its entirety. In another document, we may use the same information reduced to only those parts that are used in a particular product. Thus, documents have views of a database just like a user application might.

What is needed falls into two basic requirements:

  • Information must be able to be accessed in multiple ways and through multiple “views”.
  • This information must be “dynamically” accessed such that it can be requested from a repository when needed.

In the SGML world, there is a solution to this: the recently produced ISO/IEC 10744 (HyTime) Technical Corrigendum 1, Annex D. This annex formalizes a relatively new concept, formal system identifiers.

Formal system identifiers specify that a system identifier has two parts: the storage manager name and the storage object identification. The storage manager name identifies a storage domain, such as the file system, that an entity manager can access. That is, it is the entry point to a storage system that can translate something like a file path into the actual file contents.

The storage object identification is simply whatever the storage manager needs to locate the storage object within its storage system. In the case of the OSFILE storage manager, which manages the file system, the storage object identification is just the path–relative or absolute–of the file to be accessed. Thus, a storage manager defines the syntax of its storage object identifiers.
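As a rough illustration of this two-part structure, the following Python sketch splits a formal system identifier into its storage manager name and its storage object identification. It assumes a simplified notation in which the manager name is written as a tag in front of the object identification; the function name parse_fsi and the sample path are hypothetical, and the sketch makes no attempt to cover the full syntax defined in the annex.

```python
import re
from typing import NamedTuple


class FormalSystemId(NamedTuple):
    storage_manager: str  # e.g. "OSFILE"
    object_id: str        # whatever that manager needs to locate the object


# Assumed, simplified notation: "<NAME>" followed by the object identification.
_FSI_PATTERN = re.compile(r"^<([A-Za-z][A-Za-z0-9.\-]*)>(.*)$", re.DOTALL)


def parse_fsi(system_id: str) -> FormalSystemId:
    """Split a system identifier into (storage manager name, object id)."""
    match = _FSI_PATTERN.match(system_id)
    if match is None:
        raise ValueError(f"not a formal system identifier: {system_id!r}")
    # The manager name is upper-cased so that <osfile> and <OSFILE> compare equal.
    return FormalSystemId(match.group(1).upper(), match.group(2))


if __name__ == "__main__":
    print(parse_fsi("<OSFILE>parts/catalog.sgm"))
```

Note that the object identification is passed through untouched: only the storage manager itself knows how to interpret it.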

What this means is that a database or any other repository can become an SGML entity (or entities in the case of views) by creating a storage manager that can access the information from the database. We can specify the information necessary to access a database in the storage object identifier and create a storage manager that accesses the database and transforms the information into appropriate SGML. Thus, our database is just another external text entity and an author only needs to insert an entity reference to access databased information.

For example, suppose a relational (SQL) database stores our part data, and we need to create a parts catalog from it. We have designed our SGML such that all we need to do is produce the parts data extract as tabular SGML that can be referenced as an entity. What we really need is for the parts database to become a set of entities that we can call into our documents. When we call them into our document using entity references, we want a query to be executed that retrieves the part data from the database.

To solve this using a storage manager we may do the following:

  • Create a SQL storage manager that can execute SQL against a database.
  • Create a query that extracts the necessary part data.
  • Have the SQL storage manager tag each result row with a part tag.
  • Have the SQL storage manager tag each result row item with the column
    name it came from.

For example, suppose our part data is stored in a table named partlist with three columns: number, desc, and price. These columns represent the part number, the part description, and the part price respectively. All the data we need for the document is stored in this one table.

A subset of this information might look like the following:

Example part data

number   desc                price
P10345   Widget Screw        0.03
P12340   Widget Panel        5.45
P10347   Widget w/o Screws   90.00
P10398   Widget w/o Panel    75.03
P20053   Widget A            100.00
P13945   Widget B            150.00
P10344   Widget C            90.00
P12435   Widget D            120.00
P10034   Widget E            350.00

We want the above information to be tagged according to this set of declarations:

<!ELEMENT PartList - - (Part+) >
<!ELEMENT Part     - - (Number,Desc,Price) >
<!ELEMENT Number   - - (#PCDATA) >
<!ELEMENT Desc     - - (#PCDATA) >
<!ELEMENT Price    - - (#PCDATA) >

A storage manager called SQL has been created that can connect to an arbitrary SQL database and execute SQL queries. When the SQL query is executed, the result set is automatically translated into tabular SGML (since SQL is always tabular, this can be done automatically).

In this case, when the SQL storage manager executes the query “select * from partlist” the result set will be coded in the following fashion:

  • A start tag <PartList> will be output at the beginning of the result since the table name (and result name) is partlist.
  • Each result row will be tagged with a Part start tag and end tag.
  • Each column of a result row will be tagged with the table column name it came from.
  • An end tag </PartList> will be output at the end of the result.

Thus, the following fragment would be generated:

<PartList>
<Part><Number>P10345</Number><Desc>Widget Screw</Desc><Price>0.03</Price></Part>
<Part><Number>P12340</Number><Desc>Widget Panel</Desc><Price>5.45</Price></Part>
<Part><Number>P10347</Number><Desc>Widget w/o Screws</Desc><Price>90.00</Price></Part>
<Part><Number>P10398</Number><Desc>Widget w/o Panel</Desc><Price>75.03</Price></Part>
<Part><Number>P20053</Number><Desc>Widget A</Desc><Price>100.00</Price></Part>
<Part><Number>P13945</Number><Desc>Widget B</Desc><Price>150.00</Price></Part>
<Part><Number>P10344</Number><Desc>Widget C</Desc><Price>90.00</Price></Part>
<Part><Number>P12435</Number><Desc>Widget D</Desc><Price>120.00</Price></Part>
<Part><Number>P10034</Number><Desc>Widget E</Desc><Price>350.00</Price></Part>
</PartList>
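The tagging rules above can be sketched in Python, using the standard sqlite3 module as a stand-in for the parts database. This is only an illustration of the idea, not the storage manager described in this article: the function names, the in-memory database, and the choice to store prices as text are all assumptions. The escaping shown is XML-style; an SGML application would need the corresponding entities declared in its DTD.

```python
import sqlite3
from xml.sax.saxutils import escape


def result_to_sgml(conn, query, root_tag, row_tag):
    """Run `query` and wrap the result set in tabular SGML markup."""
    cursor = conn.execute(query)
    # cursor.description holds one 7-tuple per result column; item 0 is its name.
    columns = [column[0] for column in cursor.description]
    lines = [f"<{root_tag}>"]
    for row in cursor:
        cells = "".join(
            # escape() replaces &, < and > so data cannot be mistaken for markup.
            f"<{name}>{escape(str(value))}</{name}>"
            for name, value in zip(columns, row)
        )
        lines.append(f"<{row_tag}>{cells}</{row_tag}>")
    lines.append(f"</{root_tag}>")
    return "\n".join(lines)


def demo_connection():
    """Build a small in-memory stand-in for the parts database (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    # Prices are stored as text so that values such as 90.00 keep their digits.
    conn.execute('CREATE TABLE partlist (Number TEXT, "Desc" TEXT, Price TEXT)')
    conn.executemany(
        "INSERT INTO partlist VALUES (?, ?, ?)",
        [("P10345", "Widget Screw", "0.03"), ("P12340", "Widget Panel", "5.45")],
    )
    return conn


if __name__ == "__main__":
    print(result_to_sgml(demo_connection(), "select * from partlist", "PartList", "Part"))
```

Because the column names are read from the result set itself, the same function works for any query, which is what makes the automatic translation of tabular results possible.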

For an SGML application to access this entity, a catalog entry must be entered in our SGML Open catalog with the following syntax:

ENTITY Data.PartList "<SQL>PARTDB//Part//select * from partlist"

In the above catalog entry, the formal system identifier names the storage manager with the tag <SQL>. This storage manager’s convention for identifying the query to execute is the database name, followed by a double slash, followed by the result item name, followed by another double slash, followed by the SQL query.
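The double-slash convention for this storage manager’s object identification can be unpacked with a few lines of Python. This is a minimal sketch; the class and function names are hypothetical, and it assumes the database name and the result item name never contain a double slash themselves.

```python
from typing import NamedTuple


class SqlObjectId(NamedTuple):
    database: str     # e.g. "PARTDB"
    row_element: str  # element name used to tag each result row, e.g. "Part"
    query: str        # the SQL text to execute


def parse_sql_object_id(object_id: str) -> SqlObjectId:
    """Split 'database//row element//query' into its three fields."""
    # maxsplit=2 leaves any later "//" inside the SQL query untouched.
    parts = object_id.split("//", 2)
    if len(parts) != 3:
        raise ValueError(f"expected database//element//query, got {object_id!r}")
    database, row_element, query = (part.strip() for part in parts)
    return SqlObjectId(database, row_element, query)


if __name__ == "__main__":
    print(parse_sql_object_id("PARTDB//Part//select * from partlist"))
```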

Hence, every time the entity Data.PartList is requested by an SGML application, a query is executed to retrieve the data, tagged appropriately. In fact, several entities could be set up that select different subsets of the information for different kinds of documents. In that case, the same information could be reused without fear of inconsistencies.

Of course, storage managers don’t have to be just data retrieval. They could, for example, execute algorithms to make decisions about the paths that documents could take or they could return the appropriate entity for some section depending on conditions. Essentially, as long as the returned information is in SGML, an SGML application can use the information.
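One way to picture an entity manager that dispatches to pluggable storage managers is the Python sketch below. Everything in it is an assumption made for illustration: the EntityManager class, its method names, and the idea of modelling a storage manager as a plain function from an object identification to SGML text.

```python
from typing import Callable, Dict

# A storage manager is modelled as: object identification in, SGML text out.
StorageManager = Callable[[str], str]


class EntityManager:
    """Resolves entity names through a catalog of formal system identifiers."""

    def __init__(self) -> None:
        self._managers: Dict[str, StorageManager] = {}
        self._catalog: Dict[str, str] = {}

    def register_manager(self, name: str, manager: StorageManager) -> None:
        self._managers[name.upper()] = manager

    def add_entity(self, entity_name: str, system_id: str) -> None:
        self._catalog[entity_name] = system_id

    def resolve(self, entity_name: str) -> str:
        """Look up the entity and ask its storage manager for fresh content."""
        system_id = self._catalog[entity_name]
        if not system_id.startswith("<") or ">" not in system_id:
            raise ValueError(f"not a formal system identifier: {system_id!r}")
        manager_name, object_id = system_id[1:].split(">", 1)
        # The manager runs on every request, so content is never a stale extract.
        return self._managers[manager_name.upper()](object_id)


if __name__ == "__main__":
    manager = EntityManager()
    # A trivial, hypothetical storage manager that just echoes its input as SGML.
    manager.register_manager("ECHO", lambda object_id: f"<Para>{object_id}</Para>")
    manager.add_entity("Demo.Para", "<ECHO>hello")
    print(manager.resolve("Demo.Para"))
```

Since a storage manager is just a function here, one that makes a decision or selects among entities plugs in exactly the same way as one that retrieves data.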

Components For Reuse

Once we have the concept of an extensible entity manager, the natural question arises of how information should be divided and encoded in SGML. In traditional applications such as relational databases, this process is called database normalization. Essentially, it is a process of reducing information to its basic sets and relationships such that no redundancy exists in the actual storage. Once the redundancy is removed, “views” are constructed to present commonly used information in a regular fashion.

In SGML, this is a question of componentization of entities. That is, defining SGML constructs–both definitions and instance fragments–such that they can be reused without duplication of information. As it turns out, this process is remarkably similar to database normalization.

In this SGML “data normalization” process, there are two major processes. First, the DTD must be componentized such that the structure definitions can be reused in many different documents. Second, the actual instance data must be componentized into reusable components according to the structure componentization accomplished within the DTD and according to how the information needs to be shared.

DTD Components

DTD Componentization is a process of deciding what components must be reused. The basic concept is deciding, as in relational database normalization, what are the basic units of information, structuring the DTD to those basic units, and providing associative and view structures such that more complex views or associations of the data can be presented.

For example, a common occurrence in many documents is a “person” or “customer” element. This element is used to encode information that may be contained within a customer database or entered by hand each time the document type is used.

As with any customer-information oriented system, it would greatly increase efficiency and the accuracy of the information if the data were entered only once in the document and referenced wherever else it is needed. Accomplishing this in SGML is a matter of doing the following:

  • Creating an element that one-to-one encapsulates the customer information that is to be collected, stored, and maintained. This element should have a unique identifier attribute.
  • Creating a “prolog” element so that this information can be embedded in the SGML instance.
  • Creating a pointer element that can be used where appropriate to point to the customer element.

The first step in this process–the one-to-one encapsulation–is a process of defining an element with structure such that it can encode all the necessary customer information. Once that is accomplished, the declarations need to be organized in such a way that the elements can be reused.

For this example, we will say that all the customer declarations can be held in one file. In a more realistic example, some declarations may be shared by many different components (such an element might be “NAME” or “PARA”–something very common).

Such a declaration might be like the following:

<!-- Common Models -->
<!ENTITY % Common.M.Data "(#PCDATA)" >

<!-- Common Elements -->
<!ENTITY % Common.Customer "(Customer)" >
<!ENTITY % Common.Address  "(Address)" >
<!ENTITY % Common.Name     "(Name)" >
<!ENTITY % Common.Phone    "(Phone)" >
<!ENTITY % Common.Street   "(Street)" >
<!ENTITY % Common.City     "(City)" >
<!ENTITY % Common.State    "(State)" >
<!ENTITY % Common.Zip      "(Zip)" >

<!-- Common Element Definitions -->
<!ELEMENT Customer - - (Name,Phone,Address) >
<!ELEMENT Address  - - (Street+,City,State,Zip) >
<!ELEMENT Name     - - (%Common.M.Data;) >
<!ELEMENT Phone    - - (%Common.M.Data;) >
<!ELEMENT Street   - - (%Common.M.Data;) >
<!ELEMENT City     - - (%Common.M.Data;) >
<!ELEMENT State    - - (%Common.M.Data;) >
<!ELEMENT Zip      - - (%Common.M.Data;) >

Now, whenever a DTD writer needs to use a customer in a document, they just need to include the above definition in their DTD using a parameter entity reference and use the %Common.Customer; parameter entity reference in the appropriate element’s content model. This element may then be managed as if it were its own little document whose base element is the customer element.

Also, whenever a DTD writer is reading a DTD that uses the component, the DTD contains a parameter entity reference wherever the component is used. That is, there is a %Common.Customer; parameter entity reference in any model where the customer element is allowed. This notation tells the DTD writer that the element is defined and managed as a separate component. Since the potential number of components is large, this notation is necessary for managing them.

Essentially, DTD components are the starting point for information to be able to be reused. For this information to be reused and consistent, the DTD itself must be componentized into logical units according to the sharing of instance data that is to take place and according to the commonality of different document types.

Instance Components

Instance componentization becomes more of an information-management issue than an SGML issue. Essentially, the DTD componentization will dictate how entities must be created. That is, elements should lend themselves to becoming separate “miniature” documents that can be edited alone.

In the previous example, it would be necessary to create, edit, and proof customer information independent of the documents it is used within. In addition, one may want to produce and proof comprehensive documents such as a list of all the customers sorted by some criteria.

Thus, there are document types that may exist just for the purpose of editing and producing ancillary documents for the maintenance of information. These documents might be the following:

  • A document type for editing a single customer’s data.
  • A document type for listing all the customers in the repository.

Each of these instances can be considered to be a template by which a component can be edited or processed. Their DTD contains only a reference to the component(s) necessary to edit the information and might need a top-level definition to tie them all together.

In the previous example, a document type to edit a customer might look something like the following:

<!DOCTYPE CustomerEdit [
<!ENTITY % Common.Def PUBLIC "-//Copernican//ELEMENTS Common Component//EN">
%Common.Def;
<!ELEMENT CustomerEdit - - (%Common.Customer;)>
<!ENTITY customer PUBLIC "-//Copernican//TEXT Current Customer//EN">
]>
<CustomerEdit>
&customer;
</CustomerEdit>

In the above, there are two significant SGML constructs. The first is the entity ‘customer’, which is used to pull in the customer to edit. The second is the use of a “wrapper” element as the base element of the DTD.

The customer entity is used to pull in the currently defined entity to edit. In most systems, a catalog entry can be maintained to point to the current customer entity such that when the document is edited, only the current entity is edited.

The use of a wrapper element is necessary if the customer entity is used. This is because a general entity reference cannot appear before the first start tag has been used. Since the customer is being included via a general entity reference, we must have a wrapper element to allow general entities to be legal.

Semantic Attachment

Whenever SGML is manipulated for any purpose, it must be represented in some kind of programmatic structure. In doing so, the SGML information has some kind of semantic attached to it so that it can be presented or processed. This is the process of semantic attachment.

In the very simple case of formatting, style sheets (FOSI, typesetting code, etc.) are used to attach typesetting notations to SGML structure. As the attachment is made, the SGML information is transformed into some kind of page description language such as PostScript. Since different style sheets can give different presentations, different semantics can be attached to SGML for the purpose of presentation.

Other systems, such as an SGML Web browser, attach display and hyper-linking (HyTime, URL, etc.) semantics to the SGML so that the document becomes an interactive display of the information. In this case, the presentation becomes a “living” representation or “program” that can access other documents or return information to the web server. Essentially, the SGML instance is being executed with a certain set of semantics, which results in the web display that the user is familiar with.

In fact, every time we manipulate SGML we are attaching semantics and executing the SGML instance; we’re just not always doing it directly. We must attribute semantics to SGML to make it accomplish some goal. This is why SGML benefits so many people: the information can be re-purposed many times for different goals without changing the information. This same benefit can be used to allow information to have a multitude of behaviors depending on the circumstances.

In a document-centered SGML solution, SGML becomes the connection between components and components are stored as document instances. Since every SGML document of the same document type shares the same document type declaration, all document instances are related through their DTD.

Hence, any cross-document (cross-unit) information may be retrieved by looking first at what relationships are important in the DTD and then traversing from those relationships to the instances that implemented those relationships. A true document-centric solution can have both entity or hyper-linking relationships within an instance as well as cross-document relationships for a document type.

If you allow this kind of thinking to continue, it is possible to surmise that if you store DTDs and instances in an SGML repository that allows instances to be related to their DTDs and DTD structure to be queried and traversed, you can create applications that utilize this cross-document and cross-component information much like a traditional database system would, without incurring the cost of data extraction and transformation.

It is possible that semantics could be attached to the DTD in much the same way that semantics are attached to an SGML instance when it is processed. Since instances are related to their DTDs, and DTDs can have semantics attached to them, instances automatically have semantics attached to them because they are instances of the document type (DTD).

Now, continuing down this line of thinking, one may posit “why do semantics need to be formatting or presentation, why can’t they be anything?” In fact, semantics can be defined to be just about anything. There is nothing in the SGML standard that defines semantics–this is the power of SGML.

Therefore, instead of an application manipulating SGML, we could define semantics for an SGML document such that it manipulates itself. We could define editing semantics that allow a document to be edited in a dialog window. For example, a customer component with a name and address could present a traditional dialog window in which the name and address can be edited. Since the semantics are attached to the SGML, the manipulation is of the document itself and no transformation is needed to get the document back again.

In addition, since a document may have many different semantics, we may define the ability for the SGML repository to store semantic domains such that a document could be executed with different semantic domains for different purposes. For example, in one semantic domain, the document may execute as an editor of itself. When the author wants to print the document, another semantic domain could be used that makes the document execute as a transformation engine to a page description language like PostScript.
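The idea of executing one document under different semantic domains can be sketched in Python as a table of handlers keyed by element name. This is only one possible reading of the idea, under stated assumptions: the Element class, the execute function, the two sample domains, and the customer data are all hypothetical, and plain strings stand in for real dialog windows or page description output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Element:
    name: str
    children: List[Union["Element", str]] = field(default_factory=list)

    def text(self) -> str:
        """Concatenate all character data below this element."""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


Handler = Callable[[Element], str]
SemanticDomain = Dict[str, Handler]  # upper-cased element name -> behavior


def execute(element: Element, domain: SemanticDomain) -> str:
    """Run the document under one semantic domain."""
    handler = domain.get(element.name.upper())
    if handler is not None:
        return handler(element)
    # No semantics attached to this element: descend into its child elements.
    return "".join(execute(c, domain) for c in element.children if isinstance(c, Element))


# Two hypothetical domains attached to the same element names.
PRINT_DOMAIN: SemanticDomain = {
    "NAME": lambda e: f"{e.text()}\n",
    "PHONE": lambda e: f"Tel. {e.text()}\n",
}
EDIT_DOMAIN: SemanticDomain = {
    "NAME": lambda e: f"Name: [{e.text()}]\n",
    "PHONE": lambda e: f"Phone: [{e.text()}]\n",
}

if __name__ == "__main__":
    customer = Element("Customer", [Element("Name", ["Ada Example"]),
                                    Element("Phone", ["555-0100"])])
    print(execute(customer, PRINT_DOMAIN))
    print(execute(customer, EDIT_DOMAIN))
```

The document tree is never changed or exported; only the table of behaviors is swapped, which is the point of keeping semantics separate from the information.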

Since documents are applications and applications can be documents, there exists a “grayness” or fuzzy region between documents and applications such that the black-and-white division between document information and application need not exist. If the document is the application, we need not worry as much about “can this application produce valid SGML?” since the application is an SGML document and must maintain its validity.

Thus, semantic attachment provides a means by which SGML gets meaning, and that meaning can be anything we need it to be. SGML can then be used as a way to interchange application information, and the recipient can use the SGML information in the way that makes most sense to it. Overall, this is the way SGML-centric and document-centric solutions can accomplish their goals.


Three major points have been made in this essay:

  • Repositories of information can become entities through storage managers.
  • DTD and instance components allow the reuse of units of information.
  • Semantic Attachment allows documents (and repositories) to execute
    as many different kinds of applications.

What this means is that a document-centric system can be created such that a user is always presented with, processing, and using documents. These documents may range from raw SGML to SGML applications in which the user does not even know that SGML is being used. In any case, SGML is the glue that pulls many kinds of information together and presents a process-independent way of representing information to which many semantics may be attached.

Since the concept of the document can be extended through semantic attachment to mean many things, document-centric solutions can provide the means to tightly integrate many forms of different information. The World Wide Web is a very good but simple example of this. Thus, SGML applications could extend web-oriented technology to present completely document-oriented solutions.
