Electronic Dissertations Library

XML: the future of web markup?, by Elliot Pritchard

XML TUTORIAL - ANYTHING TO DECLARE?

A DTD (Document Type Definition) defines which markup tags can appear in the XML document associated with it, and the structure in which they appear. A DTD can be embedded in an XML document, external to it, or a mixture of the two. An internal DTD begins with markup of the form '<!DOCTYPE RootElementName [' and ends with ']>'. This wrapper is known as the document type declaration. Note the exclamation mark, which distinguishes declarations from other markup. The various other declarations that make up the DTD fit between the square brackets (we will move on to those in a bit). With an external DTD, all of those declarations are in a separate file, and the XML file carries a document type declaration that points to it. This will be of the form:

<!DOCTYPE RootElementName SYSTEM "RootElementName.dtd">

The last alternative, then, is a mixture of an internal and an external DTD. Here is an example:

<!DOCTYPE ProductCatalogue SYSTEM "ProductCatalogue.dtd" [
<!ENTITY fishcake SYSTEM "aff.gif" NDATA GIF>
]>

Note that between the square brackets there is now a declaration beginning '<!ENTITY'. This is an 'entity declaration', which we will move on to shortly. We would tend to use an external DTD if we want to re-use that DTD for several XML files. We would use an internal DTD if it is specific to one XML file. A mixture of the two, such as the above, can be useful if we want a core DTD to be usable by several XML files, but we also want to specify something (in this instance an entity) exclusively for one file.

Apart from the document type declaration itself, there are four types of declarations that you can find in a DTD: element declarations, attribute list declarations, entity declarations, and notation declarations.
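
To make the internal arrangement concrete, here is a minimal sketch of a complete file whose DTD is purely internal. It reuses the 'ProductCatalogue' root name from above; the 'Product' element and its text content are assumptions made for illustration, and the declarations themselves are explained in the sections that follow.

```xml
<?xml version="1.0"?>
<!-- Internal DTD: every declaration sits between the square brackets -->
<!DOCTYPE ProductCatalogue [
  <!ELEMENT ProductCatalogue (Product+)>
  <!ELEMENT Product (#PCDATA)>
]>
<ProductCatalogue>
  <Product>Andy's Frozen Fishcakes</Product>
</ProductCatalogue>
```

Moving the two element declarations into a separate file named 'ProductCatalogue.dtd', and replacing the bracketed part with SYSTEM "ProductCatalogue.dtd", would turn this into the external form.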


ELEMENT DECLARATIONS

These specify what elements can appear in a document and the structure in which they appear. They take the form:

<!ELEMENT ElementName ('content model')>

The part in brackets termed 'content model' is where we specify what an element can contain (such as other elements, or character data, or both). It would take a slightly different form if the element contained no content. In that instance it would look like this:

<!ELEMENT ElementName EMPTY>

Let us move on to an example of an element declaration:

<!ELEMENT ProductCatalogue (Product+)>

This declaration tells us that there is an element 'ProductCatalogue' that contains one or more instances of the element 'Product'. You will notice the plus sign (+) in that example - it is this that tells us that the element will occur one or more times. There are a whole host of symbols to remember when it comes to element declarations, so we shall go through them all, and then through some indicative examples.

  • No symbol ( ) - An unadorned name means a single instance of that element.
  • Plus sign (+) - There may be one or more instances of this element.
  • Question mark (?) - This element may appear zero times or one time (it is optional).
  • Asterisk (*) - This element may appear zero or more times (it is optional and repeatable).
  • Comma (,) - Between element names, commas indicate that those elements must appear in the stated sequence.
  • Vertical bar (|) - Between element names, these indicate that you can choose amongst these elements.
  • Parentheses (()) - You can use parentheses to combine sequences and choices.
  • PCDATA (#PCDATA) - This indicates 'parsed character data'. In other words, the actual text that the document contains. A content model can contain just PCDATA, or a mixture of PCDATA and element names. In the second case it has to take this form: #PCDATA must come first, all of the names must be separated by vertical bars, and the entire group must be followed by an asterisk (*).

That is a lot to remember, so we shall work through a few examples to try and make things clear. Here is the first one:

<!ELEMENT ContactDetails (Address, TelephoneNumber, E-mailAddress?)>

This declaration tells us that there is an element 'ContactDetails' that contains firstly one 'Address' element, secondly one 'TelephoneNumber' element, and finally zero or one 'E-mailAddress' elements (i.e. this last element is optional). Here is another example:

<!ELEMENT ContactDetails (Address, (TelephoneNumber | E-mailAddress))>

In this one we see that the element 'ContactDetails' contains firstly one 'Address' element and then exactly one of either a 'TelephoneNumber' element or an 'E-mailAddress' element. (Placing a plus sign after the inner group's closing parenthesis would instead permit one or more of the two, in any order.) Here is an easy one:

<!ELEMENT TelephoneNumber (#PCDATA)>

This declaration tells us that there is an element called 'TelephoneNumber' which contains 'parsed character data', in this instance a telephone number. And finally:

<!ELEMENT Address (#PCDATA | Postcode)*>

This tells us that the element 'Address' contains any mixture of character data and 'Postcode' elements, in any order and any number, including none at all (if you are in the US, 'ZipCode' would be appropriate).
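
As a sketch of how these declarations fit together, the following document combines the first 'ContactDetails' declaration with the 'Address' and 'TelephoneNumber' declarations above. The declarations for 'Postcode' and 'E-mailAddress' are assumed here (the text does not give them), and the address and telephone number are made up.

```xml
<?xml version="1.0"?>
<!DOCTYPE ContactDetails [
  <!ELEMENT ContactDetails (Address, TelephoneNumber, E-mailAddress?)>
  <!ELEMENT Address (#PCDATA | Postcode)*>
  <!ELEMENT Postcode (#PCDATA)>
  <!ELEMENT TelephoneNumber (#PCDATA)>
  <!ELEMENT E-mailAddress (#PCDATA)>
]>
<ContactDetails>
  <!-- Mixed content: character data with a Postcode element inside it -->
  <Address>1 Harbour Road, Sheffield <Postcode>S1 1AA</Postcode></Address>
  <TelephoneNumber>0114 000 0000</TelephoneNumber>
  <!-- E-mailAddress is optional (?), so it is simply left out here -->
</ContactDetails>
```

Swapping the order of 'Address' and 'TelephoneNumber', or leaving 'TelephoneNumber' out, would make the document invalid, because the commas fix the sequence and only 'E-mailAddress' carries a question mark.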


ATTRIBUTE LIST DECLARATIONS

These specify what attributes can appear in a document. They take the form:

<!ATTLIST ElementName AttributeName AttributeType DefaultValue>

The 'AttributeName' tells us what this attribute is called. The 'ElementName' tells us which element this is an attribute of. The 'AttributeType' and 'DefaultValue' are slightly more complicated. There are several varieties of each of these, which I will detail now, starting with the attribute type. The possible types fall into the following groups:

  • CDATA - Meaning 'character data'. With this type of attribute, any string of characters is allowed for its value. However, this is not to be confused with the CDATA sections which I mentioned in the last section: inside an attribute value entity references are still recognized, and a literal '<' is not allowed.
  • NMTOKEN or NMTOKENS - Meaning 'name token' or 'name tokens'. These are like CDATA, but they are restricted in what characters they allow. The attribute values must be made up of letters, digits and the following select characters: full stop (.), dash (-), underscore (_), and colon (:). Unlike XML names, there is no restriction on which of these characters may come first. An NMTOKENS value is a list of name tokens separated by spaces.
  • ID and IDREF - An ID attribute uniquely identifies a particular occurrence of an element. You can then use an IDREF (or IDREFS, for a space-separated list) to refer back to that element. Combined, they are a way of linking parts of an XML document. These are subject to the restrictions of XML names, which I mentioned in the introduction to this tutorial.
  • ENTITY or ENTITIES - The attribute value is the name of an unparsed (NDATA) entity declared in the DTD or, for ENTITIES, a space-separated list of such names. The name is written on its own, without the ampersand and semicolon of an entity-reference.
  • ENUMERATED ATTRIBUTE TYPES - These allow you to provide a menu of options. You write this as a list of options which are surrounded with parentheses and which are separated with vertical bars (|); no keyword comes before the list. Each of these options has to adhere to the restrictions of 'name tokens' (see above). Related to this is what is called a NOTATION ATTRIBUTE TYPE, which is written by typing 'NOTATION' followed by the same form of list, and specifically declares that the element's content conforms to one of the listed 'notations'. A notation is the format that the information is in, and each notation used in an XML document must also be declared using a 'notation declaration' (which we will come to shortly). All notation choices adhere to the restrictions of XML names.

We will go through example attribute declarations with different attribute types, but let me first describe the four different types of default values that attributes can have:

  • "value" - The default value is simply specified in quotes at the end of the attribute list declaration.
  • #REQUIRED - Every instance of the element must carry the specified attribute with an explicitly given value; no default is supplied.
  • #IMPLIED - The attribute is optional and no default is supplied. As the name suggests, when the attribute is absent the processing program is left to 'imply' whatever value it feels to be appropriate.
  • #FIXED "value" - If the attribute occurs it must have this value, and this value only. If it is left out, the processor supplies this value.

As you may have gathered, there are quite a few variations of attribute list declarations that you can have. We will run through a few examples to clarify the format and also to shed some light on the more complex instances. Here is the first:

<!ATTLIST Product slogan CDATA "The finest in frozen fish">

This declaration tells us that the 'slogan' attribute of a 'Product' element has a value consisting of character data, and that if none is provided (say, if one has not been thought of yet) the default slogan above is used. Here is another pair of declarations:

<!ATTLIST Product number ID #IMPLIED>
<!ATTLIST ProductLink target IDREF #REQUIRED>

The 'number' attribute in the first of these is an 'ID' type. The second of these allows a 'ProductLink' element to link to that particular product by use of an IDREF attribute whose value is the same as that product's ID attribute. One could use these attributes in a document so that the mention of a product in general company details could be linked to the details of that particular product. Note that not every product is required to have an ID, but that every product link must specify a value (in order that it correspond to a particular ID value). Because ID values must be XML names, they cannot begin with a digit, so a value such as 'p101' would be used rather than '101'. Here are two more examples of declarations:

<!ATTLIST Fish status (fresh | frozen) #REQUIRED>
<!ATTLIST FishersPrice currency NOTATION (sterling | dollars) #REQUIRED>

The first of these tells us that a 'Fish' element can have a status of either fresh or frozen. The second says that when describing the price of fish there is an option as to whether the currency expressed is in sterling or dollars. In both these cases it requires that a value be expressed.
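
The following sketch shows several of these attribute declarations at work in one small document. The 'Catalogue' root element, the content models, and the placing of the 'status' attribute on 'Product' rather than on a separate 'Fish' element are all assumptions made to keep the example short; the ID value 'p101' is made up.

```xml
<?xml version="1.0"?>
<!DOCTYPE Catalogue [
  <!ELEMENT Catalogue (Product+, ProductLink*)>
  <!ELEMENT Product (#PCDATA)>
  <!ELEMENT ProductLink EMPTY>
  <!ATTLIST Product slogan CDATA "The finest in frozen fish">
  <!ATTLIST Product number ID #IMPLIED>
  <!ATTLIST Product status (fresh | frozen) #REQUIRED>
  <!ATTLIST ProductLink target IDREF #REQUIRED>
]>
<Catalogue>
  <!-- slogan is omitted, so the declared default applies -->
  <Product number="p101" status="frozen">Andy's Frozen Fishcakes</Product>
  <!-- number is omitted, which #IMPLIED allows; slogan overrides the default -->
  <Product status="fresh" slogan="A Right Royal Treat">Andy King's King Prawns</Product>
  <!-- target must match an ID value that exists in the document -->
  <ProductLink target="p101"/>
</Catalogue>
```

Leaving out 'status' on either product, or pointing 'target' at a value that no ID attribute carries, would make the document invalid.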


ENTITY DECLARATIONS

Entities, as I mentioned in the last section, allow us to insert information into a particular place, or various particular places, of an XML document. We mark these places by inserting an 'entity-reference' (of the form '&EntityName;') there, and in our DTD we insert an entity declaration to specify what we want to be inserted into that place when the document is processed. An entity declaration takes the form:

<!ENTITY EntityName "entity-content">

There are two main types of entities: internal and external. With an internal entity, the entity-content is given in the declaration itself. Here is an example of an internal-entity declaration:

<!ENTITY fishcake "Andy's Frozen Fishcakes">

When your XML document is being processed, every time it comes across the entity-reference '&fishcake;' it will replace it with the text 'Andy's Frozen Fishcakes'. You can also include markup in your entity-content. For example:

<!ENTITY fishcake "<ProductName>Andy's Frozen Fishcakes</ProductName>">
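
As a short sketch of an internal entity in use, the document below declares the plain-text 'fishcake' entity and refers to it twice. The surrounding 'Product' structure and the slogan wording are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE Product [
  <!ELEMENT Product (ProductName, Slogan)>
  <!ELEMENT ProductName (#PCDATA)>
  <!ELEMENT Slogan (#PCDATA)>
  <!ENTITY fishcake "Andy's Frozen Fishcakes">
]>
<Product>
  <!-- Each &fishcake; is replaced by the entity-content when parsed -->
  <ProductName>&fishcake;</ProductName>
  <Slogan>Nothing beats &fishcake; on a Friday</Slogan>
</Product>
```

After processing, the 'ProductName' element contains the text 'Andy's Frozen Fishcakes', and the same text appears inside the slogan.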

You can use external entities to refer to a different file altogether. Here is an example of an external-entity declaration:

<!ENTITY fishcake SYSTEM "http://www.fish-kingdom.com/aff.xml">

Note the added word 'SYSTEM'. This time when your XML document is being processed, every time it comes across the entity-reference '&fishcake;' it will replace it with the content of the file at the URL listed. Unless you specify otherwise, the processor will always expect your external entity to contain XML and will therefore attempt to process it as such. You do not want this to happen, though, if you are linking to something other than XML. If this is the case, then you can add the 'NDATA' keyword to stop the entity-content from being 'parsed' (processed as XML). Here is an example of the 'NDATA' keyword in use:

<!ENTITY fishcake SYSTEM "http://www.fish-kingdom.com/aff.gif" NDATA GIF>

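Here the entity-content is a graphic instead of XML. Note the inclusion of 'GIF' at the end of this declaration. This names the notation (here, the file format) of the file being linked to, and that notation must itself be declared with a notation declaration. An unparsed entity like this cannot be inserted with an entity-reference; instead its name is given as the value of an attribute of type ENTITY or ENTITIES.

You can also use entities in the DTD itself. These are called 'parameter' entities and their declarations are of a slightly different form. Here is an example of a parameter-entity declaration: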

<!ENTITY % fishcake "Andy's Frozen Fishcakes">

Note the percent sign (%) that is used here, followed by a space. The references for this type of entity also use this sign instead of an ampersand (they take the form '%EntityName;'), and they can only be used inside the DTD, not in the document content. Other than these differences, parameter entities follow the same form as other entities, and they can be either internal or external.

Entities can be very useful in an XML document. For example, if you make a product name the entity-content (as in the examples above), you can insert an entity-reference every time you go to use that name in your document. This way you will have complete consistency every time the name appears: as long as it is correctly spelt in the entity declaration there will be no misspellings of the name in the whole document. And if you decide to change the product name, you need only change the declaration and it will change throughout the document.

Another example is when you are trying to handle an extremely large XML document. You can split it into separate files, and then use external entity references to pull all the files into one document.
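
A common use of parameter entities is to share a fragment of a content model between several element declarations. The sketch below is written as an external DTD file, because within an internal DTD a parameter-entity reference may only stand between declarations, not inside one. The file name 'contacts.dtd' and the 'contactFields', 'Customer', 'Supplier' and 'Name' names are hypothetical.

```xml
<!-- contacts.dtd: a hypothetical external DTD file -->
<!ENTITY % contactFields "Address, TelephoneNumber, E-mailAddress?">

<!-- The reference is expanded inside each content model -->
<!ELEMENT Customer (Name, %contactFields;)>
<!ELEMENT Supplier (Name, %contactFields;)>

<!ELEMENT Name (#PCDATA)>
<!ELEMENT Address (#PCDATA)>
<!ELEMENT TelephoneNumber (#PCDATA)>
<!ELEMENT E-mailAddress (#PCDATA)>
```

If the contact fields ever change, only the parameter-entity declaration needs editing, and both 'Customer' and 'Supplier' pick up the change.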


NOTATION DECLARATIONS

If you use a notation in your document, you have to declare it in your DTD. A notation declaration gives an external link to an application that can process data in that notation, or alternatively to a formal definition or specification of that notation. These declarations take the form '<!NOTATION NotationName SYSTEM "URI">'. I have included an example of this in the example document following.
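
Notation declarations also tie together with unparsed entities and ENTITY attributes. The following is a sketch only: the 'Picture' element, its 'source' attribute, and the use of the string "image/gif" as the notation's system identifier are assumptions, while the entity declaration is the NDATA example from the previous section.

```xml
<?xml version="1.0"?>
<!DOCTYPE Product [
  <!ELEMENT Product (ProductName, Picture)>
  <!ELEMENT ProductName (#PCDATA)>
  <!ELEMENT Picture EMPTY>
  <!ATTLIST Picture source ENTITY #REQUIRED>
  <!-- The notation named after NDATA must be declared -->
  <!NOTATION GIF SYSTEM "image/gif">
  <!ENTITY fishcake SYSTEM "http://www.fish-kingdom.com/aff.gif" NDATA GIF>
]>
<Product>
  <ProductName>Andy's Frozen Fishcakes</ProductName>
  <!-- The unparsed entity is named directly, with no ampersand or semicolon -->
  <Picture source="fishcake"/>
</Product>
```

The parser does not fetch or interpret the graphic itself; it passes the entity's name, location and notation on to the application, which decides what to do with them.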


EXAMPLE XML DOCUMENT

I have written a DTD for the example document in the last section. I include both together here as an example of a valid XML document.

<?xml version="1.0"?>

<!DOCTYPE ProductCatalogue [
<!ELEMENT ProductCatalogue (Product+)>
<!ELEMENT Product (ProductName, Slogan?, FishersPrice)>
<!ELEMENT ProductName (#PCDATA)>
<!ELEMENT Slogan (#PCDATA)>
<!ELEMENT FishersPrice (#PCDATA)>
<!NOTATION sterling SYSTEM
 "http://www.imf.org/external/np/tre/sdr/drates/0701.htm">
<!NOTATION dollars SYSTEM
 "http://www.imf.org/external/np/tre/sdr/drates/0701.htm">
<!ATTLIST FishersPrice currency NOTATION (sterling | dollars) #REQUIRED>
]>

<ProductCatalogue>
	<Product>
		<ProductName>Andy's Frozen Fishcakes</ProductName>
		<Slogan>'The Family Favourite'</Slogan>
		<FishersPrice currency="sterling">1.99</FishersPrice>
	</Product>
	<Product>
		<ProductName>Andy King's King Prawns</ProductName>
		<Slogan>'A Right Royal Treat'</Slogan>
		<FishersPrice currency="sterling">3.99</FishersPrice>
	</Product>
	<Product>
		<ProductName>King Trout</ProductName>
		<Slogan>'A Right Royal Trout'</Slogan>
		<FishersPrice currency="sterling">2.99</FishersPrice>
	</Product>
	<Product>
		<ProductName>Andy's Plaice</ProductName>
		<Slogan>'Hungry? Go To Andy's Plaice'</Slogan>
		<FishersPrice currency="sterling">2.99</FishersPrice>
	</Product>
	<Product>
		<ProductName>The King's Fingers</ProductName>
		<Slogan>'They Come In 'Andy'</Slogan>
		<FishersPrice currency="sterling">0.99</FishersPrice>
	</Product>
</ProductCatalogue>

You should now be able to construct your own valid XML document. If you work through the lists of options for each declaration, writing a DTD is reasonably straightforward. You may be glad to know that we have now finished our technical explanation of XML. If you are ready, let us move on to the next section where we introduce related specifications (or if you would prefer, you can go back to the last section).



XML: the future of web markup?,
MSc in Information Management, 1998/1999
© University of Sheffield - Department of Information Studies (All Rights Reserved)