Extensible Markup Language (XML) is a language that can be used to exchange data between businesses or between systems within a business. It is similar to HTML, the markup language used to create Web pages, but is more powerful. HTML is concerned primarily with formatting a document; XML addresses the problem of sharing data when users have different computer systems and software or different database management systems (for example, one company using Oracle and another using IBM’s DB2). If everyone used the same software or database management system, there would be little need for XML.
Once an XML document has been created, the data may be transformed into a number of different output formats and displayed in many different ways, including printed output, Web pages, output for a handheld device, and portable document format (PDF) files. Thus, the document’s data content is separated from the output format. The XML content is defined once as data and then transformed as many times as necessary.
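The idea of defining content once and transforming it many times can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the sample data, the tag names, and the helper functions to_text and to_html are hypothetical, and a production system would more likely use a dedicated transformation tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample document; tag names and values are assumptions.
XML_SOURCE = """<customers>
  <customer number="C1001">
    <name>
      <lastname>Garcia</lastname>
      <firstname>Maria</firstname>
    </name>
    <current_balance>125.50</current_balance>
  </customer>
</customers>"""


def full_name(customer):
    """Join first and last name from the <name> structure."""
    name = customer.find("name")
    return f"{name.findtext('firstname')} {name.findtext('lastname')}"


def to_text(root):
    """Render the content as plain text lines, e.g. for printed output."""
    return "\n".join(
        f"{c.get('number')}: {full_name(c)} owes {c.findtext('current_balance')}"
        for c in root.findall("customer")
    )


def to_html(root):
    """Render the same content as an HTML table for a Web page."""
    rows = "".join(
        f"<tr><td>{escape(c.get('number'))}</td>"
        f"<td>{escape(full_name(c))}</td>"
        f"<td>{escape(c.findtext('current_balance'))}</td></tr>"
        for c in root.findall("customer")
    )
    return f"<table>{rows}</table>"


# The XML is parsed once; each function is a separate output format.
root = ET.fromstring(XML_SOURCE)
text_output = to_text(root)
html_output = to_html(root)
```

Both outputs are produced from the same parsed document, so adding another format means adding another rendering function rather than changing the data.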
The advantage of using an XML document is that the analyst may select only the data that an internal department or external partner needs to have in order to function. This helps to ensure the confidentiality of data. For example, a shipping company may receive only the customer name, the address, the item number, and the quantity to ship, but not credit card information or other financial data. This efficient approach also cuts down on information overload.
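The shipping example can be sketched as a simple filter that copies only the permitted elements into a new document. The element names, the sample values, and the SHIPPING_FIELDS list below are assumptions made for illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal order record that includes financial data.
ORDER_XML = """<order>
  <customer_name>Maria Garcia</customer_name>
  <address>12 Elm Street, Springfield</address>
  <item_number>A-204</item_number>
  <quantity>3</quantity>
  <credit_card_number>4111111111111111</credit_card_number>
</order>"""

# Elements the shipping partner is allowed to receive (an assumed list).
SHIPPING_FIELDS = ("customer_name", "address", "item_number", "quantity")


def shipping_view(order_xml):
    """Build a new document containing only the permitted elements."""
    source = ET.fromstring(order_xml)
    shipment = ET.Element("shipment")
    for tag in SHIPPING_FIELDS:
        child = source.find(tag)
        if child is not None:
            ET.SubElement(shipment, tag).text = child.text
    return ET.tostring(shipment, encoding="unicode")


shipping_xml = shipping_view(ORDER_XML)
```

Listing the permitted elements explicitly, rather than listing the ones to remove, means that any element added to the internal record later is withheld from the partner by default.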
XML therefore is a way to define, sort, filter, and translate data into a universal data language that can be used by anyone. XML may be created from databases, a form, or software programs, or it may be keyed directly into a document, text editor, or XML entry program.
The data dictionary is an ideal starting point for developing XML content. The key to using XML is creating a standard definition of the data. This is accomplished by using a set of tags or data names that are included before and after each data element or structure. The tags become the metadata, or data about the data. Data may be further subdivided into smaller elements and structures until all elements are defined. XML elements may also include attributes, which are additional pieces of data placed within the opening tag that describe something about the XML element.
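As a rough sketch of how a data dictionary structure maps onto tags, the following Python code turns a nested dictionary into nested XML elements and places one piece of data in an attribute. The function build_element, the record contents, and the attribute name number are hypothetical choices for this example.

```python
import xml.etree.ElementTree as ET


def build_element(tag, value, attributes=None):
    """Create an element; dictionary values become nested child elements."""
    element = ET.Element(tag, attributes or {})
    if isinstance(value, dict):
        # A structure: each entry becomes a child element.
        for child_tag, child_value in value.items():
            element.append(build_element(child_tag, child_value))
    else:
        # An elementary data item: store it as the element's text.
        element.text = str(value)
    return element


# Hypothetical record mirroring part of a customer data dictionary entry.
customer_record = {
    "name": {"lastname": "Garcia", "firstname": "Maria"},
    "current_balance": "125.50",
}

# The customer number is carried as an attribute rather than a child element.
customer = build_element("customer", customer_record, {"number": "C1001"})
customer_xml = ET.tostring(customer, encoding="unicode")
```

Each structure in the record becomes a pair of opening and closing tags around its children, which is why the resulting document tends to follow the shape of the data dictionary.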
The figure below illustrates a data dictionary containing customer, order, and payment information. The overall collection of customers is included in what is called the root element, customers. An XML document may contain only one root element, so it is often the plural of the data contained in the XML document. Each customer may place many orders. The structure is defined in the two left columns, and the XML code appears on the right. CUSTOMER, as you can see, consists of a NAME, ADDRESS, CURRENT BALANCE, multiple ORDER INFORMATION entries, and a PAYMENT. Some of these structures are further subdivided.
The XML document tends to mirror the data dictionary structure. The first entry (other than an XML line identifying the document) is <customer>, which defines the entire collection of customer information. The less than (<) and greater than (>) symbols are used to identify tag names (similar to HTML). The last line of the XML document is a closing tag, </customer>, signifying the end of the customer information.
Customer is defined first and contains an attribute, the customer number. There is often a discussion about whether data should be stored as an element or an attribute. In this case, the customer number is stored as an attribute.
The name tag, <name>, is defined next because it is the first entry in the data dictionary. NAME is a structure consisting of LAST NAME, FIRST NAME, and an optional MIDDLE INITIAL. In the XML document, this structure starts with <name> and is followed by <lastname>, <firstname>, and <middle_initial>. Because spaces are not allowed in XML tag names, an underscore is typically used to separate words. The closing </name> tag signifies the end of the group of elements. Using a structure such as <name> saves time and coding if the transformation displays the full name. The child elements will appear on one line, separated by spaces. Name also contains an attribute, either I for individual or C for corporation.
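The following Python sketch shows how a transformation might use the name structure as a unit, placing its child elements on one line separated by spaces. The attribute name type, the sample values, and the function display_name are assumptions; the text specifies only that the attribute holds I or C.

```python
import xml.etree.ElementTree as ET

# Hypothetical name structure; the attribute name "type" is an assumption.
NAME_XML = """<name type="I">
  <lastname>Garcia</lastname>
  <firstname>Maria</firstname>
  <middle_initial>L</middle_initial>
</name>"""

# Meanings of the attribute values described in the text.
NAME_TYPES = {"I": "individual", "C": "corporation"}


def display_name(name):
    """Join the child elements, in document order, separated by spaces."""
    parts = [(child.text or "").strip() for child in name]
    return " ".join(part for part in parts if part)


name = ET.fromstring(NAME_XML)
full_name = display_name(name)
name_type = NAME_TYPES[name.get("type")]

# The optional middle initial can simply be left out of the structure.
short_name = display_name(
    ET.fromstring("<name type='I'><lastname>Lee</lastname>"
                  "<firstname>Sam</firstname></name>")
)
```

Because the code processes whatever children the structure contains, an omitted middle initial needs no special handling.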
Indentation is used to show which structures contain elements. Note that <address> is similar to <customer>, but when we get to <order_information> there is a big difference.
There are multiple entries for <order_information>, each containing an <order_number>, <order_date>, <shipping_date>, and <total>. Because the payment is made either by check or credit card, only one of these may be present. In our example, payment is by check. The dates have an attribute called format that indicates whether the date appears as month, day, year; year, month, day; or day, month, year. If a credit card is used to make a payment, a TYPE attribute contains either an M, V, A, D, or an O indicating the type of credit card (MasterCard, Visa, and so on).
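A program that reads such a document has to handle the repeating order entries, interpret each date according to its format attribute, and confirm that exactly one payment method is present. The Python sketch below shows one way this might look. The format values (mmddyyyy, yyyymmdd, ddmmyyyy), the date separators, the check_number element, and the credit_card tag name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical customer fragment with two orders and a payment by check.
CUSTOMER_XML = """<customer number="C1001">
  <order_information>
    <order_number>5001</order_number>
    <order_date format="yyyymmdd">2024-03-15</order_date>
    <shipping_date format="yyyymmdd">2024-03-18</shipping_date>
    <total>80.00</total>
  </order_information>
  <order_information>
    <order_number>5002</order_number>
    <order_date format="mmddyyyy">04/02/2024</order_date>
    <shipping_date format="mmddyyyy">04/05/2024</shipping_date>
    <total>45.50</total>
  </order_information>
  <payment>
    <check>
      <check_number>1187</check_number>
    </check>
  </payment>
</customer>"""

# Assumed format attribute values mapped to strptime patterns.
DATE_FORMATS = {
    "mmddyyyy": "%m/%d/%Y",
    "yyyymmdd": "%Y-%m-%d",
    "ddmmyyyy": "%d/%m/%Y",
}

# The text names only these two card types; other codes are left as-is.
CARD_TYPES = {"M": "MasterCard", "V": "Visa"}


def parse_date(element):
    """Interpret the element text according to its format attribute."""
    pattern = DATE_FORMATS[element.get("format")]
    return datetime.strptime(element.text, pattern).date()


def payment_method(customer):
    """Return the payment method, requiring exactly one to be present."""
    payment = customer.find("payment")
    if payment is None:
        raise ValueError("customer has no payment element")
    methods = [c for c in payment if c.tag in ("check", "credit_card")]
    if len(methods) != 1:
        raise ValueError("payment must contain exactly one method")
    method = methods[0]
    if method.tag == "credit_card":
        code = method.get("type")
        return f"credit card ({CARD_TYPES.get(code, code)})"
    return "check"


customer = ET.fromstring(CUSTOMER_XML)
orders = customer.findall("order_information")  # repeating entries
order_dates = [parse_date(o.find("order_date")) for o in orders]
method = payment_method(customer)
```

Storing the date layout in an attribute lets each date be read correctly without guessing, which matters when partners in different countries exchange the same document.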