Data Elements (The Data Repository)

Each data element should be defined once in the data dictionary and may also be entered previously on an element description form, such as the one illustrated in the figure below.

An element description form example from World’s Trend Catalog Division.

Characteristics commonly included on the element description form are the following:

Element ID. This optional entry allows the analyst to build automated data dictionary entries.
The name of the element. The name should be descriptive, unique, and based on what the element is commonly called in most programs or by the major user of the element.
Aliases, which are synonyms or other names for the element. Aliases are names used by different users in different systems. For example, a CUSTOMER NUMBER may also be called a RECEIVABLE ACCOUNT NUMBER or a CLIENT NUMBER.
A short description of the element.
Whether the element is base or derived. A base element is one that is initially keyed into the system, such as a customer name, address, or city. Base elements must be stored in files. Derived elements are created by processes as the result of a calculation or a series of decision-making statements.
The length of an element. Some elements have standard lengths. In the United States, for example, lengths for state name abbreviations, zip codes, and telephone numbers are all standard. For other elements, the lengths may vary, and the analyst and user community must jointly decide the final length based on the following considerations:If the element is too small, the data that need to be entered will be truncated. The analyst must decide how that will affect the system outputs. For example, if a customer’s last name is truncated, mail would usually still be delivered; if an email address is truncated, however, it will be returned as not found.
1. Numeric amount lengths should be determined by figuring the largest number the amount will probably contain and then allowing reasonable room for expansion. Lengths designated for totals should be large enough to accommodate the sum of the numbers accumulated in them.
2. Name and address fields may be given lengths based on the following table. For example, a last name field of 11 characters will accommodate 98 percent of the last names in the United States.
3. For other fields, it is often useful to examine or sample historical data found in the organization to determine a suitable field length.
  Field Length Percentage of Data That Will Fit (U.S.)
  Last Name 11 98
  First Name 18 95
  Company Name 20 95
  Street 18 90
  City 17 99

Field	Length	Percentage of Data That Will Fit (U.S.)
Last Name	11	98
First Name	18	95
Company Name	20	95
Street	18	90
City	17	99

The type of data—numeric, date, alphabetic, varchar, or character, which is sometimes called alphanumeric or text data. Varchar data may contain any number of characters, up to a limit set by the database software. When using varchar, specifying the length is optional. Several of these formats are shown in the table illustrated below. Character fields may contain a mixture of letters, numbers, and special characters. If the element is a date, its format—for example, MMDDYYYY—must be determined. If the element is numeric, its storage type should be determined.

Data Type	Meaning
Bit	A value of 1 or 0, a true/false value
Char, varchar, text	Any alphanumeric character
Datetime, smalldatetime	Alphanumeric data, several formats
Decimal, numeric	Numeric data that are accurate to the least significant digit; can contain a whole and decimal portion
Float, real	Floating-point values that contain an approximate decimal value
Int, smallint, tinyint	Only integer (whole digit) data
Currency, money, smallmoney	Monetary numbers accurate to four decimal places
Binary, varbinary, image	Binary strings (sound, pictures, video)
Cursor, timestamp, uniqueidentifier	A value that is always unique within a database
Autonumber	A number that is always incremented by one when a record is added to a database table

Personal computer formats, such as currency, number, or scientific, depend on how the data will be used. Number formats are further defined as integer, long integer, single precision, double precision, and so on. There are many other types of formats used with PC systems. Unicode is a standardized coding system for defining graphic symbols, such as Chinese or Japanese characters. Unicode is described in greater detail in a later chapter. There are three standard formats for mainframe computers: zoned decimal, packed decimal, and binary. The zoned decimal format is used for printing and displaying data. The packed decimal format is commonly used to save space on file layouts and for elements that require a high level of arithmetic to be performed on them. The binary format is suitable for the same purposes as the packed decimal format but is less commonly used.

Input and output formats should be included, using special coding symbols to indicate how the data should be presented. These symbols and their uses are illustrated in the table below. Each symbol represents one character or digit. If the same character repeats several times, the character followed by a number in parentheses indicating how many times the character repeats is substituted for the group. For example, XXXXXXXX would be represented as X(8).

Formatting Character	Meaning
X	May enter or display/print any character
9	Enter or display only numbers
Z	Display leading zeros as spaces
,	Insert commas into a numeric display
·	Insert a period into a numeric display
/	Insert slashes into a numeric display
-	Insert a hyphen into a numeric display
V	Indicate a decimal position (when the decimal point is not included)

Validation criteria for ensuring that accurate data are captured by the system. Elements are either discrete, meaning they have certain fixed values, or continuous, with a smooth range of values. Here are common editing criteria:
1. A range of values is suitable for elements that contain continuous data. For example, in the United States a student grade point average may be from 0.00 through 4.00. If there is only an upper or lower bound to the data, a limit is used instead of a range.
2. A list of values is indicated if the data are discrete. Examples are codes representing the colors of items for sale in World’s Trend’s catalog.
3. A table of codes is suitable if the list of values is extensive (for example, state abbreviations, telephone country codes, or U.S. telephone area codes.)
4. For key or index elements, a check digit is often included.
Any default value the element may have. The default value is displayed on entry screens and is used to reduce the amount of keying that the operator may have to do. Usually, several fields in each system have default values. When using GUI lists or drop-down lists, the default value is the one currently selected and highlighted. When using radio buttons, the option for the default value is selected, and when using check boxes, the default value (either “yes” or “no”) determines whether or not the check box will have an initial check in it.
An additional comment or remarks area. This might be used to indicate the format of the date, special validation that is required, the check digit method used (explained in Chapter 15), and so on.

Data element descriptions such as CUSTOMER NUMBER may be called CLIENT NUMBER elsewhere in the system (perhaps old code written with this alias needs to be updated). Another kind of data element is an alphabetic element. At World’s Trend Catalog Division, codes are used to describe colors: for example, BL for blue, WH for white, and GR for green. When this element is implemented, a table will be needed for users to look up the meanings of these codes. (Coding is discussed further in Chapter 15.)

Related: