Enjoying Database Administration

20.1.09

So you think you're normal... Part 2 First Things First

First Things First


The first normal form is something I might dare accuse a table of not conforming to without a formal investigation. This is because the first normal form tends to eliminate certain characteristics. Different sources vary on specifically what comprises the first normal form verbatim, and that can lead to interesting interpretations.

One such interpretation comes from interpretations of the first normal form such that no field may contain multiple values; each tuple attribute coordinate must contain only one value. This can often lead to the creation of additional columns rather than additional rows. This is of course due to the normalization mantra of reducing redundancy. Moving into the first normal form on a table containing multiple valued cells does necessitate row level redundancy, and this can seem counter-inuitive especially in the mindset of redundancy diminution and a fervent belief in the innate superiority of single columned (non-composite) keys, trending towards the arbitrarily determinant (i.e. data divorced from deployment derivation). I prefer to think of this part of the first normal form as the nesting and expansion phase.

col1val1:col2val1:col3val1,col3val2,col3val3 \
col1val2:col2val2:col3valN \

becomes

col1val1:col2val2:col3val1 \
col1val1:col2val2:col3val2 \
col1val1:col2val2:col3val3 \
col1val2:col2val2:col3valN \

Another aspect of the first normal form is the need for a unique key for each row. Some column or combination of columns must have a unique value for each row. A unique key is useful for mitigating against update anomalies. This is what I tend to refer to as keying into the first normal form.

This is accomplished by the creation of a surrogate key or identifying a natural key. In most cases the surrogate is the simplest solution to creating a unique key for the row. Surrogate keys are constructs of the database. They are consequences of the database. They are not born from the data. They are often numbers auto incremented for each row. I find dates and datetimes can often be combined with foreign keys to diminish the need for reliance on surrogate keys, as surrogates can become nuisance columns that lead to violations of the ACID principles. I tend to prefer to limit the number of surrogate keys in a database by building schemas around surrogate keys.

As you may have noticed I have something of an aversion to surrogate keys. The reason for this is that they tend to introduce arbitrary dependencies which can complicate later normal forms as well as add a layer of ambiguity to the actual relations between the actual data. In the above normalization example we might achieve uniqueness through a combination of columns.

This is not to say that I am preferential to natural keys or even to multiple column keys. It does however lead into my philosophy on normalization. I believe that it is important to establish as unique keys either a single column or a collection of columns prior to the introduction of a surrogate key. By documenting the natural keys, you are establishing a valuable element to the data dictionary, you are clarifying the relational dependencies native to the data. This is a valuable order of key determination that you will undoubtedly find useful as you progress through the normalization of your tables.

In our nesting and expansion phase we got this as a result:

col1val1:col2val2:col3val1 \ col1val1:col2val2:col3val2 \ col1val1:col2val2:col3val3 \ col1val2:col2val2:col3valN \

We clearly cannot use any single column save for perhaps the third one as a unique key as they repeat across rows. We will labor under the assumption that the business logic precludes the use of column three as a unique key alone because it was in fact the nested valued cell.

Searching these columns shows us repetition on Col2 which might indicate it should be off on another table giving us two tables:

Table 1: col1val1:Tab2FKref:col3val1\ ... and Table 2: col1PK:col2Val1 \ ...

In this case we worked under the assumption that col2 values were smaller in size than the foreign key surrogate that referenced it (This is a practical consideration relevant to disk size. That is to say, if the values in the column are smaller than a key to reference them, they might be best left denormalized for practical purposes, though this is a consideration for higher normal forms). We might decide it best to reference the FK in a m:n relationship something like this:

Table 1: col1PK:col2val1:col3val1 \ ... Table 2: col1PK:col2Val1 \ ... TableCross 1, 2 : tab1fkM:tab2fkN \ ...

Using intermediary tables to achieve M:N-ness, accurately M:X and X:N-nesses, is generally advisable as it allows referential integrity to be preserved in a slightly less laborious manner. In short the first normal form is comprise of two parts, the nesting and expansion phase, and the key determination phase.

No comments:

Post a Comment