Enjoying Database Administration
If a bounded surface of infinite area were to be deformed to a lamina would it still be a bounded surface?
What, if any, is the relationship between the area of a bounded surface and its bending curvature?
If a prevailing pattern of inclusion exists and effects the bending curvature of the surface, does such a prevailing pattern of inclusion deform such as to always realize a change in a particular direction?
How many dimensions must be included in a proof to show that holes are not areas of infinite inclusion? Is a hole in a surface an example of infinities canceling each other?
How does one represent a minimal space with mustard?
If bread could be shown to be a bounded surface of infinite area... err... In the physical world described by mathematics surfaces are non-existent, but none-the-less prevail as perceptual conveniences derived from chiral matrices refracting in approximation of the illusion that is reflectivity.
I decided to remedy my ignorance and set forth to find an irrefutably convenient source on the topic with which to shed light upon the darkness of my recently illuminated ignorance. For simplicity sake I'll tell you that the answer is complicated. Apparently in the general case the y BETWEEN z and x statement is interpreted as x >= y >= z, also known as x >=y AND y >= z (I'm pretty sure that my x,y, z's are straight. I like BETWEEN. It is so much more natural and my nagging doubts are so much more easily satisfied with it). Unfortunately the story doesn't end thÆre. It seems that if any of the values in the BETWEEN predicate are taken from columns containing characters from a different character set the predicate will evaluate in Unicode.
On a logical side note I am pretty sure that the conjunctions implied in the BETWEEN and NOT BETWEEN predicates are arrived at by variations on De Morgan's Laws. I'll leave the formalism for another day, as I am still groping with a particular known unknown in the realm of logic. Specifically I am at a loss for the names of the technical fallacies incurred in the use of Boolean derived antecedent propositions in Aristotelian argumentative frameworks. I know it is derived from the interplay between the Law of the Excluded Middle generally present in Non-Hegelian formalisms and the Fallacy of the False Dichotomy. I just am not motivated to wade through endless primers in hopes of finding one source which may hint at the location of an authoritative study of trans-framework logical fallacies.
The senior DBA was perplexed, and after perhaps an hour of muttering swear words under his breath, set about to perform tasks other than staring at the directories and navigating around knowing that the script executed and moved the files but they weren't there. Among these tasks was deleting some files in DB2 managed directories. Wouldn't you know after DB2 managed files were deleted from the DB2 managed directories the files created by the scripts magically appeared.
What happened? Well it seems that these script were being moved into DB2 managed directories by AIX, and not by DB2. At least as far I can surmise AIX does not trigger the update of the file tables managed by DB2, and consequently the files remained "unregistered" in their directories until DB2 was stimulated to update the directory tables.
SELECT CONCATENATE ALL FROM [SUBBSELECT TABLE] WITH [PADDING DELIMITER '
Wouldn't it be just great to concatenate all from a sub-select or table with padding or a named delimiter? Sure if you wanted to get into the nitty gritty WITH PADDING would need a list along with it like:
SELECT CONCATENATE ALL FROM cte_NamedCTE WITH PADDING (col1 LEFT '0', col2 RIGHT '#', col3 RIGHT '-', col4 LEFT 'X')
I sure think it would be boffo and uber, if that sort of statement were included in the SQL standard, or the SQL variant of the vendor that pays me for an endorsement.
The first normal form is something I might dare accuse a table of not conforming to without a formal investigation. This is because the first normal form tends to eliminate certain characteristics. Different sources vary on specifically what comprises the first normal form verbatim, and that can lead to interesting interpretations.
One such interpretation comes from interpretations of the first normal form such that no field may contain multiple values; each tuple attribute coordinate must contain only one value. This can often lead to the creation of additional columns rather than additional rows. This is of course due to the normalization mantra of reducing redundancy. Moving into the first normal form on a table containing multiple valued cells does necessitate row level redundancy, and this can seem counter-inuitive especially in the mindset of redundancy diminution and a fervent belief in the innate superiority of single columned (non-composite) keys, trending towards the arbitrarily determinant (i.e. data divorced from deployment derivation). I prefer to think of this part of the first normal form as the nesting and expansion phase.
Another aspect of the first normal form is the need for a unique key for each row. Some column or combination of columns must have a unique value for each row. A unique key is useful for mitigating against update anomalies. This is what I tend to refer to as keying into the first normal form.
This is accomplished by the creation of a surrogate key or identifying a natural key. In most cases the surrogate is the simplest solution to creating a unique key for the row. Surrogate keys are constructs of the database. They are consequences of the database. They are not born from the data. They are often numbers auto incremented for each row. I find dates and datetimes can often be combined with foreign keys to diminish the need for reliance on surrogate keys, as surrogates can become nuisance columns that lead to violations of the ACID principles. I tend to prefer to limit the number of surrogate keys in a database by building schemas around surrogate keys.
As you may have noticed I have something of an aversion to surrogate keys. The reason for this is that they tend to introduce arbitrary dependencies which can complicate later normal forms as well as add a layer of ambiguity to the actual relations between the actual data. In the above normalization example we might achieve uniqueness through a combination of columns.
This is not to say that I am preferential to natural keys or even to multiple column keys. It does however lead into my philosophy on normalization. I believe that it is important to establish as unique keys either a single column or a collection of columns prior to the introduction of a surrogate key. By documenting the natural keys, you are establishing a valuable element to the data dictionary, you are clarifying the relational dependencies native to the data. This is a valuable order of key determination that you will undoubtedly find useful as you progress through the normalization of your tables.
In our nesting and expansion phase we got this as a result:
col1val1:col2val2:col3val1 \ col1val1:col2val2:col3val2 \ col1val1:col2val2:col3val3 \ col1val2:col2val2:col3valN \
We clearly cannot use any single column save for perhaps the third one as a unique key as they repeat across rows. We will labor under the assumption that the business logic precludes the use of column three as a unique key alone because it was in fact the nested valued cell.
Searching these columns shows us repetition on Col2 which might indicate it should be off on another table giving us two tables:
Table 1: col1val1:Tab2FKref:col3val1\ ... and Table 2: col1PK:col2Val1 \ ...
In this case we worked under the assumption that col2 values were smaller in size than the foreign key surrogate that referenced it (This is a practical consideration relevant to disk size. That is to say, if the values in the column are smaller than a key to reference them, they might be best left denormalized for practical purposes, though this is a consideration for higher normal forms). We might decide it best to reference the FK in a m:n relationship something like this:
Table 1: col1PK:col2val1:col3val1 \ ... Table 2: col1PK:col2Val1 \ ... TableCross 1, 2 : tab1fkM:tab2fkN \ ...
Using intermediary tables to achieve M:N-ness, accurately M:X and X:N-nesses, is generally advisable as it allows referential integrity to be preserved in a slightly less laborious manner. In short the first normal form is comprise of two parts, the nesting and expansion phase, and the key determination phase.
Database normalization is an important aspect of database design, something that is generally done by a data architect, and not as often by a database administrator, much to my delight and chagrin. The rules of normalization are part of the package of database minutiae that I found so intriguing and delightful (pardon the hyperbole, but I find it apropos to style) as to cogitate upon that I decided to become a database administrator. They are of course, rules appropriate to database design, and therefore, not necessarily a business imperative. To be certain data anomalies cost wage hours to repair, and can complicate querying, as well add resource overhead to your database operation. Having justified the time and expense in normalizing your database (or more likely some section of it, perhaps in a new schema as your design has in some areas outgrown its table schema) you are ready to start making somewhat dramatic changes to your database for the better.
In my opinion one of the most important issues to be mindful of when beginning to normalize your design, besides backing up your database (which you should be doing anyway) and design documentation for the existing design, and of course that you should solidify and verify your designs before making any changes to the database, is that normalization should be considered in terms of tables first. Deciding whether or not your entire database, or simply an entire schema for that matter is in any particular normal form, is kind of like deciding whether the sky is a particular shade of cerulean or azure, the answer can, ad will often be, in some places yes, and in others no.
Normalization can be oversimplified by saying that it is breaking tables into more tables. When I look at normalization I usually look for the opportunity to break relations off of tables. But of course this is more of a high level scan for a particular table needing a more formal inquiry into its normality. At this stage I might say I suspect that this table is not fully normalized.
An important aspect of normalization is the minimization and elimination of the potential for data modification anomalies. I tend to think of normalizing a table in terms of these anomalies, as well as in terms of many-to-many relations that can be formed of the type Table1:Key1Key2:Table2, or Subject:Predicate:Object (A great metaphor for the real world relation represented).
The three most important modification anomalies are insert, update, and deletion anomalies. Each of these anomalies corresponds roughly to a particular normal form. The first normal form tends to eliminate update anomalies, the second normal form pertaining to insertion anomalies, and the third normal form deals with deletion anomalies. Of course, there is overlap between the normal forms and relevant anomalies.
An insertion anomaly can be avoided by allowing NULLS, this is to be discouraged, as doing so circumvents first form normalization and consequently prevents all other normal forms from being realized as each is dependent on the previous in a cascading fashion. That is to say that a table in the second normal form is in the first, and a table in the third is in the second, and so on.
In later posts I will expand on the themes introduced here.
The primary problem I ran across was an error that I found most perplexing (a problem with DB2 errors in general), but not one that was merely ambiguous as if often the case with DB2 SQLSTATE errors. No this was an altogether more frustrating animal. I was told by DB2 that REPLACE was not a valid function. What was even more frustrating was that a quick verbatim Google search yielded only seven results none of which was even remotely related to the problem. Also threatening to my sanity was the failure of SUBSTR to work in this situation. Imagine my dismay to find something so magnificently eloquent as:
SUBSTR(REPLACE(REPLACE(REPLACE(Tab.Col, '$', ''), ',', ''), 1, 7) REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(SUBSTR(REPLACE(REPLACE(REPLACE(Tab.Col, '$', ''), ',', ''), 1, 7), '0', 'A'), '1', 'B'), '2', 'C'), '3', 'D'), '4', 'E'), '5', 'F'), '6', 'G'), '7', 'H'), '8', 'I'), '9', 'J')
Did not work.
I was afraid I was going to have to wrestle a beast of a case statement; I just couldn't understand why this wouldn't work. It was perfect in its subtle avoidance of excessive reads, elegant and refined in a way I am too humble to admit, and worse than it not working at all, it only worked for deductions and not for premium. Why god, Why? I nearly mumbled to myself as a sat at my workstation sulking.
It wasn't until I had patted myself on the back for trying to get ahead of the project by going off for some coffee hoping that on this epiphany I would be granted revelation, if only some small glimmer of truth flickering in the corner of my mind's eye to remind me that I am only a junior DBA who forgets simple lessons from school and not a madman who writes scripts to block his SQL from working according to some predictable standard.
As I sat there watching the black Colombian go juice java drip from the filter assembly, I was struck by inspiration. I rushed back to my terminal and without hesitation typed into the search bar "DB2 command 'valid by data type'." Of course I didn't find anything useful. DAMN! I nearly screamed out, why isn't there a listing of commands sorted by valid data type. The thought rolled around in my head as the painfully obvious solution sat the staring blankly back at me blinking with wonder at my inability to fathom it.
Eventually it dawned on that among countless other situations in the world of data where the bidirectionality of a particular notion leads to folly, the reason I had crafted (or copied I don't specifically recall each and every second before the blog was created, or most of them still after for that matter) REPLACE(REPLACE(REPLACE(Tab.Col, '$', ''), ',', '') was to provide compatibility between a string and a decimal representation of a dollar amount, but specifically to make a string compatible with a decimal, and not a decimal compatible with a string. As you probably figured that inner replacement nest needs replacing, and since DB2 doesn't have support for regular expressions (a gripe tangential to my error down an avenue I glanced at longingly) , I was going to have to edit the code rather than write a scriplet to do it for me. The end result was something akin to this:
SUBSTR(CAST(Tab.Col AS CHAR(8)), 1, 7) REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(SUBSTR(CAST(Tab.Col AS CHAR(8)), 1, 7), '0', 'A'), '1', 'B'), '2', 'C'), '3', 'D'), '4', 'E'), '5', 'F'), '6', 'G'), '7', 'H'), '8', 'I'), '9', 'J')
The lesson I have learned from this, is this: If I find myself so lacking in the brain power to solve a problem not present with every instance of a code snippet that I am contemplating resigning myself to stupidity and taking up drooling on myself, I should check to see if the command/keywords I am using are appropriate to the data type (often one of the first lessons in a book on SQL), because there are things that just aren't done like adding J to 8 and getting i^2.
SELECT * FROM SYSCAT.COLUMNS WHERE TABSCHEMA = 'SchemaName';
This will display information about all of the columns in SchemaName.
SELECT * FROM SYSCAT.TABLES WHERE TABSCHEMA = 'SchemaName';
This will display information about all of the tables in SchemaName.
SELECT * FROM SYSCAT.INDEXES WHERE TABSCHEMA = 'SchemaName';
This will display information about all of the indexes in SchemaName.
I will address each of these SYSCAT tables in depth in the days and weeks to come.
Like views MQTs can improve the performance of complex aggregates and crosstable joins. In addition an MQT can include any valid select construct including such things as UNION, INTERSECT, or EXCEPT. MQTs can also realize benneficial functionality with partitioned environments and in data warehousing/BI dimensioning.
A simple MQT can be created similiar to the following:
CREATE TABLE TableName AS
SELECT a.ThisCol, a.ThatCol, a.SomeCol, b.EvenCol, b.OddCol
FROM ThisTable a, ThatTable b
WHERE a.SomeCol = b.SameCol
AND a.ThisCol > 5
SELECT c.ThisCol, c.ThatCol, c.SomeCol, d.EvenCol, d.OddCol
FROM ThisTable c, ThatTable d
WHERE c.SomeCol = d.SameCol
AND b.EvenCol IS NULL
DATA INITIALLY DEFERRED
By specifying the refresh immediate I have made this a system managed MQT on which the data is refreshed as soon as changes are made. This preclude the usage of the MQT for updating underlying tables, but diminishes the personnel overhead associated with managing the database object.
In the case of a full select statement, the FOR READ ONLY prevents modification of the data in the results set. This of course means that you cannot use this clause with an UPDATE or DELETE statement, as those statements call for modification of the data which is prevented by the statement.
In the case of cursors this clause can prevent the inadvertent modification of the table data where it is only desired to modify the data for output.
Lunch converstation ranged across numerous topics including some future demonstration of DB2 Performance Expert and the bennefits we enjoy as an IBM Premier Partner. I am looking forward to expanding my abilities as a DB2 DBA using the Partner Portal tools. We also came across the topic of views and materialized query tables.
Materialized Query Tables are analogous to views in that they are database objects manifested from queries. They pre-populate with data from the underlying select statement and can provide (particularly in the case of system managed MQTs) dramatic improvements in the performance of aggregated values containing queries. A great thing about the MQTs is that users do not need to know about there existence as DB2 will process the optimization plan inclusive of the MQT.
Ultimately, I was inspired to this by a particular extract in which I wanted to select a yes for certain values and a no for certain other values, not producing multiple rows despite the fact that the criteria for selecting yes into the field is only satisfied when another previous row has satisfied the criteria for no, the law of the excluded middle seeming to neccesitate aggregation where scalaricity is more desirable.
I believe a work-around would involve introducing a MAX to the yes or no logic such that y being greater than no yes would supercede no as a scalar output to the aggregated input. None-the-less I have become fascinated (relatively) with the ODBC reserved keyword FOR.
Here is a nifty tool for checking whether a word is a reserved keyword.
iCalendar Source: Microsoft® Office Outlook® 2007 (12.0.6316.5000) SP1 MSO (12.0.6320.5000)
OS Name: Microsoft® Windows Vista™ Home PremiumVersion 6.0.6001 Service Pack 1 Build 6001
Other OS Description: Not AvailableOS Manufacturer Microsoft Corporation
System Name: irrelevant
System Manufacturer: Dell Inc.System
System Type: X86-based
PCProcessor: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz, 2000 Mhz, 2 Core(s), 2 Logical Processor(s)
BIOS Version/Date: Dell Inc. A06, 2/5/2007
SMBIOS: Version 2.4
System Abstraction Layer Version = "6.0.6001.18000"
Installed Physical Memory (RAM): 2.00 GBTotal Physical Memory 1.99 GB
Available Physical Memory: 820 MB
Total Virtual Memory: 4.21 GB
Available Virtual Memory: 2.31 GB
Page File Space: 2.28 GB
OS Name: Windows Mobile® 6 Standard CE OS 5.2.1236 (Build 177126.96.36.199)
Radio Version: irrelevant (non-radio synch)
CE ROM Versions: Same as build
Available Storage: 44.45 MB
Total Storage: 58.11 MB + <= 2GB
Available Memory: 10.49 MB
Total Memory: 43.68 MB
Processor: OMAP850 201 MHz
The name of this blog is derived from a particular configuration parameter found in DB2. In DB2 9 this word play is extended somewhat by the fact that the SHEAPTHRES_SHR is a "soft" limit. SHEAPTHRES_SHR is IBM 1337 for "Shared Sort Heap Threshold", and represents the maximum size in memory for heaping data to be sorted in a shared manner as opposed to private.