Database Design: Handling Unstructured Data
When developing a database, it's often quite clear how to build the data model. Relationships are easy to spot and build upon. Every now and then, however, a troublesome scenario arises: unstructured data. Each module in GainCMS, for example, (ie: a web page, a discussion forum, or photo gallery) has a base structure and a relationship to other modules but has additional properties that differ between modules. How do you create a relationship with a content structure that varies for the different entities contained within?
For those that had reviewed my original database diagram may have noticed a table that seemed disconnected from the rest. It was called "moduleExample". In the past, I have used the approach of creating a table for each module. The modules table would store the name of the table for that module type. Then, as a request came in for a particular object, I'd query the database to retrieve the module type and then query the database again to retrieve the content from the appropriate table.
There are a few downfalls to this scenario.
Doing free text searching becomes more problematic. Indexes have to be created on each table, a search has to be executed on each table, and the results have to be compiled in some manner. Querying the database twice creates more overhead just to get one chunk of data. Versioning can become more problematic as each module then has to keep track of its own.
Adding a new module to the application requires modifying code in various places. New database tables need to be added and search queries need to be updated. These disparate changes can introduce new bugs into the application — something we obviously want to avoid.
XML: A Better Way?
Is it XML to the rescue? I'm not one who uses XML all willy-nilly. It's got to serve a purpose and make my life easier. In this case, it may have done just that.
I've redone my ERD to try an approach that I've recently read about which will actually store all the data for every module in one table. I did this by storing data that is common to all modules as database fields. Nothing extraordinary. However, anything that is module specific gets stored in an XML string in a field of its own.
In this way, the functionality for displaying and storing module data only has to be developed once, indexing only has to be set up once, and no new database tables need to be added to handle new module types.
Why is the XML stored in the database and not on the file system?
I could go either way on this and if you prefer to store the data on the file system, I'm okay with that. I chose to store the data in the database to ensure that all data for a page stays together. As I look long term into issues of replication, it'll be easier to replicate the data if it's all in one place.
How to handle XML
I haven't taken into account the mechanism of handling the XML information. This will be determined once I've decided on a database server and programming language that I intend to develop in. I've chosen this approach because I felt it offered the most flexibility down the road.
If I go with MySQL, I can simply export the content as a string and then transform the XML via a server-side language such as PHP.
MS SQL Server 2000 supports some XML processing. The data would still be stored as a text string in the database but I could use the
sp_xml_preparedocument stored procedure and then use OPENXML to pull out specific nodes.
Oracle and the upcoming SQL Server 2005 support a native XML datatype allowing you to do XPath queries from the database. This would allow me to offset some of the XML processing work onto the database server.
PostGreSQL, unfortunately, doesn't have the greatest documentation, so it's not evidently clear how easy a task it will be to perform XML processing on the database server. There appear to be add-ons for current versions and version 8.0.0 of PostGreSQL might be including XML support directly. If you have any links on the matter, please add them to the comments.
With the database design almost done, a number of the technology decisions will need to be made and the development of the framework should begin shortly.