Database Design: Handling Unstructured Data
In a followup to my previous article on Database Design in the Building a Web Application series, I touch on how to handle unstructured data.
When developing a database, it's often quite clear how to build the data model. Relationships are easy to spot and build upon. Every now and then, however, a troublesome scenario arises: unstructured data. Each module in GainCMS, for example, (ie: a web page, a discussion forum, or photo gallery) has a base structure and a relationship to other modules but has additional properties that differ between modules. How do you create a relationship with a content structure that varies for the different entities contained within?
Multiple Tables
For those that had reviewed my original database diagram may have noticed a table that seemed disconnected from the rest. It was called "moduleExample". In the past, I have used the approach of creating a table for each module. The modules table would store the name of the table for that module type. Then, as a request came in for a particular object, I'd query the database to retrieve the module type and then query the database again to retrieve the content from the appropriate table.
There are a few downfalls to this scenario.
Doing free text searching becomes more problematic. Indexes have to be created on each table, a search has to be executed on each table, and the results have to be compiled in some manner. Querying the database twice creates more overhead just to get one chunk of data. Versioning can become more problematic as each module then has to keep track of its own.
Adding a new module to the application requires modifying code in various places. New database tables need to be added and search queries need to be updated. These disparate changes can introduce new bugs into the application — something we obviously want to avoid.
XML: A Better Way?
Is it XML to the rescue? I'm not one who uses XML all willy-nilly. It's got to serve a purpose and make my life easier. In this case, it may have done just that.
I've redone my ERD to try an approach that I've recently read about which will actually store all the data for every module in one table. I did this by storing data that is common to all modules as database fields. Nothing extraordinary. However, anything that is module specific gets stored in an XML string in a field of its own.
In this way, the functionality for displaying and storing module data only has to be developed once, indexing only has to be set up once, and no new database tables need to be added to handle new module types.
Why is the XML stored in the database and not on the file system?
I could go either way on this and if you prefer to store the data on the file system, I'm okay with that. I chose to store the data in the database to ensure that all data for a page stays together. As I look long term into issues of replication, it'll be easier to replicate the data if it's all in one place.
How to handle XML
I haven't taken into account the mechanism of handling the XML information. This will be determined once I've decided on a database server and programming language that I intend to develop in. I've chosen this approach because I felt it offered the most flexibility down the road.
If I go with MySQL, I can simply export the content as a string and then transform the XML via a server-side language such as PHP.
MS SQL Server 2000 supports some XML processing. The data would still be stored as a text string in the database but I could use the sp_xml_preparedocument
stored procedure and then use OPENXML to pull out specific nodes.
Oracle and the upcoming SQL Server 2005 support a native XML datatype allowing you to do XPath queries from the database. This would allow me to offset some of the XML processing work onto the database server.
PostGreSQL, unfortunately, doesn't have the greatest documentation, so it's not evidently clear how easy a task it will be to perform XML processing on the database server. There appear to be add-ons for current versions and version 8.0.0 of PostGreSQL might be including XML support directly. If you have any links on the matter, please add them to the comments.
Coming Up
With the database design almost done, a number of the technology decisions will need to be made and the development of the framework should begin shortly.
Conversation
I've been running into the same problem lately so this topic has piqued my interest. I was also planning to use PostgreSQL as the database, and store the XML as a string in a table. My thoughts at this point on how to access that data in an easy manner from within Java would be to write a corresponding XML Schema definition for each sort of module's data type, and use the XMLBeans project to turn the XML tree into a tree of POJO Java objects which I can use directly in the code.
I am still in the design on paper phase and this technique is not concrete. I would be interested to hear about other ways I might be able to do this with a minimum amount of pain.
I'm curious... could you post the resource that you mentioned above that had this idea in it?
To be honest, Justin, I'm not sure where I came across the idea initially. But I wanted to be clear that I wasn't being original when I presented the concept. I did a search on Google using keywords like: "unstructured data", "flexible structure", which led to terms such as "storing and retrieving xml" along with the different DBMSs. I went through about 100 sites and a number of discussions on Google Groups.
I've been using this same concept in a slightly different manner. I have a bunch of articles that share a database, but many of the articles have different content (one might be a photo album, another is a simple announcement, another is a job posting, etc.) Most of the data used for presenting 'em in a website is consistent -- date, type, title, "blurb," etc. So that all goes in database fields. Then I have a big ol' text field in the database where I've chosen to store a string of PHP-serialized data rather than XML. (Significantly less processing, more integrateable with the language, but less portable.)
It's been working *very* well, allowing rapid development of new article/content types to fit into our content management system without rewriting any existing code (just the input/output handlers for the article types). I didn't have a source for the concept; it just made sense to do it that way. (However, I very likely read something about it years ago and so had the concept in my mind somewhere already.)
OK so this may be really left field, but. Could the database, be just one XML field (still in a database for speed and search, but optionally filebased) and have the whole templating, delivery mechanism with XSLT.
Second mad idea. What about having the whole site's data available via a RSS link so that you could get mirror and backup for the cost of a small tool.
I guess you can tell I am new to this, but then maybe one day an idea will be of value (grin).
Yours in admiration that you are designing this in public.
Good idea, this XML lark. I'm writing a CMS and am using XML to store flat file cached backups of each page, from which the viewable page is created, but your solution would mean much less proessing to get the data into that format to start with.
I'm not sure what the tradeoffs would be in terms of searching (if someone searched for "XML" would they get every item?) but it's certainly a very interesting way of thinking about things.
One of the possibilities that I thought of to circumvent the search issue is to build a related table that only stores a key to the objects table and a filtered version of the content for filtering. Any searches would be done on the indexed table with the results joined to the objects table to pull out the content for display.
The filtering could remove unnecessary words, xml/html tags, and provide any other filtering that might help improve search results.
I want to know how can we insert record in XML file using web application PHP HTML etc.