Tuesday, August 23, 2011

Query Optimization across Apples and Oranges

I just recently realized that the problem of federated query optimization that my colleagues and I think about is a completely different problem from the one that has been so well addressed by academics and big database vendors. Even the more contemporary players in the federation and virtualization world don’t extend this concept across disparate sources, and they focus only on run-time speed, but not agility.

Those approaches simply do not address the reality that is brought to the forefront now that we have integration solutions that federate everything from web services, spreadsheets, medical instruments, social media, and many other sources, including relational databases in a single "query." The fundamental value of Agile Integration Software (AIS) is violated by the inherent constraints posed by the query optimization tools on the market.

       •        What good to us is a query optimizer that assumes all of the
              data sources are relational databases?

       •        And adding XML to the mix just doesn't "cut the mustard!

       •     What if, in order to use these tools, I have to construct a
              universal data model that includes all of the data that could
              possibly be in play? (The clunky antithesis of agility!)

       •     Do I have to anticipate every data query I might want to

       •    What if there is a lot of transformation that needs to be
             performed along the way to make the data meaningful
             across the sources?

For "pull" integration, where a user's browser interaction or a calling program triggers and specifies the data to be accessed, a SQL query is a universally comfortable way to access information. For a live query in virtual federation, that needs to be interpreted by the federating software into whatever the endpoints understand. The data flowing in from multiple connections needs to be synchronized as the query is being fulfilled from the disparate systems. A "push" integration typically is usually better known, with at least the sources pinned down ahead of time, and often with the exact data being sent each time.

In our world, performance is a different problem from typical query optimization on or across relational databases. In complex cross-application joins, the critical path is often more related to the i/o speed of one of the applications or the frequency of disbursement of data, or some other macro factor. The join and access order logic, for example, can be tuned to accommodate the highest resource consumer.

So you can see that our problem is not the same one. When people ask us about query optimization, we are sometimes talking apples and oranges!




1 comment:

  1. The blog article very surprised to me! Your writing is good. In this I learned a lot! Thank you... Thanks!