Monday, February 6, 2012

MDM's Unsustainable Tech Debt

One of the nice side-effects of Agile Integration Software is the ability to get useful master data easily and quickly. For most companies, an enterprise MDM project takes years to achieve value, and with the huge effort required to maintain each step, tech debt can skyrocket. Before the project is halfway done, changes and new data and sources have already impacted the viability of the outcome. For an MDM project to bring the promised inherent integrity that offers value, the tech debt simply cannot be ignored. Every single change anywhere in the MDM supply chain must be accommodated immediately and correctly throughout the interdependent process and data networks that constitute MDM.

Only stagnant companies don't change! And only a handful of data sets in any company will remain stable enough to last through the MDM implementation lifecycle. The bottom line is that with the current approach to MDM and the speed with which data is proliferating, MDM is a self-contradictory concept, and we are likely to see long-term initiatives slowly and expensively committing suicide.

MDM - long time-to-value:
Consider the components of the cost of implementing MDM.
  •  To start with, you can count on a costly (high six or seven figures) software purchase, likely requiring multiple products.
  •  A team of consultants with a range of expertise to implement the multi-year project.
  •  Internal resources to manage, guide, and work with the consultants
Consider the components of value of an MDM implementation.
  •  Assurance that everyone is using the same data for decisions
  •  Quality and correctness of data
Detracting from the validity, and therefore the value, of MDM:
  •  Continuously changing sources and formats of data. A heavy solution will have difficulty responding immediately to these changes, leaving gaps in the validity of the information.
  •  Latency of data availability due to staging data in a data warehouse or other database. With the current speed of business, users and decision makers need data that is as fresh as possible.
MDM - the building tech debt:

Is MDM even sustainable in its current manifestations? With the complexity of an implementation, the accumulation of tech debt begins as soon as the first Master Data is defined. Every step along the implementation path is fraught with instability.
  •  Defining each master data schema that can be everything to everyone who needs the data
  •  Determining the most correct sources for each component of the master
  •  Determining the criteria for correctness of the data
  •  Determining the optimal refresh time
  •  Designing a database or data warehouse entry appropriate to the master
  •  Implementing the integration necessary to populate the master data store ...or,
  •  Defining and implementing ways users can access and aggregate the data components directly from the sources
This doesn't include all the discovery and such that the experts and company team must perform. Clearly this constitutes a large, time-consuming effort that generally is nowhere near agile or responsive to changes in the company, systems, and requirements.
The MDM project is like an armadillo walking down the hill with its tech-debt snowball rolling behind, growing bigger at every step, threatening to consume MDM. Remember that the snowball started accumulating before the first master data was ever used. It is conceivable that the speed of growing tech debt means that the time to value is infinite (never happens)!

Here is another opportunity for rescue by the paradigm of Agile Integration Software.


More on tech debt:
http://agileintegrationsoftware.blogspot.com/2011/04/hows-your-tech-debt.html
http://agileintegrationsoftware.blogspot.com/2011/05/lean-and-mean-beats-sloth.html





Monday, November 21, 2011

Data Quality and your Enabled Enterprise

As a good example of Agile Integration Software, Enterprise Enabler's data quality features and capabilities make for a representative discussion (http://www.enterpriseenabler.com/). In the context of data integration, I tend to think of data cleansing and profiling in two separate categories, "batch" and "in transit," or "real time."

Batch - Often this is performed as a first-step-project to an integration implementation to ensure that any existing data that is being used is as correct as possible.  The context of correctness is generally defined by the source for which it exists. When the source is an existing data warehouse, the correctness is usually considered with respect to a pre-defined master data definition.

In-Transit or Real-Time - Once the integration is in place, new data is being generated and flows through the organization and systems via the agile integration framework. This data must be validated as soon as it appears in play, as well as when it is passed to its destination, since the definition of "correctness" is ultimately determined by the target use.

With Agile Integration, the philosophy is to focus on the data required for the purpose of the project at hand. While cleansing/validating an entire database or data warehouse full of data may be important, the chances are that it is not important for any particular integration project.  Addressing the subset needed means a more efficient project and faster time-to-value.

Pre-validating existing data

Using the inherent capabilities of Enterprise Enabler to discover data schemas and objects, one can simply "point" the appropriate AppComm (application communicator) to a database or application that is to become a source to the integration, and the schema or services available are presented. Select the tables, fields, objects, etc. of interest, and grab a sample or the full set of data. In a configured process, the data can be cleaned, validated and standardized using pre-built rules, external tools, or special logic for each unit of data, by field, by record, or by other cross-section.  Rules for logging, notifications, and mediation are configured as part of the process. With this approach, you are focused specifically on the data that will be used for the subsequent integration, and a staging database is not required. Once this process is configured, it can be triggered to automatically run as desired to ensure ongoing monitoring and validation of new data. The results can be fed to a BI tool or spreadsheet for statistical analysis on the data quality ("profiling").
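To make the idea of per-field and per-record rules concrete, here is a minimal sketch in Python. It is not Enterprise Enabler's rule engine; the `Rule` type, the rule names, and the sample fields (`customer_id`, `country`, `balance`) are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rule type: a label plus a predicate over one record.
@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]

# Illustrative rules only; a real project would define its own.
RULES = [
    Rule("customer_id present", lambda r: bool(r.get("customer_id"))),
    Rule("country is 2 letters",
         lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2),
    Rule("balance non-negative",
         lambda r: isinstance(r.get("balance"), (int, float)) and r["balance"] >= 0),
]

def standardize(record: dict) -> dict:
    """Trim whitespace on string fields and upper-case the country code."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    if isinstance(cleaned.get("country"), str):
        cleaned["country"] = cleaned["country"].upper()
    return cleaned

def validate(records):
    """Return (clean_records, issues); issues holds (record index, rule name)."""
    clean, issues = [], []
    for i, raw in enumerate(records):
        rec = standardize(raw)
        failed = [rule.name for rule in RULES if not rule.check(rec)]
        if failed:
            issues.extend((i, name) for name in failed)
        else:
            clean.append(rec)
    return clean, issues

sample = [
    {"customer_id": "C1", "country": " us ", "balance": 10.0},
    {"customer_id": "", "country": "USA", "balance": -5},
]
clean, issues = validate(sample)
```

The `issues` list is the kind of output that could be logged, routed to notifications, or handed to a spreadsheet or BI tool for profiling.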

With the AppComm approach, combined with the ability to easily create virtual relationships across disparate sources, cross validation ("matching")  across systems or merging data to enrich it, becomes a reasonable exercise, without having to design and build a consolidated staging database. Of course, if the situation still requires a staging database, there's no more efficient way to populate it than Enterprise Enabler. 

After you have completed this step, the chances are that the new data that will be captured from here forward needs to be cleansed, too. This can be done "real-time" as it is being acquired from the source and passed to the federation and transformation steps of an integration.

Validating data on-the-fly

As is the nature of Agile Integration, Enterprise Enabler offers multiple places where data cleansing, validation, and remediation can be managed within the flow of data through an integration. Some amount of detection of erroneous data is done as a natural part of the data acquisition by the intelligent AppComm technology.  Driven by metadata definitions, AppComms check not only for valid data (type, format, etc.), but also for the expected schema.  Additionally,

o    Validation/cleansing rules, pre-built processes or  3rd party tools can be dropped in or invoked for detection and mediation at various points in execution:
·    As soon as the data has been acquired
·    As it is being transformed and merged with other sources
·    After it has been transformed
·    By the destination's AppComm before/as the data is being posted (plus transaction rollback and assurance in the case of multiple destinations)
·   Anywhere in the data workflow process surrounding the transformations
o      Enterprise Master System ensures that the data comes from the correct source when an end user invokes a particular piece of information.
o     Since Enterprise Enabler's user interface ("Designer Studio") is tied directly to a copy of the run-time engines, as you design an integration, you can do a trial run from the studio and see a sample of the data for inspection to get an idea of the quality of data you are dealing with.
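The checkpoints listed above can be pictured as gates between the stages of a flow. The following Python sketch is only an illustration of that shape, not how Enterprise Enabler is built; `run_flow`, the stage names, and the two sample checks are hypothetical.

```python
def run_flow(acquire, transform, post, checks):
    """Run acquire -> transform -> post, stopping at the first failed gate.

    `checks` maps a stage name to a list of functions that return an
    error message, or None when the data passes.
    """
    log = []

    def passes(stage, data):
        errors = [check(data) for check in checks.get(stage, [])]
        errors = [e for e in errors if e]
        log.extend((stage, e) for e in errors)
        return not errors

    rows = acquire()
    if not passes("acquired", rows):        # as soon as data is acquired
        return False, log
    merged = transform(rows)
    if not passes("transformed", merged):   # after transformation
        return False, log
    post(merged)                            # only clean data reaches the destination
    return True, log

def no_missing_ids(rows):
    return None if all(r.get("id") for r in rows) else "missing id"

def totals_positive(rows):
    return None if all(r["total"] > 0 for r in rows) else "non-positive total"

def to_totals(rows):
    return [{"id": r["id"], "total": r["qty"] * r["price"]} for r in rows]

CHECKS = {"acquired": [no_missing_ids], "transformed": [totals_positive]}

posted = []
ok, log = run_flow(lambda: [{"id": "A", "qty": 2, "price": 3.0}],
                   to_totals, posted.extend, CHECKS)

rejected = []
bad_ok, bad_log = run_flow(lambda: [{"id": "", "qty": 1, "price": 1.0}],
                           to_totals, rejected.extend, CHECKS)
```

In the second run the gate after acquisition fails, so nothing is transformed or posted, and the log records which stage caught the problem.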

Still don’t trust your data?

Sometimes there are situations where validation rules just won't cut it. Example: setting hard minimum and maximum values for something coming from a physical processing plant. You may be able to determine a reasonable range, but only with the knowledge of what happened yesterday will you be able to determine that a "way out of whack" set of numbers is actually due to a disruption at some part of the plant yesterday. Enterprise Enabler has a preview/analysis feature that holds the result data (post transformation and process) just before it is posted to the destination, in a virtual store, only to be released and posted after review and approval by an authorized human being.  That person can do quick tests on ranges, averages, etc. as a gut feel reality check and then fix it if necessary before releasing the data set.
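As a rough picture of a hold-for-review step, consider the sketch below. It is an assumption-laden illustration, not the product's preview/analysis feature: the `HeldBatch` class, its methods, and the `flow` field are made up for the example.

```python
from statistics import mean

class HeldBatch:
    """Holds result rows in memory until a reviewer releases them."""

    def __init__(self, rows, post):
        self.rows = list(rows)
        self._post = post          # callable that writes to the destination
        self.released = False

    def summary(self, field):
        """Quick range and average for a gut-feel reality check."""
        values = [r[field] for r in self.rows]
        return {"min": min(values), "max": max(values), "avg": mean(values)}

    def correct(self, index, field, value):
        """Let the reviewer fix a value before release."""
        self.rows[index][field] = value

    def release(self, approver):
        if not approver:
            raise PermissionError("an authorized reviewer must approve the release")
        self._post(self.rows)
        self.released = True

destination = []
batch = HeldBatch(
    [{"flow": 100.0}, {"flow": 104.0}, {"flow": 9000.0}],
    destination.extend,
)
stats = batch.summary("flow")        # the 9000.0 reading stands out
batch.correct(2, "flow", 102.0)      # reviewer knows about yesterday's disruption
batch.release("shift.supervisor")
```

Nothing reaches `destination` until `release` is called with an approver, which is the point of the hold.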

And for those of you who care about data governance

Only an AIS is a single end-to-end integration solution. This means that security can be maintained throughout the integration infrastructure.  Developers and Data Analysts log in with the permissions of their role and group, and anything they build or change is logged with a who-what-when stamp. Every object in Enterprise Enabler is locked down in such a way, preventing intentional or accidental diversion or modification of data and their flows through the enterprise.

And what about bad data in your ERP?

My apologies, but I just can't help saying to the ERP vendors, "shame on you" for not taking the responsibility to ensure that the data captured and generated by your system is completely correct.  How could you let that happen? People trusted you!  Ok. Ok.. I'll stop short of calling for an "Occupy ERP" movement.

All together...

With all of the various angles on Data Quality, it’s clear that Agile Integration inherently brings a range of capabilities that are simply not possible with other DQ products. Whether you are looking to correct existing data or ensure the quality of new data as it is created, the fact that the data quality is handled as a natural aspect of integration means a more efficient overall solution.


Thursday, November 17, 2011

Big Data Quality


Big Data means big data quality issues, right?  Well, of course, right.  Big data means more data that can be bad or go bad one way or another.  Big, bad data could have big bad consequences. But just think about some of the ways Big Data may be in better shape than other data.

Big Data
  • is usually captured automatically, without manual intervention
  • often has been gathered over many years, so that the framework for capture and validation at the source has improved and been "debugged" over time. Various standards may also play a role in the data capture and ultimate quality. Examples might be weather related data and GIS data.
  • is often used in ways where analytics and conclusions improve with data volume and errors in individual data become less important.  Data quality is essential for Business Intelligence (BI), but from some perspectives, and some aspects of data quality, DQ may move into the background.  
Big Data from Social Media has some additional considerations.
  • Capture mechanisms are well known. Facebook, emails, Twitter, etc.
  • We know that the quality of information from these is highly questionable - that's the nature, and the beauty of the beast.
  • We also know that they are well structured. For example an email has a very easily determined structure: there is the header, the body, attachments, etc. The content of the unstructured data (body, attachments) can be searched for relevant information and key words. Bad data might be a corrupted attachment or garbled text in the body, but other than that, errors are, almost by definition, not really bad data.
  • What do you/we want from Social Media's Big Data? Mostly the trends of the masses. If you clean it up, that very exercise could corrupt the data.
Senile data forgets its source and loses relevance and accuracy

There is an altogether different situation with many of the nouveau trendy Corporate Big Data projects.  In this case, big data is likely to be consolidated data coming from a number of sources, including those suffering from data senility. Senile data has been through the wringer, moved from residence to residence, been "cleansed" and perhaps never saw the light. A data warehouse usually is populated with data from a huge number of sources, and fallible humans have pored through it, run human-defined cleansing and validation algorithms, and then subjected it to manually-programmed integration code.  It is incumbent upon the mining and analysis functions to accommodate assumptions about data quality.

So, as you can see, data quality and cleansing become an altogether different problem for Big Data.


Monday, October 31, 2011

Mainframe nearly to the cloud…

Most people know that Salesforce.com is one of the first and certainly most successful SaaS (Software-as-a-Service) applications on the market.  One good thing is that Salesforce stores all the data in the cloud and manages it, eliminating the need for their customers to have the skills and the hardware, software, and maintenance costs  to keep it on-premise.  That good thing is also the biggest  downside of SaaS: the concern that the data is stored in the cloud.   Unfortunately,  companies worry about having their data stored off-premise with very little control over its management, security, and perhaps even accessibility. 
 
Nevertheless, Salesforce.com has a huge customer base and offers business functionality important to every business I can think of. While business sectors like financial institutions and healthcare could easily make valuable use of the  functionality of Salesforce and other cloud apps, the risk and regulatory restrictions make storing their data in the cloud impossible.  
 
These institutions simply cannot make copies of their data or move it to the cloud. The data that is inherent to the functionality of Salesforce may not necessarily be the concern, but often it must be presented to users side-by-side with ancillary data that must come from the company's backend, on-premise systems.

But all is not lost!  Agile Integration Software (AIS) naturally solves this problem by creating federated views from multiple sources and making them available to any application, complete with end user access authentication. Here is the crux of the solution with Salesforce as the example:

1.     Salesforce.com offers the capability of modifying the screens, so anyone who is conversant in doing that can modify a screen to populate the data from an external source. One option would be to configure it to call a web service when the screen is presented or refreshed.

2.    Within a few minutes, an Agile Integration Software, such as Stone Bond's Enterprise Enabler Virtuoso, can be configured, generating metadata that virtualizes and aligns backend data with Salesforce data, and packages it as a web service compliant with Salesforce. Optionally, this would be a bi-directional (Read/write) connection.

3.    When an end user brings up the Salesforce page, Salesforce calls the web service, and Enterprise Enabler Virtuoso accesses the on-premise data live, aligns it with the relevant Salesforce data, and sends it to the Salesforce screen. With the bi-directional option, data can be entered or corrected on the screen to automatically update not only the Salesforce data, but also the on-premise data, assuming the end user has proper permissions to write back to those systems.
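The three steps above boil down to building a combined view at request time and guarding write-back with permissions. The Python sketch below only illustrates that logic. It does not call any Salesforce or Enterprise Enabler API; the in-memory dictionaries stand in for the real systems, and the function names and role names are assumptions.

```python
# Stand-ins for the real systems (hypothetical data).
ON_PREM = {"ACME": {"credit_limit": 50000, "open_invoices": 3}}
CRM = {"ACME": {"name": "Acme Corp", "owner": "jlee"}}

def federated_account_view(account_id, user_roles):
    """Build the combined view on each request; no copy is stored in the cloud."""
    if "account_viewer" not in user_roles:
        raise PermissionError("user may not view this account")
    view = dict(CRM[account_id])
    view.update(ON_PREM[account_id])   # read live from the backend stand-in
    return view

def write_back(account_id, field, value, user_roles):
    """Route an edit made on the screen to whichever system owns the field."""
    if "account_editor" not in user_roles:
        raise PermissionError("user may not update backend data")
    target = ON_PREM if field in ON_PREM[account_id] else CRM
    target[account_id][field] = value

view = federated_account_view("ACME", {"account_viewer"})
write_back("ACME", "credit_limit", 60000, {"account_viewer", "account_editor"})
updated = federated_account_view("ACME", {"account_viewer"})
```

In a real deployment the view function would sit behind the web service that the page calls, and the permission check would rely on the end user's authenticated identity.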

Companies have spent millions of dollars over the last few years trying to do this, and with the Agile Integration Software as the basis, Enterprise Enabler Virtuoso was configured in three weeks to incorporate this Salesforce connectivity. Now it is available off-the-shelf so that anyone can implement it in a few minutes or at most a day. 

The diagrams below depict the data residence and flow where on-premise data is required in a Salesforce.com implementation. The first is the common solution where a copy of the on-premise data is made and resides on the Salesforce cloud. I don’t need to tell you the overhead and pervasive concern with doing this.  The second shows the on-premise data remaining on-premise, where it belongs, and AIS accessing, federating, and delivering a data view virtually to the Salesforce page.
 

http://tiny.cc/id5h0

Tuesday, October 11, 2011

MDM - Making It Actionable/Transactional as You Define It

How useful is your MDM… really? Does it just sit there in a repository, waiting for your MDM team to update it?

One of the common criticisms of MDM projects is the magnitude of the project and the low ROI. More than likely, you are in the middle of a project with great expectations of value.




Metadata and MDM

When most people think of metadata, the scope is limited. It's a schema that defines a virtual data set, for example. It may include a cross-reference in a lookup table. And maybe it includes definitions of what each element means and what unit of measure it is in.

Then what?

Then you have to add references to where the data ought to come from.

But then what?

You've spent quite a lot of resources defining this. Are you any better off than with the ancient "Corporate Dictionary?" How do you actually use it?

The most common ways to implement Master Data definitions are indicative of Big Projects:

1)    Define a data warehouse to store the data in, so that it is accessible in the form defined in the Data Master. Once the data warehouse is designed, corresponding integration must be built to populate it from the appropriate sources, aggregating and transforming as needed, as often as necessary for minimal latency.
2)    Write web services to access the data from the sources and make them available as Master Data sets.

When I talk about metadata, I think in terms of representing not only the data schemas but also the metadata that describes where the data is, what part of it is relevant, how it aligns with other data of interest, how you or the real or virtual destination (master) needs to see it, and how it must be converted, or mapped, to be meaningful to the destination.

Then there are the events that trigger data flows, and all the surrounding logic, notifications, security, and a host of other things. If you can capture all of this information as metadata, in reusable, separable "layers," you will have a highly flexible and "actionable" collection of metadata.

If you define a metadata Master, say, "Customer," for use corporate-wide, you will have several different sources that are in play to ensure that the various parts of the virtual "Customer" definition have the best information from the most appropriate sources. Part may come from your ERP, part from Salesforce.com, and another part from an Oracle database.

· Does your Master definition encapsulate everything you
  need to use the data?

· Can your metadata be pumped onto a message bus?

· Can it be packaged as a web service?

· As an ADO.net object?

· As a SharePoint external content type?

· Does it incorporate the capabilities to perform CRUD
  (Create, Read, Update and Delete) operations at the
  endpoints?

· If one of the source schemas changes, do you have to do
  anything to accommodate it?

· Do you even need to know a source changed?

If I'm a programmer, I want to leverage the corporate Master Data for my programs and the users of my programs. I can look up the data definitions, sources, etc., and use them, but that still requires a lot of work. When the Master Data includes a full set of metadata, then all I have to do is invoke the web service or External Content Type in SharePoint, or ADO.net and so on. I simply select the Master I need and indicate how I want to use it. I don't need to know what the various sources even are, and if the source changes, I won't need to make any changes, since the metadata will reflect what it needs to. And I can pass that selection process on to the end user of my application or dashboard.

The diagram above shows the scope of metadata captured for MDM by Agile Integration Software. The metadata is generated from a GUI and has an atomic structure so that a change to any metadata can be made without impacting the whole hierarchy of metadata. Using this type of metadata infrastructure, changes are absorbed without creating waves. Data is accessed directly from the original source, eliminating the need for a costly data warehouse to resolve virtual relationships across sources.
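One way to picture "atomic" metadata is a Master definition in which each field carries its own source, source field name, and conversion. The Python sketch below is a toy model under that assumption, not the metadata format any product generates; `SOURCES`, `CUSTOMER_MASTER`, `resolve`, and every field name are hypothetical.

```python
# Hypothetical source connectors: each returns a raw record for a key.
SOURCES = {
    "erp": lambda key: {"CUST_NM": "Acme Corp", "CR_LIM": "50000"},
    "crm": lambda key: {"owner_email": "JLEE@EXAMPLE.COM"},
}

# One entry per master field: (source, source field, conversion).
CUSTOMER_MASTER = {
    "name": ("erp", "CUST_NM", str),
    "credit_limit": ("erp", "CR_LIM", int),
    "owner_email": ("crm", "owner_email", str.lower),
}

def resolve(master, key, sources=SOURCES):
    """Assemble one master record directly from the sources, on demand."""
    fetched = {}
    record = {}
    for field, (source, source_field, convert) in master.items():
        if source not in fetched:               # query each source once
            fetched[source] = sources[source](key)
        record[field] = convert(fetched[source][source_field])
    return record

before = resolve(CUSTOMER_MASTER, "ACME")

# A source change touches a single metadata entry; callers are unaffected.
SOURCES["finance"] = lambda key: {"limit": 75000}
CUSTOMER_MASTER["credit_limit"] = ("finance", "limit", int)

after = resolve(CUSTOMER_MASTER, "ACME")
```

The consumer calls `resolve` the same way before and after the change, which is the sense in which the programmer does not need to know that a source moved.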

Monday, September 26, 2011

Atomic Architectures for Flexibility and Best Time-to-Value

Big Data definitely doesn't scare me as much as Big Projects. The good thing is that Agile Integration and cloud solutions, along with the pervasive viral nature of Social Media, are fueling a shift away from Big Projects and toward incremental atomic approaches with highly reduced time to value.


Historically, Big Projects have been the only way to solve IT problems for Big companies. I've watched "generations" of IT management fall for the "next, next Big Project" promoted by Big hardware companies, Big Systems Integrators, Big-time analysts. After all, who are you going to trust to set the direction for your Big company? The Big waves always are very well sold, and for the newbies, there is an air of doing something really new and really Big. Of course there's also Big money involved, enough to keep the economy healthy, maybe. As soon as one Big wave of Big Projects is several years in progress, the next Big begins to emerge and put the last one out of business before most are completed. Many stall, are pared way down to the only working prototype, or are abandoned altogether to be replaced with a fresh new Big approach.

Big Projects started long ago, but in the last twenty or so years they have included:

Defining a single corporate database, planning for all the applications to share that same db
Corporate Dictionary - standardizing the data names and documenting the source
ERP - A single comprehensive application means that you don't have to rewrite all the apps to use that db
EAI - to address the reality that the above two Big Projects can't be realized
Business Process Re-engineering (BPR) - Shifting focus from data to processes- Big Consulting Projects with no need to know much of anything about technology
Change Management - because radical BPR created lots of employee issues and confusion
Data Warehouse (DW) - in spite of the intentions, smaller projects and best of breed applications were more successful than Big Projects, and businesses came to rely heavily on those systems. Data Warehouses were supposed to bring all the data together for reporting.
Business Intelligence (BI) - analyze the data in the DW.
MDM - the modern Big Project for a corporate dictionary.

There were others, of course, but you get the idea. Finally wedges are putting crevices in the Big Project and opening it for solutions that are more atomic and less global. Some of the wedges are being driven by:

○ SOA
○ SaaS
○ Agile Integration Software (AIS)
○ Social Media
○ The economy and the imperative for improved time-to-value on projects

These factors open the floodgates for a bifurcation of approaches to enterprise technologies. As Yogi Berra said, "When you come to a fork in the road, take it." Traditionally Big technologies, like BI and ERP, are now offered as a cloud-based service and for single users without the overhead of Big.

These wedges are all eroding the cornerstone of Monolithic solutions. For example, SOA is inherently atomic, with an enterprise solution being a collection of SOAP objects. While the initial SOA initiatives were envisioned as enterprise-wide, in the end even the prototype projects were Big, long, and difficult. If the technology were not built on reusable components, the ongoing work that continues to be done would likely have been abandoned. Similarly, while data warehouses continue to be expanded for Business Intelligence, we are seeing a huge number of BI tools coming on the market for specific use or end user-centric implementation. Cloud computing is also whittling away at Big Projects, with significant cost and time reductions as well as shorter time to value.

One of the interesting things is that this split is creating an environment where emerging technology waves now may have two completely different interpretations, one the old Big approach and the other a more agile and atomic approach.

Take data federation and virtualization, for example. The Big approach is to define a complete (or at least really Big) virtual enterprise data model for federation that acts like a staging database would, and then to implement the integration across and through the virtual staging model. Of course, at some point, it's necessary to define those integrations based on what the end result datasets or use happen to be for the consumers.

The new fork in the road (which I would take) requires no data model, virtual or not. An Agile Integration Software addresses federation and virtualization in an atomic manner, with the end use the initial driving force. Entities that describe, for example, "customer" are defined, the source of record for each piece of the Customer data is identified, and metadata is auto-generated and packaged to grab the data from the sources, federate it "on the fly" and deliver it to the calling program, end user or data workflow on demand or in an event-driven manner. An atomic approach to MDM naturally follows.

Hooray for the fork in the road!

Tuesday, August 23, 2011

Query Optimization across Apples and Oranges


I just recently realized that the problem of federated query optimization that my colleagues and I think about is a completely different problem from the one that has been so well addressed by academics and big database vendors. Even the more contemporary players in the federation and virtualization world don’t extend this concept across disparate sources, and they focus only on run-time speed, but not agility.

 
Those approaches simply do not address the reality that is brought to the forefront now that we have integration solutions that federate everything from web services, spreadsheets, medical instruments, social media, and many other sources, including relational databases in a single "query." The fundamental value of Agile Integration Software (AIS) is violated by the inherent constraints posed by the query optimization tools on the market.


       •        What good to us is a query optimizer that assumes all of the
              data sources are relational databases?

       •        And adding XML to the mix just doesn't "cut the mustard"!

       •     What if, in order to use these tools, I have to construct a
              universal data model that includes all of the data that could
              possibly be in play? (The clunky antithesis of agility!)

       •     Do I have to anticipate every data query I might want to
             optimize?

       •    What if there is a lot of transformation that needs to be
             performed along the way to make the data meaningful
             across the sources?


For "pull" integration, where a user's browser interaction or a calling program triggers and specifies the data to be accessed, a SQL query is a universally comfortable way to access information. For a live query in virtual federation, the query needs to be interpreted by the federating software into whatever the endpoints understand. The data flowing in from multiple connections needs to be synchronized as the query is being fulfilled from the disparate systems. A "push" integration is usually better known, with at least the sources pinned down ahead of time, and often with the exact data being sent each time.

 
In our world, performance is a different problem from typical query optimization on or across relational databases. In complex cross-application joins, the critical path is often more related to the I/O speed of one of the applications or the frequency of disbursement of data, or some other macro factor. The join and access order logic, for example, can be tuned to accommodate the highest resource consumer.
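As one illustration of tuning access order around the most expensive source, the Python sketch below queries the cheapest source first and asks the slow application only for the keys that are still needed. This is a generic key-reduction tactic chosen for the example, not a description of any product's optimizer; the cost figures, source names, and functions are assumptions.

```python
def plan_access_order(sources):
    """Cheapest source first, so the costliest one is asked for the least."""
    return sorted(sources, key=lambda s: s["est_seconds_per_row"])

def federated_join(sources, key):
    """Join rows across sources on `key`, narrowing the key set as we go."""
    ordered = plan_access_order(sources)
    first, rest = ordered[0], ordered[1:]
    rows = {r[key]: dict(r) for r in first["fetch"](None)}   # None = all rows
    for src in rest:
        wanted = set(rows)
        fetched = {r[key]: r for r in src["fetch"](wanted)}
        rows = {k: {**rows[k], **fetched[k]} for k in rows if k in fetched}
    return list(rows.values())

requests_to_slow_app = []

def fast_sheet(keys):
    # Stand-in for a small spreadsheet source.
    return [{"id": 1, "region": "TX"}, {"id": 2, "region": "LA"}]

def slow_app(keys):
    # Stand-in for an application with slow I/O; records what it was asked for.
    requests_to_slow_app.append(keys)
    data = {
        1: {"id": 1, "volume": 40},
        2: {"id": 2, "volume": 55},
        3: {"id": 3, "volume": 70},
    }
    return [row for k, row in data.items() if keys is None or k in keys]

sources = [
    {"name": "plant_app", "est_seconds_per_row": 0.5, "fetch": slow_app},
    {"name": "sheet", "est_seconds_per_row": 0.001, "fetch": fast_sheet},
]
result = federated_join(sources, "id")
```

Here the slow application is asked for two rows instead of three. With real sources the saving depends on how selective the cheap source is and on whether the slow endpoint can filter by key at all.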

 
So you can see that our problem is not the same one. When people ask us about query optimization, we are sometimes talking apples and oranges!