Thursday, November 17, 2011

Big Data Quality

Big Data means big data quality issues, right?  Well, of course, right.  Big Data means more data that can be bad or go bad one way or another.  Big, bad data could have big, bad consequences.  But just think about some of the ways Big Data may be in better shape than other data.

Big Data
  • is usually captured automatically, without manual intervention
  • often has been gathered over many years, so that the framework for capture and validation at the source has been improved and "debugged" over time. Various standards may also play a role in the data capture and ultimate quality. Examples include weather-related data and GIS data.
  • is often used in ways where analytics and conclusions improve with data volume, and errors in individual records become less important.  Data quality is essential for Business Intelligence (BI), but from some perspectives, and for some aspects of data quality, DQ may move into the background.
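The last point can be sketched quickly. The numbers and the corruption pattern below are purely illustrative, not from the post, but they show how aggregate statistics over a large record set shrug off a small fraction of bad records:

```python
import random
import statistics

# Illustrative sketch: 100,000 sensor-style readings with a small
# fraction of bad records. All values here are made up for the example.
random.seed(42)

TRUE_VALUE = 100.0
N = 100_000

readings = [random.gauss(TRUE_VALUE, 5.0) for _ in range(N)]

# Corrupt 0.5% of the records (say, zeroed out by a faulty sensor).
for i in range(0, N, 200):
    readings[i] = 0.0

mean = statistics.fmean(readings)
median = statistics.median(readings)

print(f"mean   = {mean:.2f}")    # still close to 100 despite bad records
print(f"median = {median:.2f}")  # even less affected by the outliers
```

At this scale the individual bad values barely move the aggregates, which is exactly why some aspects of record-level DQ recede into the background for volume-driven analytics.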
Big Data from Social Media has some additional considerations.
  • Capture mechanisms are well known: Facebook, email, Twitter, etc.
  • We know that the quality of information from these is highly questionable - that's the nature, and the beauty of the beast.
  • We also know that they are well structured. For example, an email has an easily determined structure: there is the header, the body, attachments, etc. The unstructured content (body, attachments) can be searched for relevant information and keywords. Bad data might be a corrupted attachment or garbled text in the body, but beyond that, errors are, almost by definition, not really bad data.
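The email point is easy to demonstrate with Python's standard-library `email` package. The message below is a made-up example; the point is that the header/body structure parses cleanly even when the body text itself is garbled:

```python
from email import message_from_string

# Hypothetical raw message -- addresses and text are invented for the example.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 numbers
Content-Type: text/plain

The attched figures look grbled -- resending later.
"""

msg = message_from_string(raw)

# Header fields are recovered without any cleansing effort.
print(msg["Subject"])            # Q3 numbers

# The body is messy human text, but it is still perfectly usable data.
body = msg.get_payload()
print("grbled" in body)          # True
```

The structure is intact and searchable; the "errors" live inside the content, where, as noted above, they are part of the data rather than bad data.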
  • What do you/we want from Social Media's Big Data? Mostly the trends of the masses. If you clean it up, that very exercise could corrupt the data.
Senile data forgets its source and loses relevance and accuracy

There is an altogether different situation with many of the nouveau trendy Corporate Big Data projects.  In this case, Big Data is likely to be consolidated data coming from a number of sources, including those suffering from data senility. Senile data has been through the wringer: moved from residence to residence, "cleansed," and perhaps never seeing the light of day. A data warehouse is usually populated with data from a huge number of sources; fallible humans have pored over it, run human-defined cleansing and validation algorithms against it, and then subjected it to manually programmed integration code.  It is incumbent upon the mining and analysis functions to accommodate assumptions about data quality.
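A minimal sketch of the kind of human-defined cleansing rule described above, with names and thresholds invented for illustration, shows how such a rule can itself induce data senility by silently discarding valid records:

```python
from datetime import date
from typing import Optional

def cleanse(record: dict) -> Optional[dict]:
    """Hypothetical warehouse-load cleansing rule; return None to drop."""
    # Assumption baked in by a programmer: birth dates before 1900 are "bad".
    born = record.get("born")
    if born is None or born < date(1900, 1, 1):
        return None  # record silently dropped -- a data-senility risk
    # Normalize the name field while we are at it.
    return {**record, "name": record.get("name", "").strip().title()}

rows = [
    {"name": "  ada lovelace ", "born": date(1815, 12, 10)},
    {"name": "grace hopper", "born": date(1906, 12, 9)},
]
cleaned = [r for r in (cleanse(r) for r in rows) if r is not None]
print(cleaned)  # Ada Lovelace is lost to an overzealous validation rule
```

The surviving data looks tidier, but the dropped record is gone without a trace, which is precisely why downstream mining and analysis must accommodate assumptions about what the cleansing stages did.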

So, as you can see, data quality and cleansing becomes an altogether different problem for Big Data.
