La Plus ca change - Different Year, same BI Issues
I attended a lively panel discussion on "BI Challenges and Success Factors" last week. It was an energetic, impassioned discourse with highly credible panel members who are in the middle of rolling out large, enterprise-wide BI solutions. What struck me the most were the parallels with similar panels I attended and participated in back in the 90's - the key BI technical issues (people issues are a different can of worms) remain ETL (extract, transform, load), and Data Quality. The media can talk about "Big Data" and visualizations all they want, but the overriding challenge for BI professionals in the trenches remains how to get data into their cube, and how to ensure that it is of high quality.
Creating a robust ETL infrastructure requires a great deal of design, planning, and elbow grease. Any BI solution of any value is going to be drawing from multiple data sources. Each of these sources needs to be addressed separately. There are some automated tools that can help with ETL, but in many ways ETL implementation remains a manual process. A BI Analyst has to review the data and figure out how to get it out of the source solution and into the BI cube. There are many factors that have to be taken into account during this process - sometimes the source data is in imperial units and the cube is in metric. Sometimes revenue is stored in thousands of dollars in the source system, but in millions of dollars in the cube. Maybe one source stores time values based on EST, while another source stores time based on PST. All these factors need to be identified and rectified before the data can start flowing from the source solutions to the BI cube.
On the positive side, almost all source data is stored in relational databases these days. On the negative side, each vendor has their own unique "flavor" of relational database. This means that each vendor has a different API (Application Program Interface - a blueprint for building a source/ETL integration) , and they even have different APIs for different versions of a particular database. We live in a heterogeneous world, so building these connections is almost always a major component of any BI project. An organization can build up a strong corporate knowledge base over the years and implement some repeatable processes, but with the number of databases that are storing data for all these solutions (MS-SQL Server, Oracle, mySQL, and innumerable legacy databases), implementing ETL effectively requires a great deal of experience and skill.
Data Quality in BI initiatives usually starts upstream - it originates with the source applications. Missing fields, incorrect fields, and inconsistent fields (e.g. "CT", "Conn", and "Connecticut" all in the same field representing Connecticut) are some of the more common problems. Often a BI Analyst will spend hours, days, even weeks, writing scripts to clean up problem fields.
To fix the problem permanently, the source systems, often legacy systems, need to have data validation added to their code. This could be anything from only letting dates be entered in "mm/dd/yyyy" format on a form, to requiring that all fields on a form be completed before a form can be submitted.
Why do source applications lack proper validation? Well, they're often legacy systems that date back to the days when validation was either difficult or impossible. The other reason is that some enterprise solutions start life as small departmental solutions. A system that was designed for five users who work in the same team ("who needs validation when it's just us using it?") may get rolled out to the whole organization because of its high business value. In these cases, the small systems will show unanticipated weaknesses, like lack of data validation, when they're used at a scale they were never designed for.
Like building ETL infrastructure, improving data quality is a highly skilled, time consuming, high touch endeavor.
What to Do?
ETL and Data Quality are obviously not easy fixes considering they've consistently remained at the top of the "BI Issues" Hot 100 since the dawn of Business Intelligence. Tools for dealing with these issues are getting better, but there needs to be more cooperation across the industry to enable more automation, and better tools. It would be great if ANSI or ISO could improve consistency and standards around relational databases to the point where a unified database API could be deployed.
The data economy requires a robust ETL infrastructure and high data quality to enable all these fancy dashboards, reporting, and analytics that are coming down the pipe. Some of our best minds should be working on these infrastructure issues to make sure the industry can deliver on the ever growing need for BI solutions.