“Anywhere between 40% to 90%”. These are unflattering numbers from various past studies about the percentage of Data warehousing projects, which either has failed to achieve all the objectives that it started out to achieve or has failed altogether and shelved completely. From experience, the root cause of most of these failures lie at the very initial conceptualizing of the project. The most fundamental questions are not asked and the urge to deliver a BI solution precedes any logical comprehension of the actual requirement. This is what I call “Putting the cart before the horse” syndrome that affects most of the BI projects.

Introduction

What constitutes this project? This is not an uncommon dilemma at the start of the data warehousing project. The answer to this very important question, from my experience, makes or breaks the entire implementation. I have seen projects run for two years and then eventually break because the answer to this question was not adequately answered. This answer tends to be a bit tricky in Data warehousing implementations , as in the real world , we will be dealing with multiple technologies , integrating in different ways to get the desired project results. In this paper I will try to put together an outline methodology that can help kick start the project.

Define and Scope

What exists already

The ground work is the most important aspect to a successful project. No fancy technologies should be proposed here. Absolutely no jargon here please.

· A simple pen and paper would be your tool of choice.

· Listening and patience will be your virtues.

· Silly questions (trust me won’t look all that silly later) will be the agenda for the day.

Try finding what already exists and what the pain points are. You do not want to be delivering an inferior solution at the end of thousands of dollars and thousands analysis report.

Define Data warehousing scope

Data warehousing is an ocean of capabilities and scope. Varied aspects like setting up the database, creating a dashboard, embedding predictive analysis, optimizing or scheduling all come under the purview of the data warehousing umbrella. Broadly the project may involve any or all of the following

Sizing and database design
Define the data warehousing Model (generally a star schema)
Populating the data warehouse using ETL (or ELT)
Representing the data to the users using BI tools
Creating of aggregation content and optimization

It is very important to clearly define what concepts form a part of this project. This is also the stage where you will map technologies with the requirement. Remember technology must never be a yard stick to restrict the necessary deliverables for a business. In today’s scenario where one technology can seamlessly talk to another, deficiencies can easily be overcome.

Design

Data to Dashboards

This is the stage we start talking code and technical jargon. But before you jump in and start recruiting developers, take a pause and assimilate all the study that is already taken place. These are the questions you need to be clear on

Where is the data coming from?
Where does the data ultimately have to go?
How many parallel threads of development can happen at the same time and what are their dependencies?
Most importantly, what are the success criteria’s from each of the smaller code pieces? Once the smaller pieces are taken care of, generally the bigger picture becomes clearer.

Define Sources and Target

I would consider this snippet as the most important piece of definitions in the context of the entire project. Just listing of the sources and targets are not sufficient here. This is where you will get answers regarding complexity, appropriate staffing plans and the correct testing methodology among others. Let us first split the two and see the critical questions that you need to ask and the parameters that need to be factored in.

Source System

How fragmented is the source?
How many technologies are involved in a source? It can range from simple flat files to complex web services.
How good is the source data in terms of quality? If the incoming data quality is very bad, data quality mechanisms need to be put in place to get meaningful results.
Particularly true when dealing with flat files, you may need to massage the file to get appropriate results. You may choose to do so in the ETL process, or you may do this upfront in the OS itself. For example if these transformations needs to be carried out in a Unix OS, having a resource skilled in UNIX handy, never hurts.

Target System

It is likely you need not build a target system itself. It is likely the BI system access these data sources directly rather than accessing an intermediate data warehouse. This is a decision that would have already been taken in the previous steps. The important factors effecting this decision are mostly likely cost and time for the implementation. In these cases OBIEE's Common Enterprise Information Model is your knight in shining armor. Of course you need to be extra careful while designing you metadata for a OBIEE Presentation layer.

Data warehouse or No Data warehouse, this should lead to a common goal - "This system must answer all the business questions in a reasonable amount of time". This is what should drive a Business Intelligence project, no matter what decision is taken. Probably you should write this down in a post-it and stick it in your monitor. You should then first have a look at this post-it, ponder and then answer the question the developer just asked you.

It is advisable you freeze the target design before you start a parallel thread of designing a BI, else you will end up with a non ending cycle of development.

ETL -Define Transporting Mechanism

This is almost the most undervalued aspect of a data warehousing implementation. More emphasis is always laid on the colorful BI piece of the project and ETL is often overlooked. This may later prove to be single point of failure in the fag ends of the project. Hence extra care needs to be taken at this stage. Here is my suggested simple methodology for design.

Create the first cut of the Data warehouse aka star schema
Figure out how each of the properties defined in the entities would get their data i.e. a Source to Target Mapping
Next look at the post-it you created above. Try to map your business question to actual data and see if that makes sense. If you get all of the questions the very first time, kudos to you, but it is seldom the case.
Redefine your data warehouse
Go back to step 2

At the end of the cycle you must have the fixed source system for your BI front end. During this design you may want to think about Materialized views or Cubes to store aggregate data .Strategies to populate these structures have to be documented here as well.

Do look up data warehousing jargon like CDC, SCD, ELT, Trickle Feed, Partitioning, MDM, Conformed dimensions etc. They may sound complex, but in reality they are very small but important concepts that encapsulate the real world scenarios.

(B)I-cing in the cake

All the hard work done, now you can start with the BI part of your project. This must generally not give you a lot of sleepless nights if all the previous steps are done right. I point to stress here is that if your number of rows that get displayed in the dashboard scrolls more than one page, this is possibly not the solution for you. If you are looking at static kind of reports, you must consider solutions like BI Publisher that are more suited for static reporting. Your BI dashboard should be aimed to cater to immediate and intuitive decision making and must not involve scrolling and interpreting data as in older times.

I have to talk about one question that keeps getting asked repetitively in my session, particularly OBIEE. Where do I define the business transformations? Not a simple question to answer. My answer is always "Depends". Here are the considerations.

Define in the BMM layer of your RPD if the business logic is Complex and reused by many requests
If a logic is a one-off, it makes sense to define it in the front end itself
Not so famous option is to embed this in the Data warehouse itself and populate this in the ETL stage. Use this option if your Business logic remains fairly constant. Also if these derived metrics are used across BI solutions, it is worth doing this extra effort in the backend.

Finally define "What to Access" and "Who will access”. This is taken care of by implementing Security. You really don’t want your colleague to access the HR database and find out your real age, do you? Just Kidding! But in reality only information relevant to the privileges assigned to the user must be displayed.

Remember two things here Authentication and Authorization. These mean 2 different things. Things like LDAP, ADSI and Custom are some of the Authentication methods. VPD and Dashboard web groups signify Authorization.

Conclusion

As we have seen, data warehousing is a process and just not about beautiful dashboards and colorful graphs. There is nothing called a perfect data warehousing solution that fits all. There are not right or wrong answers here and this makes the job of the architect all the more important. I firmly believe a data warehousing technologist is more like a Psychiatrist, only that we earn a lot lesser. The job is more about listening and guiding the customer to the obvious BI solution to achieve their ultimate business KPI's.

Just remember your planning decides if your data warehousing solution stands the tide of time and runs for ten years or your project runs for three years, runs out of funding and closes down .

Happy data warehousing!!!!

Business Intelligence does not need Intelligence

Pages

Tuesday, June 8, 2010

Data warehousing Architect – A Beginners Guide