Thursday, May 25, 2017

Keep it Fresh and Easy

I've seen many data warehouses fail due to reconciliation problems, poor performance, time lag and a lack of data confidence, all of which could have been avoided with better designs.

Data typically comes into data warehouses in the format of the source system. I like to call this first layer of data the "base data". In my view it should have a similar structure to, and be easy to reconcile with, its source system. A good practice is to frequently run reconciliation jobs that compare your base data with the source system data. This gives confidence that the base data in the warehouse accurately represents the source systems.

If you go on to build materialised fact and dimension tables from these base data tables, you complicate the reconciliation of that information with the source systems. Consider the common issue of a user questioning a number in a report. To reconcile that figure back to the source systems, you'll need to interpret the ETL code that built the fact/dimension tables (over and above the ETL code that imported the base data). This code is not with the data. You'll need to determine which ETL (SSIS) packages and which SQL procedures created/updated the tables. Even when you find the ETL package, you cannot be sure there was not another process or manual update that changed the data, and the packages/procedures may have changed since the data was loaded. It really can be challenging to reconcile these materialised fact/dimension tables back to the source systems, and if you cannot reconcile the report, you'll struggle to win data confidence with the users. Note that the base data tables holding dimension data are typically stored as Type 2 dimensions; they are still easy to reconcile with source systems (IsCurrent=1), even though they hold historical information.
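To make that base data reconciliation idea concrete, here is a minimal sketch of what such a job might check: a row count and a control total compared between a source system table and its base data copy. All names here (the linked server, tables and the Amount column) are hypothetical.

    -- Hypothetical check: compare a row count and a control total between the
    -- source system table (reached via a linked server) and the base data copy.
    SELECT
        (SELECT COUNT(*)    FROM SourceServer.SalesDb.dbo.SalesTransaction) AS SourceRowCount,
        (SELECT SUM(Amount) FROM SourceServer.SalesDb.dbo.SalesTransaction) AS SourceAmountTotal,
        (SELECT COUNT(*)    FROM dw.base_SalesTransaction)                  AS BaseRowCount,
        (SELECT SUM(Amount) FROM dw.base_SalesTransaction)                  AS BaseAmountTotal;

Differences can be logged and alerted on, so data confidence is continuously earned rather than asserted.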

Now consider simplifying data lineage by creating views (over the base data tables) for the fact and dimension "tables". You can unmistakably reconcile a figure in the view with the base data tables, and those base data tables are easy to reconcile (and have been) with the source systems. All you need to do is read the logic in the view. So, when a user says, "I don't think the number should be 995", you can explain exactly where the 995 comes from in the source system(s).
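As a minimal sketch of the idea (table and column names are hypothetical), a dimension "table" can simply be a view over a Type 2 base data table, with all of its transformation logic readable in one place:

    -- Hypothetical dimension exposed as a view over a Type 2 base data table.
    CREATE VIEW dw.DimCustomer
    AS
    SELECT
        c.CustomerBusinessKey,
        c.EffectiveFromDate,
        LTRIM(RTRIM(c.CustomerName)) AS CustomerName,   -- cleansing logic lives here, with the data
        CASE c.SegmentCode
            WHEN 'R' THEN 'Retail'
            WHEN 'W' THEN 'Wholesale'
            ELSE 'Unknown'
        END AS Segment,
        c.IsCurrent
    FROM dw.base_Customer AS c;

When a figure is questioned, the CASE expression and the base table are the whole story; there is no separate ETL package to track down.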

Sometimes it is challenging to use views to construct user-facing fact/dimension "tables". The most common challenge with views (containing all the necessary transformation logic) is performance. Fortunately, there are a few techniques you can use to mitigate the cost of the transformation logic. Note, I would always encourage DW developers to use views early on when prototyping business intelligence, as they are extremely flexible and support customisation while users give feedback on what information/transformations they really need.

Techniques to improve the performance of fact/dimension "tables" built as views:
  1. The biggest driver for materialising fact/dimension tables is to generate and store surrogate keys (PKs in dimensions and FKs in facts) for the Type 2 dimensions. It is very costly to join a fact table to a Type 2 dimension on a business key plus a fact date that must be range-scanned against the dimension's effective dates. However, consider storing the dimension's EffectiveFromDate in the fact table when the fact row is loaded. This is a one-off cost, as the fact data is "incrementally" loaded from the source system, and it allows the fact table to be joined to the Type 2 dimension with an equi-join, i.e. dim1.BusinessKey=fact.BusinessKey AND dim1.EffectiveFromDate=fact.dim1_EffectiveFromDate (see the sketch after this list). Note, this does require that your cube/BI tool supports composite primary keys. Surely all good cube technologies support composite keys?
  2. Judicious use of left outer and inner joins. A left outer join can be eliminated by the optimiser if the query does not request columns from the right hand table and the right hand table has a primary key on the join columns, i.e. the SQL optimiser knows each left side row can join with at most one right side row, and since that row isn't used, the join is unnecessary. So, you should ensure the right side tables of outer joins have primary key (or unique) constraints (see the sketch after this list).
  3. Create a clustered or filtered index on the current records of Type 2 tables. If you have a flag column, IsCurrent, then consider using it as the first part of the clustered index, or creating a filtered index WHERE IsCurrent=1. That way, any views that only need the current (Type 1 style) records will find them together on contiguous pages (see the sketch after this list).
  4. If you are merging data from multiple source systems, your fact and dimension views will UNION ALL the corresponding table from each source system and include a SourceSystem column, which is simply a literal in each branch of the view. Having it as a literal takes no space and costs the query nothing. When the user filters on SourceSystem='Source System 1', the SQL optimiser can see the literal in the view and will only read rows from that branch of the UNION ALL (see the sketch after this list).
  5. Use a cube technology, such as SSAS, that reads the fact data once and thereafter incrementally, which avoids repeatedly reading the potentially expensive views.
  6. Use columnstore indexes on the base data tables, or at least PAGE compress large transaction tables.
  7. Commonly used transformations can be computed columns in the base table. For example, rather than seeing 'Db' or 'Cr', some users might prefer to see 'Debit' or 'Credit'. The code for this is very simple, and it could be a computed column in the base table, taking no storage space and costing almost nothing to display; it is simply computed when a query requests the column (see the sketch after this list). Every developer using the base table will see the computed column, so there is a high probability the code for DebitCredit will live in one place and be reused. The only performance cost is where a user filters by this column, as it cannot be indexed unless it is persisted. Note, it is possible to mark the column as persisted, in which case an index could be built on it, although an index on DebitCredit would typically not be useful due to its very low cardinality. Some of these computations will be more complex than a simple debit/credit, but having them as computed columns makes the transformation unmistakable: the code is right there with the data, not updateable by some random procedure or update process. Further down the line a computed column could be persisted and indexed with no change to how views use it.
  8. Date tables often have lots of computed columns for WeekdayNo, WeekdayName, MonthNo, MonthName, Year etc. (see the sketch after this list). There is no need to have these columns persisted unless they are commonly used as filters and the SQL optimiser would use an index on them. If, at a later stage, you determine that it would be beneficial to persist WeekdayName, for example, this would require an update to the table to convert WeekdayName from a computed column to a physical column, as WeekdayName is not deterministic; that change would not alter how views and queries use the column, only how it is maintained.
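
To illustrate technique 1, here is a sketch (with hypothetical names) of a fact view joining to a Type 2 dimension with an equi-join, because the fact row was stamped with the dimension's EffectiveFromDate at load time:

    -- Hypothetical fact view: dim1_EffectiveFromDate was stored on the fact row
    -- when it was incrementally loaded, so no date range scan is needed here.
    CREATE VIEW dw.FactSales
    AS
    SELECT
        f.SaleDate,
        f.Amount,
        d.CustomerBusinessKey,                              -- composite key exposed to the cube:
        d.EffectiveFromDate AS Customer_EffectiveFromDate   -- (BusinessKey, EffectiveFromDate)
    FROM dw.base_Sale AS f
    INNER JOIN dw.base_Customer AS d
        ON  d.CustomerBusinessKey = f.CustomerBusinessKey
        AND d.EffectiveFromDate   = f.dim1_EffectiveFromDate;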
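
For technique 2, a sketch of join elimination, assuming dw.base_Customer has a primary key on (CustomerBusinessKey, EffectiveFromDate):

    -- Hypothetical wide fact view with a left outer join to the dimension.
    CREATE VIEW dw.FactSalesWide
    AS
    SELECT
        f.SaleDate,
        f.Amount,
        d.CustomerName,
        d.SegmentCode
    FROM dw.base_Sale AS f
    LEFT OUTER JOIN dw.base_Customer AS d
        ON  d.CustomerBusinessKey = f.CustomerBusinessKey
        AND d.EffectiveFromDate   = f.dim1_EffectiveFromDate;
    GO

    -- This query requests no customer columns, and the primary key guarantees at most
    -- one matching dimension row, so the optimiser can remove the join altogether.
    SELECT SaleDate, SUM(Amount) AS TotalAmount
    FROM dw.FactSalesWide
    GROUP BY SaleDate;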
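
For technique 3, a sketch of a filtered index (hypothetical names again) that keeps the current rows of a Type 2 base table together:

    -- Hypothetical filtered index covering the common WHERE IsCurrent = 1 pattern.
    CREATE NONCLUSTERED INDEX IX_base_Customer_Current
        ON dw.base_Customer (CustomerBusinessKey)
        INCLUDE (CustomerName, SegmentCode, EffectiveFromDate)
        WHERE IsCurrent = 1;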
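
For technique 4, a sketch of a multi-source dimension view with the SourceSystem literal in each branch:

    -- Hypothetical multi-source dimension. When a query filters on
    -- SourceSystem = 'Source System 1', only that branch of the UNION ALL is read.
    CREATE VIEW dw.DimProduct
    AS
    SELECT 'Source System 1' AS SourceSystem, p1.ProductCode, p1.ProductName
    FROM dw.base_Product_Source1 AS p1
    UNION ALL
    SELECT 'Source System 2' AS SourceSystem, p2.ProductCode, p2.ProductName
    FROM dw.base_Product_Source2 AS p2;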
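
For technique 7, a sketch of the Db/Cr decode as a computed column (the table and code values are hypothetical):

    -- The decode lives with the data; it takes no storage and cannot be changed
    -- by a stray procedure or update process.
    ALTER TABLE dw.base_GLTransaction
        ADD DebitCredit AS CASE DbCrCode
                               WHEN 'Db' THEN 'Debit'
                               WHEN 'Cr' THEN 'Credit'
                               ELSE 'Unknown'
                           END;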
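
And for technique 8, a sketch of a date table with computed calendar columns; note that the weekday and month name expressions depend on session settings, which is why they cannot simply be marked PERSISTED:

    -- Hypothetical date table: the calendar attributes are computed, not stored.
    CREATE TABLE dw.base_Date
    (
        DateKey     date NOT NULL PRIMARY KEY,
        [Year]      AS DATEPART(YEAR, DateKey),
        MonthNo     AS DATEPART(MONTH, DateKey),
        MonthName   AS DATENAME(MONTH, DateKey),    -- language-dependent, so not persistable
        WeekdayNo   AS DATEPART(WEEKDAY, DateKey),  -- depends on DATEFIRST
        WeekdayName AS DATENAME(WEEKDAY, DateKey)   -- language-dependent, so not persistable
    );
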
People who know me know of my affinity for virtualisation in data warehousing. I have had a lot of success creating virtual data warehouses directly on top of OLTP databases. Unfortunately, many textbooks have been bought and sold on data warehouse designs for "star schemas", "snowflake schemas", etc. These concepts certainly have their place, but such vast data materialisation has significant costs. If you can deliver information to users without rematerialising facts and dimensions, you are generally on a winner. I have helped build business intelligence solutions with cubes that consume views over OLTP transaction data for KFC, Long John Silvers, McDonalds, Pizza Hut and Starbucks, to name a few. Since the cubes are incrementally processed, the overhead on the OLTP application is low, latency is low, data confidence is high, and the information is vast. These data warehouses have been constructed by very small teams, as there is simply much less work to do. As an added benefit, they are extremely flexible due to their simplicity and transparency.

If I were to draw an analogy, it would be with a salad bar. Creating a set of salads in advance would be like building fact/dimension tables, while making bespoke salads to order would be like using views. If a salad has been built in advance, it is difficult to be 100% sure of the ingredients: you need to know the recipe, and you need to be confident someone hasn't added ingredients subsequently. The prebuilt salad also gets stale from the moment it is built. Taking the analogy even further, getting fresh vegetables from the garden to make your salad is like building a virtual data warehouse directly over the OLTP database. I prefer bespoke salads, as they are always fresher and you know exactly what's in them.
