Principles provide quick orientation. Like a compass, they point the way. With a compass, there is often an obstacle on the direct path to the goal. With principles, too, the solution does not always lie on the direct line, and obstacles have to be skilfully overcome. Conveniently, the ways around those obstacles are also standardised. The principle delivers what it promises: quick orientation with a sustainable solution.
The advantage of clear principles in an architecture is that the resulting software products are easier to understand. A user interface, for example, is only understandable if the ‘look and feel’ is standardised and follows clear principles. Whenever something does not follow the normal pattern – the principle – the user is irritated and takes longer to find the solution.
These four principles provide very good orientation for a data warehouse:
- Separation of Concerns and its close relative the Single Responsibility Principle
- Always load all data
- Single Version of Facts – Multiple Truths
- Calculate KPIs only once
The basis of an IT architecture is the separation into independent tasks. There are several well-known principles for this: Separation of Concerns by Edsger W. Dijkstra (https://en.wikipedia.org/wiki/Separation_of_concerns) and the Single Responsibility Principle by Robert C. Martin (https://en.wikipedia.org/wiki/Single-responsibility_principle). This is not just about breaking functionality down into smaller portions. It is about dividing the functions into meaningful blocks so that each has only one purpose at its core. The Single Responsibility Principle stems from class design in object-oriented development, and its definition explicitly speaks of ‘only one reason for a change’.
The division is therefore based on concerns. For a data warehouse or any other form of data preparation, there are the following three concerns:
- Gather: First, the data is collected.
- Apply: The data is then processed on this collected basis, i.e. it is corrected and new insights are drawn from it.
- Provide: The expanded data basis is then made available to different recipients in different forms and ways.
This division should not be confused with the debate about the number of layers in a data warehouse; it is not a plea for three layers. It is only a question of which concerns are pursued.
Each of the three steps follows its own principle.
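To make this separation tangible, here is a minimal sketch in Python; the function names, the record structure, and the business rules are purely illustrative assumptions, not part of any specific product.

```python
# Illustrative sketch: each concern lives in its own function and therefore
# has only one reason to change (Single Responsibility Principle).

def gather(source_rows):
    """Collect the data as delivered - no corrections, no business logic."""
    return [dict(row) for row in source_rows]

def apply_rules(raw_rows):
    """Correct the data and derive new insights - the only place business rules live."""
    enriched = []
    for row in raw_rows:
        row = dict(row)
        row["country"] = (row.get("country") or "UNKNOWN").upper()   # correction
        row["is_domestic"] = row["country"] == "DE"                  # new insight
        enriched.append(row)
    return enriched

def provide(enriched_rows, consumer):
    """Deliver the data in the form and scope a specific recipient needs."""
    if consumer == "marketing":
        return [row for row in enriched_rows if row["is_domestic"]]
    return enriched_rows

# The pipeline only wires the three concerns together:
result = provide(apply_rules(gather([{"country": "de"}, {"country": None}])), "marketing")
```

A change to a correction rule touches only apply_rules; a new recipient touches only provide.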
Data Integration – Always load all data
99.999999% Up-Time of the process – failing only on hard-errors. A process should not fail because of DATA errors.
Linstedt, Dan. The Official Data Vault Standards Document (Version 1.0) (Data Warehouse Architecture). DanLinstedt.com. Kindle Edition.
Always load all data means also loading faulty data. At first this sounds like a bad idea. On closer inspection, it quickly becomes apparent that faulty data can hardly be detected.
If the data has the right structure, i.e. it can be loaded into database tables, then correct data cannot be distinguished from incorrect data by technical means alone. From this point on, the quality of the data can only be judged by its content, i.e. against explicit data quality rules.
Imagine that a record is not loaded because an important attribute is not filled. Even evaluations that do not need this attribute at all are now wrong, because the record is missing. If you correct or exclude individual attributes or records, you change the data and thus the result of the evaluation. If it turns out afterwards that the data was correct in the source system after all, the original state of the data must be restored. This can lead to a complete reload of the data and take a lot of time. If, on the other hand, the correction is made after loading the data, only the correction has to be repeated.
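As a minimal sketch of this idea (table and column names are invented for illustration), the loading step keeps every record and merely flags the ones that would otherwise have been rejected:

```python
# Sketch: load every record; never reject a row because of a missing attribute.
incoming = [
    {"order_id": 1, "payment_method": "credit_card", "amount": 100.0},
    {"order_id": 2, "payment_method": None,          "amount":  40.0},  # "faulty" record
]

loaded = []
for row in incoming:
    row = dict(row)
    # Flag the issue instead of filtering the record out of the warehouse.
    row["dq_missing_payment_method"] = row["payment_method"] is None
    loaded.append(row)

# Evaluations that do not need the payment method still see order 2,
# and a later correction only has to touch the flagged records.
total_amount = sum(row["amount"] for row in loaded)   # 140.0 instead of an incomplete 100.0
```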
Without a clear statement on data quality, no good decision can be made here. The assessment of data quality can change due to a change in the source system or due to a change in data capture or in the accompanying processes. The assessment of data quality is a business task.
The quality of the data is therefore not the responsibility of the data warehouse, and a correction should take place in the source system wherever possible. Without the full context of the individual case and without the possibility to consult, for example, the customer concerned, a correction of data is always just a guess. And even a well-founded guess is still a guess.
A fundamental improvement in data quality can only be achieved at the source, when the data is created. Data quality can be monitored in the data warehouse. However, this can only be done at the request of the business department. The data quality processes should not slow down the provision of data.
This principle has a great impact on the handling of data in the data warehouse. Whenever a procedure makes certain assumptions about the data, these assumptions must be checked. An example: customers in the web shop are to be evaluated. For this, a customer score is needed, which is calculated from the payment method and other attributes. In this case, it cannot be assumed that a payment method is always available, even when a purchase has been completed. When calculating the score, ‘No payment method available’ must be handled explicitly, and an entry ‘No payment method available’ must exist in the payment method dimension. This is the only way to ensure a correct calculation in the coming years. If this assumption is also important for the department, an additional evaluation of the data quality can provide information about missing or incorrect payment methods in general.
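A small sketch of this handling (the score weights and dimension keys are invented) could look like this:

```python
# Sketch: treat the absence of a payment method as an explicit, first-class case.
PAYMENT_DIMENSION = {                       # surrogate keys of the payment method dimension
    "credit_card": 1,
    "invoice": 2,
    "no_payment_method_available": 0,       # explicit member for the missing case
}

SCORE_WEIGHTS = {                           # invented weights for illustration
    "credit_card": 30,
    "invoice": 10,
    "no_payment_method_available": 0,
}

def customer_score(payment_method):
    key = payment_method if payment_method in SCORE_WEIGHTS else "no_payment_method_available"
    return PAYMENT_DIMENSION[key], SCORE_WEIGHTS[key]

customer_score("credit_card")   # (1, 30)
customer_score(None)            # (0, 0) - handled explicitly, the load does not fail
```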
Based on the number of errors in the data underlying a report, a data quality score can be added to the report or to the fact table. With this score, the report consumer knows how reliable the information is.
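One possible, deliberately simple form of such a score is the share of error-free records behind the report, attached to the delivered result (the flag name follows the sketch above and is an assumption):

```python
# Sketch: attach a simple data quality score to a delivered fact set.
fact_rows = [
    {"order_id": 1, "dq_missing_payment_method": False},
    {"order_id": 2, "dq_missing_payment_method": True},
    {"order_id": 3, "dq_missing_payment_method": False},
]

clean = sum(1 for row in fact_rows if not row["dq_missing_payment_method"])
report = {
    "total_orders": len(fact_rows),
    "data_quality_score": round(clean / len(fact_rows), 2),   # 0.67 -> two thirds of the rows are clean
}
```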
The data quality score also underlines that data quality is not the responsibility of the data warehouse. If the loading procedures into the data warehouse need to be adjusted, this can be done at leisure after the reports have been made available: the conflicting cases are known, and the evaluations that are not affected by them remain available.
Data Valorisation – Single Version of Facts – Multiple Truths
This principle goes back to Ronald Damhof. In his article ‘The next generation EDW – Letting go of the idea of a single version of the truth’, his basic statement is: if there is no single view of the data in a company, why should it be implemented in the data warehouse of all places? His suggestion:
The goal is therefore to create a system, i.e., a ‘single version of the facts’, which is open to individual interpretation.
This principle builds on the first principle and extends it. The data is stored in a historical form without any changes to the content. Adjustments are then made to the data content on this uniform version of the facts. The data becomes more valuable, hence the term data valorisation.
The data is adapted only at the request of the business department, regardless of whether it is a correction or a preparation of the data. Moreover, if two departments cannot agree on a single solution, each department gets its own solution. Each DWH customer gets the data it needs to do its job. The BI team cannot make any decisions about the usefulness of business requirements.
The beauty of this solution: if the specifications for the calculation change, the complete data does not have to be reloaded. The uniform version of the facts remains unaffected. Only the parts affected by the change have to be loaded. This saves a lot of time when implementing the changed specifications.
Data Valorisation processes can become very complex. Individual process steps might build on each other, might have interdependencies or might be completely independent. Here is a possible example of such calculations: One part of the evaluations is done on the generalized partner model, while another department extracts customers and suppliers from the partner model.
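As an illustration only (a generalised partner table with a role attribute is assumed here), the two interpretations could be derived from the same facts like this:

```python
# Sketch: one department evaluates the generalised partner model,
# another derives customers and suppliers from the very same facts.
partners = [
    {"partner_id": 1, "name": "Acme",    "roles": {"customer"}},
    {"partner_id": 2, "name": "Globex",  "roles": {"supplier"}},
    {"partner_id": 3, "name": "Initech", "roles": {"customer", "supplier"}},
]

# Department A: evaluations directly on the partner model.
all_partners = partners

# Department B: its own interpretation - customers and suppliers as separate entities.
customers = [p for p in partners if "customer" in p["roles"]]
suppliers = [p for p in partners if "supplier" in p["roles"]]

# Both interpretations build on the same, unchanged single version of the facts.
```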
The ‘single version of facts’ principle avoids long discussions about uniform data with the respective business departments. In addition, maintenance and further development are much easier when specifications are changed, as only the relevant parts need to be adapted and reloaded.
Data Delivery – Calculate KPIs only once
Regardless of the actual type of reporting, a data warehouse always has the task of measuring something by its dimensions. It is always about key figures and dimensions, regardless of whether the data is stored in a large flat file or in a cube.
KPIs should always deliver the same values, regardless of the tool used and the actual report. If a KPI is calculated only once, there can be no deviation. Unfortunately, this is not possible for all key figures. Ratios, for example, can only be calculated at evaluation time, since numerator and denominator depend on the chosen dimensions and must each be summed up separately before the division.
Other measures are created through aggregation to specific dimension values. Like the ratios, these measures are based on existing measures. The calculation is relatively simple and, in addition to aggregation, involves only basic arithmetic. The basis for all these calculations are the measures calculated from the data structures of the DWH (base measures). By selecting only these base measures from the DWH and then using them as the basis for all other measures, you avoid multiple calculations and keep deviations to a minimum.
The separation of key figures according to their origin:
- Base measures – are calculated from the core warehouse for the fact tables. These are KPIs such as the number of articles sold or the turnover.
- Derived measures – are always calculated from base measures, other derived measures, and dimensions, for example turnover with young people or turnover with pensioners.
- Calculated measures – are best calculated in the frontend reporting tool. Either because they are ratios or because they can only be calculated if the context (dimensions) is clear.
In a dimensional model, facts are stored at the smallest possible granularity. Thus, the base measures are ideal for the original fact tables. The derived measures are created through aggregation and combinations of these fact tables. Ideally, the derived measures are already included in the export to the front-end tools. This way, only the calculated measures are determined during the evaluation.
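A compact sketch of the three kinds of measures (all figures, dimensions and thresholds are invented) also shows why ratios have to be summed up before the division:

```python
# Sketch: base measures are calculated once from the fact table, derived measures
# build on them, and ratios are left to evaluation time.
facts = [
    {"region": "north", "age_group": "young",     "revenue": 100.0, "orders": 10},
    {"region": "north", "age_group": "pensioner", "revenue":  60.0, "orders":  5},
    {"region": "south", "age_group": "young",     "revenue": 200.0, "orders": 25},
]

# Base measures: calculated once from the fact rows.
revenue = sum(f["revenue"] for f in facts)                                      # 360.0
orders  = sum(f["orders"]  for f in facts)                                      # 40

# Derived measure: a base measure restricted to a dimension value.
revenue_young = sum(f["revenue"] for f in facts if f["age_group"] == "young")   # 300.0

# Calculated measure (ratio): sum numerator and denominator first, then divide.
avg_order_value = revenue / orders                                              # 9.0

# Averaging pre-calculated per-row ratios would give a different, misleading figure:
misleading = sum(f["revenue"] / f["orders"] for f in facts) / len(facts)        # 10.0
```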
Nevertheless, in such a system there can be deviations in the KPIs across the various front-end tools. Especially when Excel is used excessively with many custom calculations, the risk of deviating measures is high. Publishing the base measures as simple benchmark values from the first fact tables allows every consumer of the data to check quickly and easily whether a given evaluation is plausible.
Follow-up work on these principles
Principles are useful because they can be used to derive implementation decisions. Ideally, these decisions lead to the same patterns in the implementation. This effect can be strengthened by incorporating the principles into an architecture. This requires mapping the three phases to layers, since the persistence of data can bring its own requirements. For this architecture, the exact steps that are necessary during a phase or for the creation of a layer must then be defined. This definition is again done in the spirit of the Separation of Concerns.
This assignment of tasks to a layered architecture is the central element of the DWH development guideline. It describes how the principles are to be adhered to. In this way, the development guideline ensures that the team does not always have to deal with the same problems during implementation.
The development guideline thus defines a data preparation process and an implementation process. Based on these processes, the continuous improvement of the DWH team's productivity and quality can be managed transparently.