Data Vault has high goals and promises many improvements:
- Sustainable: The core warehouse remains stable through data integration via business objects/business keys
- Expandable: Hub, link and satellite provide an easily expandable model that limits the knock-on effects of model changes
- Fast results: By loading unconsolidated data into the Raw Mart, data can be analysed immediately
- Minimal modification effort: Because consolidation takes place only in the Business Vault, business rules can be changed quickly without having to reload all data
- Diversity: In the Business Vault, different views or business rules can be implemented based on the same facts
- Speed: The consistent implementation of the Data Vault architecture allows a high degree of automation
- Throughput and parallelism: Hash keys ensure high utilisation of parallel processes
Unfortunately, not all Data Vault initiatives can deliver these benefits. What is missing?
Data Vault is a different approach to BI. What exactly employees perceive as different depends on what they have experienced before. ‘Perceive’ is the right word here, because aspects of Data Vault are often not perceived as new at all. People have been doing it this way, or in a similar way, for years, and so they overlook details. Important details. More on this later.
Thus a Data Vault initiative becomes a change process. There is a simple formula for implementing change processes successfully: training, support from decision-makers and the targeted use of employees with Data Vault experience. However, this answer is too simple. So what exactly are the challenges of switching to Data Vault?
Challenge 1: Data Vault is different
Data Vault is different. It is a mixture of the new and the old that has been inserted into a fixed structure. All activities from the stage to the Data Mart are to be considered as a whole and coordinated. Sometimes local advantages cannot be used, because otherwise advantages on another layer in the warehouse are lost.
But because much of it is familiar, and some of it has been a real lifesaver in past projects, the ‘I have an improvement on Data Vault’ effect often sets in, even before the method is fully understood. The first publications on Data Vault appeared as early as 2000, and although it is still quite new in some countries, there are many successful implementations. It works.
And yes, it is constantly being developed further. Nevertheless, at the beginning of a Data Vault career, the main question to ask yourself is: How can I solve the problem with Data Vault alone? This is a very good way to learn Data Vault really quickly, even if you already have 20 years of BI experience. The perspective should always be: How do I solve this in Data Vault? How do I optimise it within this framework? And am I on the right layer?
Take this time for your first Data Vault implementations. It is worth it.
Challenge 2: Different dialects or beliefs
The urge to improve Data Vault is very strong and has led to many dialects, deviations and beliefs. A telltale sign is that discussions about the right way are conducted with almost religious zeal. In these discussions, absolute statements are often made (‘no n-ary links’, ‘no satellites to links’). Unfortunately, there are few absolute truths in BI.
Fortunately, every measure taken in the context of data integration has clear effects and thus advantages and disadvantages. Absolute statements are strongly reminiscent of populism: simple truths that cannot keep their promise of simplicity in a complex world. Therefore, assess each solution approach by its advantages and disadvantages as well as its effects on the later layers, and only then choose a solution. This approach also has the advantage of massively accelerating the understanding of Data Vault.
Challenge 3: Data Vault is only simple on the surface
Looking at the hub, link and satellite, the impression quickly arises that Data Vault simplifies the data model. And yet it is stricter than 3NF. Underneath, there is a high degree of complexity and classic data modelling, which is particularly noticeable when implementing the links. Links model only m:n relationships, so a change from 1:n to m:n does not result in a change in the data model. In addition, 1:n relationships automatically become m:n relationships anyway because of the historical storage.
Nevertheless, when accessing the data, care must be taken to ensure that only a 1:n relationship applies at any one point in time. The validity of links, i.e. the mapping of time, therefore gains in significance. If errors occur here, joining via these links can multiply the data as if by magic.
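The multiplication effect described above can be illustrated with a small sketch. It is only an illustration under assumptions: the customer, manager and order data are invented, and the validity columns are stored directly on the link rows for brevity, whereas a real Data Vault model would typically keep them in a satellite on the link.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "still valid"

# Hypothetical link rows: (customer_key, manager_key, valid_from, valid_to).
# At any point in time a customer has one manager (1:n), but the stored
# history makes the link m:n.
link_rows = [
    ("C1", "M1", date(2020, 1, 1), date(2021, 1, 1)),
    ("C1", "M2", date(2021, 1, 1), HIGH_DATE),
]

# Hypothetical facts: (order_id, customer_key, amount)
orders = [("O1", "C1", 100.0), ("O2", "C1", 50.0)]


def join_ignoring_validity(orders, link_rows):
    """Join over the link without a time filter: all history rows match."""
    return [
        (order_id, customer, manager, amount)
        for order_id, customer, amount in orders
        for link_customer, manager, _valid_from, _valid_to in link_rows
        if link_customer == customer
    ]


def join_as_of(orders, link_rows, as_of):
    """Join over the link, keeping only rows valid at the date as_of."""
    return [
        (order_id, customer, manager, amount)
        for order_id, customer, amount in orders
        for link_customer, manager, valid_from, valid_to in link_rows
        if link_customer == customer and valid_from <= as_of < valid_to
    ]


naive = join_ignoring_validity(orders, link_rows)
current = join_as_of(orders, link_rows, date(2022, 6, 1))

naive_total = sum(row[3] for row in naive)      # every order counted twice
current_total = sum(row[3] for row in current)  # every order counted once
```

Without the validity filter, each order appears once per historical link row, so the revenue total doubles. With the filter, the join behaves like the 1:n relationship that applies on the chosen date.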
Surprisingly, performance problems can often be solved by cutting the links better, i.e. by reworking the normalisation of the data model. With Data Vault, more focus is placed on data modelling and on understanding the content of the data. This is particularly evident in the modelling of the links.
In addition to the link, there are other challenging topics in the Data Vault that lie dormant beneath the surface. Which topics these are depends on your current state of knowledge. In addition to the Data Vault training courses with certification from Scalefree or Genesee Academy, there are many sources for Data Vault on the web. A compilation can be found at https://datavaultusergroup.de/data-vault/links/. The most active blog about Data Vault is http://roelantvos.com/blog.
Challenge 4: Adaptation of the Data Vault architecture
So while on the one hand the view becomes much more focused on the content of the data, on the other hand it also widens in the direction of technology. The architecture of Data Vault has to be applied to your own technology stack.
In Data Vault’s architecture, it is important to place each concrete action exactly where it is intended to be. Only hard business rules are implemented in the Stage; the data is loaded unchanged into the Raw Vault. Soft business rules, i.e. the changes to the data, are applied in the Business Vault, as are data cleansing, data quality measures and the calculation of basic key figures. In the (Data/Raw/Information) Mart, the data is provided in the desired form. As a result, all complex activities are moved to the Business Vault. How do you keep track of everything there?
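The separation of hard and soft rules can be sketched in a few lines. This is a simplified illustration, not a reference implementation: the record layout, the country mapping and the revenue threshold are invented, and data type alignment is assumed here to count as a hard rule.

```python
def stage(raw_record, load_ts, record_source):
    """Stage: hard rules only. Align data types and add load metadata."""
    return {
        "customer_no": str(raw_record["customer_no"]),
        "country": raw_record["country"],         # content stays untouched
        "revenue": float(raw_record["revenue"]),  # type alignment only
        "load_ts": load_ts,
        "record_source": record_source,
    }


# Hypothetical cleansing table used by the soft rule below.
COUNTRY_CODES = {"germany": "DE", "deutschland": "DE", "de": "DE"}


def apply_soft_rules(staged):
    """Business Vault: cleansing and a basic key figure.

    The result is stored next to the raw data and never overwrites it,
    so a changed rule only requires recomputing this step.
    """
    country = staged["country"].strip().lower()
    return {
        "customer_no": staged["customer_no"],
        "country_iso": COUNTRY_CODES.get(country, "UNKNOWN"),
        "revenue_band": "high" if staged["revenue"] >= 1000.0 else "low",
    }


raw = {"customer_no": 4711, "country": " Germany ", "revenue": "1250.50"}
staged = stage(raw, "2024-01-01T00:00:00", "crm")
derived = apply_soft_rules(staged)
```

The staged record still contains the untrimmed country value exactly as delivered. The cleansed ISO code and the revenue band exist only in the derived record, which is why a change to the mapping or the threshold does not require reloading the source data.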
An architecture in Data Vault should not only contain the distribution of the layers to the respective systems, but also define which actions take place in which order in which layer. Within Data Vault there is some freedom in this respect. This freedom is necessary to make optimal use of the systems involved. There is no single way that runs equally well everywhere.
This architecture and the documentation of the implementation should be defined in development guidelines. In this way, a development environment for the data warehouse is created, and the development team does not have to redesign and reconsider everything for each data source. Adapting the architecture and creating a development environment also requires a separate budget, within a project or across projects.
Challenge 5: Data Warehouse Automation
Regardless of whether a tool for data warehouse automation is used or whether a separate generation of loading patterns is developed: Data Warehouse Automation is only successful if the procedure – the implementation – is standardised.
From the manufacturing industry we know that only what is standardised can be automated. In manufacturing, standardisation and automation are possible even with a high number of variants. Within BI, this process is tough. There is a long tradition of pragmatism: do as little as possible and solve only the current problem. This mindset was born from a tradition of scarce resources and high demand. This pragmatism is a great asset, but unfortunately it also leads to high diversity in the data warehouse and is therefore one of the central sources of the high maintenance costs.
Loading the Raw Vault can be such a standardised process, which can then be quickly and easily automated. The Business Vault built on top of it provides the variants. But this is only the beginning. Other highly automated steps may follow. For this, additional process steps must be unified and standardised. With the solution approach described above, these processes can be further standardised. The basis for this is training (challenges 1 to 3) and standardisation (challenges 2 to 4).
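What such a standardised loading pattern might look like is sketched below. It is a minimal illustration under assumptions: the hub names, the metadata structure `HUB_SPECS`, the key normalisation and the use of MD5 are choices made for this example, and the hubs are plain in-memory dictionaries instead of database tables.

```python
import hashlib


def hash_key(*business_keys):
    """Derive a hash key from business keys (assumed normalisation: trim, upper-case)."""
    normalised = "||".join(str(key).strip().upper() for key in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


# Hypothetical metadata: one entry per hub, listing its business key columns.
HUB_SPECS = {
    "hub_customer": ["customer_no"],
    "hub_product": ["product_no"],
}


def load_hub(hub, key_columns, staged_rows, load_ts, record_source):
    """One generic pattern for every hub: insert only business keys not yet known."""
    inserted = 0
    for row in staged_rows:
        keys = tuple(row[column] for column in key_columns)
        hk = hash_key(*keys)
        if hk not in hub:
            hub[hk] = {
                "business_key": keys,
                "load_ts": load_ts,
                "record_source": record_source,
            }
            inserted += 1
    return inserted


def run_load(hubs, staged_rows, load_ts, record_source):
    """Apply the same pattern to all hubs described in the metadata."""
    return {
        name: load_hub(hubs[name], columns, staged_rows, load_ts, record_source)
        for name, columns in HUB_SPECS.items()
    }


hubs = {name: {} for name in HUB_SPECS}
staged_rows = [
    {"customer_no": "C1", "product_no": "P1"},
    {"customer_no": " c1 ", "product_no": "P2"},
    {"customer_no": "C2", "product_no": "P1"},
]
first_run = run_load(hubs, staged_rows, "2024-01-01", "crm")
second_run = run_load(hubs, staged_rows, "2024-01-02", "crm")
```

Adding a new hub then means adding one metadata entry instead of writing a new loading routine. Since each hash key is computed from the business key alone, the hubs do not depend on each other during loading, and a repeated run inserts nothing new.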
This also applies when using an automation tool. Here in particular, it is important to define the development processes uniformly and to optimise them for the tool. Only in this way can the investment deliver its full benefit. If everyone adds their own add-ons and special solutions, this not only slows down the implementation but also results in a major maintenance problem later on. These add-ons and special solutions must be carried over to each new release.
Actions: the simple answer becomes concrete
At the beginning, the simple answer was, as always in such cases: training, support from decision-makers and the targeted use of employees with Data Vault experience. Reviewing this article, we can derive actions that will make a Data Vault initiative successful.
Action 1: Training
After successful training, whether in self-study or with one of the certified training courses, the aim is to implement Data Vault, i.e. to bring an implementation into the production environment for a concrete use case in the company.
The focus should always be on ‘How do I do this with Data Vault?’. There is a 20-year Data Vault tradition in which solutions can be found. Quick, home-grown extensions to Data Vault usually fail because of unwanted side effects in other layers.
Every solution variant has pros and cons, effort and benefit. Once these are laid out, the implementation decision is easy and unnecessary redesign is avoided.
A simple and fast coordination process is needed for implementation decisions. If a long, drawn-out meeting with everyone involved has to be scheduled, this all too often slows down development and dampens enthusiasm and the will to work. A good description of the solution variants makes this coordination easy, but is often not sufficient on its own.
Targeted experiments are a good way of weighing alternatives against each other. Long discussions are replaced by clear parameters, and at the end there is a proof of feasibility. Moreover, experiments provide the necessary arguments for and against. Andrew Hunt and David Thomas speak of tracer bullets in their book ‘The Pragmatic Programmer’: we fire a first simple shot in that direction and observe whether the solution is viable. For example, two patterns can be measured against each other in direct comparison in a simple implementation.
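A direct comparison of two patterns can be as small as the following sketch. The two deduplication functions and the sample data are invented stand-ins for real loading patterns. In a real experiment, the patterns would run against the actual platform and realistic data volumes.

```python
import timeit


def dedupe_with_set(keys):
    """Pattern A: remove duplicate business keys via a set."""
    return sorted(set(keys))


def dedupe_with_list(keys):
    """Pattern B: remove duplicate business keys via list lookups."""
    seen = []
    for key in keys:
        if key not in seen:
            seen.append(key)
    return sorted(seen)


def compare_patterns(patterns, data, repeat=3, number=10):
    """Check that all patterns agree, then return the best time per call in seconds."""
    outputs = {name: func(data) for name, func in patterns.items()}
    reference = next(iter(outputs.values()))
    for name, output in outputs.items():
        if output != reference:
            raise ValueError(f"pattern {name!r} produces a different result")

    timings = {}
    for name, func in patterns.items():
        runs = timeit.repeat(lambda f=func: f(data), repeat=repeat, number=number)
        timings[name] = min(runs) / number
    return timings


sample_keys = [f"C{i % 200}" for i in range(2000)]
results = compare_patterns(
    {"set": dedupe_with_set, "list": dedupe_with_list},
    sample_keys,
)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1000:.3f} ms per call")
```

The correctness check before the measurement matters as much as the timing itself: a faster pattern that delivers different results is not an alternative. The printed figures, together with the parameters used, are the kind of result worth recording for later reference.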
The results of these experiments should be recorded on a project website or wiki. This way, the same experiments do not have to be designed again and again. A regular exchange about what was helpful and which solutions were rejected helps to disseminate knowledge. This can be a simple exchange of knowledge or a regular retrospective. Documenting these sessions is by no means an end in itself. It is often sufficient to document the agenda, the speakers, the participants and the slides used, together with photos of the whiteboards or flip charts. Then it is clear whom to ask, and no watertight documentation is needed. The exchange of experiences as a living process of experimental learning is more important than complete documentation.
These activities should lead to a common development environment. Now the documentation becomes important. The development environment must be well documented. It serves as a reference book and helps to introduce new employees. The development environment with the respective guidelines is created based on the decisions made. Making the development process comprehensible and repeatable saves later projects from starting from scratch again and the focus can thus – with much less effort – be on the further development of the development process.
Action 2: Management
The task of management is to obtain a budget for the development environment and to defend it in case of project difficulties. The development environment ensures a steep learning curve. If this is abandoned, future projects will not be faster. And in the end, everyone expects extra costs the first time, even if this is not admitted. If subsequent projects go well, initial difficulties are soon forgotten.
A budget for training and for trying things out does no harm either. However, it is much more important to make this exchange and the experiments possible, and to create an atmosphere in the team that invites people to try things out while staying strongly goal-oriented. In Scrum this task falls to the Scrum Master. Someone is needed who explicitly takes care of all obstacles that stand in the way of the project goal and the new development environment.
Action 3: Experienced employees
An experienced employee knows what the implementation should look like. If he only tells his colleagues what the solution should look like, they can only execute. This creates knowledge monopolies. Ideally, the employees themselves should become the experienced employees.
A good coach points out known solutions and helps to evaluate the advantages and disadvantages in this environment. He accelerates learning by helping people to help themselves. In this way he develops new experienced employees. The transfer of knowledge is guaranteed.
Conclusion
Data Vault offers many advantages. The problems with adaptation are well known and often lead to loud discussions. It is often forgotten that this is a change. Known patterns have to be abandoned and new ways established. Such changes are particularly difficult when the wealth of experience is already considerable.
A successful Data Vault initiative needs a strong focus on exploring the new. The findings flow into a development procedure and this procedure must be defined and documented as a new development environment. This is the only way to ensure that the knowledge gained is available for the next projects. This development environment will continue to develop with subsequent projects.
Management must ensure that the team can learn. Ideally, this is ensured through a role that addresses the problems in day-to-day business and – of course – through sufficient budgets.