Connecting a new data source for the data warehouse should be a simple process. However, each source system has its own requirements for exporting and transporting the data. In addition to that, there is no standard for the design of an interface to BI.
So how do you keep an interface simple? Or is a complex solution needed that meets all requirements? Starting from a simple interface, typical problems of a data interface are introduced one by one, and the solution is adapted step by step. This creates a decision path for connecting a new data source and implementing its interface.
Terminology
An interface to a data source can be determined by the type and scope of the delivery.
This results in the following possibilities:
The simple interface
A simple interface exports data at the source and imports it into the target database. There is no delta detection; the data is pulled at a certain point in time (snapshot).
This interface becomes even easier if the copy can be made directly in the database, i.e. from one database instance to the other. If this is not possible, a routine for importing the data must be found. Unfortunately, not all databases offer a suitable solution for this.
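As an illustration, such a snapshot export and import can be as small as two statements. The following sketch uses PostgreSQL's COPY command with assumed table and file names; other databases have their own bulk loaders:

```sql
-- Export a full snapshot at the source ...
COPY customer TO '/exports/customer.csv' WITH (FORMAT csv, HEADER);

-- ... and import the file into the stage of the target database.
COPY stage_customer_full FROM '/exports/customer.csv' WITH (FORMAT csv, HEADER);
```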
This simple interface provides no information about inserts, updates or deletes of records. The load patterns for Data Vault ensure that only new or changed data is transferred. In the resulting Data Vault tables, however, we only see that a satellite entry for a certain business key has no changes for this load date timestamp. Whether this is a delete or just an unchanged record cannot be detected from a single satellite record. To be sure that a business key was indeed deleted, one needs the information whether it was delivered at all. In Data Vault this is done by the record tracking satellite. It is attached to a hub or link and contains a record for each business key that was delivered at a certain load date timestamp. With this satellite one can deduce for which load date timestamps a record is valid.
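As a minimal sketch, assuming hypothetical names such as hub_customer, rts_customer and load_dts, a record tracking satellite and the deduction of a delete could look like this:

```sql
-- Record tracking satellite: one row per business key (hash key) and
-- load date timestamp on which that key was delivered.
CREATE TABLE rts_customer (
    hub_customer_hk  CHAR(32)   NOT NULL,  -- hash key referencing hub_customer
    load_dts         TIMESTAMP  NOT NULL,  -- load date timestamp of the delivery
    PRIMARY KEY (hub_customer_hk, load_dts)
);

-- Business keys that are known in the hub but missing from the latest
-- delivery can be deduced as deleted in the source.
SELECT h.hub_customer_hk
FROM   hub_customer h
WHERE  NOT EXISTS (
         SELECT 1
         FROM   rts_customer r
         WHERE  r.hub_customer_hk = h.hub_customer_hk
           AND  r.load_dts = (SELECT MAX(load_dts) FROM rts_customer)
       );
```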
Barriers to a simple interface
The first solution is – as promised – very simple. The most common problems that stand in the way of such a simple approach are:
Data needs to be kept in the stage for n months
Performance problems while pulling the data
All changes are needed
Direct integration into the architecture of the source systems: messaging and events
Export needs to be done by the source system
Non-normalized data
Non-persisted data in the source system
Data needs to be kept in the stage for n months
If the data is kept for several months, e.g. to make problems with loading the data traceable, then it makes sense to save space and store only the changes, i.e. the delta. Storage space is getting cheaper and cheaper, but it still costs a lot of money. If only about 30% of the data changes, then 70% of the space can be saved with each dump. Data Vault's loading patterns then have only about 30% of the records in the input and are therefore faster when loading. However, this does not reduce the total loading time: a full comparison of all data is still required to determine the delta.
The delta can be determined with a simple SQL pattern. The code for this can be generated with the help of metadata (templating). A stage that contains the complete delta of all deliveries is called a Persistent Staging Area (PSA). The PSA has some interesting advantages, which make virtualization or prototyping easier. A very good summary of the advantages of the PSA can be found in the following article: http://roelantvos.com/blog/why-you-really-want-a-persistent-staging-area-in-your-data-vault-architecture.
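A sketch of such a delta pattern is shown below. The table and column names (stage_customer_full, psa_customer, hash_diff) are illustrative assumptions; the structure of the statement is the same for every table and can therefore be generated from metadata:

```sql
-- Keep only new or changed records in the PSA: compare the full snapshot
-- against the latest PSA version of each business key via a hash diff.
INSERT INTO psa_customer (customer_id, hash_diff, payload, load_dts)
SELECT s.customer_id,
       s.hash_diff,
       s.payload,
       CURRENT_TIMESTAMP
FROM   stage_customer_full s
LEFT JOIN (
         SELECT p.customer_id, p.hash_diff
         FROM   psa_customer p
         WHERE  p.load_dts = (SELECT MAX(p2.load_dts)
                              FROM   psa_customer p2
                              WHERE  p2.customer_id = p.customer_id)
       ) cur
  ON cur.customer_id = s.customer_id
WHERE  cur.customer_id IS NULL          -- new business key
   OR  cur.hash_diff  <> s.hash_diff;   -- changed record
```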
Performance problems while pulling the data
The direct export of data can have a massive impact on the performance of an operational system. If there is no timeframe for an export within the nightly batch processing, if the extraction has to happen several times a day, or if the system is used globally so that a batch export is not possible without performance losses, then a tool for change data capture (CDC) or for mirroring the data is needed.
Systems for CDC or for mirroring the data evaluate the logs of the database and enable copying the data without loss of performance. In this case, it makes sense to have only the delta, i.e. the changes, delivered. These tools read the database log, which records the changes, and therefore already have the information about the delta. This saves time when loading the Core Warehouse, since the change detection pattern for loading the Data Vault runs with fewer records in the stage.
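The staging structure delivered by such a tool typically looks roughly like the following sketch; the exact columns depend on the CDC product, and the names here are assumptions:

```sql
-- CDC staging table: only changed rows arrive, together with the operation
-- that the tool has read from the database log.
CREATE TABLE stage_customer_cdc (
    customer_id    VARCHAR(20)   NOT NULL,
    payload        VARCHAR(4000),
    cdc_operation  CHAR(1)       NOT NULL,  -- 'I' insert, 'U' update, 'D' delete
    cdc_lsn        BIGINT        NOT NULL,  -- position in the database log
    load_dts       TIMESTAMP     NOT NULL
);
```

Because the stage already contains only the delta, no full comparison against the previous snapshot is needed before loading the Raw Vault.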
Further details on CDC and a small overview of available systems can be found in https://medium.com/daimler-tss-tech/change-data-capture-lessons-learnt-7976391cf78d.
All changes are needed
If the data is always transferred as a complete load, only snapshots are available. If a data record has been changed two or more times between snapshots, the individual changes have already been overwritten and are therefore not available.
Whether this loss can be accepted depends on the technical requirements. Often the loss of data is irrelevant, as the data was only valid for a short time between two loading cycles. Sometimes, however, it is important to document all state changes. In this case, a complete change history is needed. Here, a mechanism is needed that reliably provides all changes, such as a CDC tool, an event-driven interface, or a program to export the data.
Direct integration into the architecture of the source systems: messaging and events
Modern architectures use message systems to send processing requests from one subsystem to other subsystems. It makes sense to connect the data warehouse in the same way: via specific events. When using a message queue to deliver data to BI, the data of the message can be loaded directly into the core warehouse. In Data Vault, the Raw Vault is the complete copy of all data. Thus, there is no reason to keep the data in a Stage. Collecting the events in the stage and then loading them collectively at a point in time also makes little sense: if you receive the data in real time, you should also be able to evaluate it just as promptly.
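A sketch of such a direct load from a message into the Raw Vault is shown below. The names are hypothetical, hashing is simplified to MD5, and the bind variables stand for the fields of the incoming event:

```sql
-- Insert the business key of the event into the hub, if it is not known yet.
INSERT INTO hub_order (hub_order_hk, order_id, load_dts, record_source)
SELECT MD5(:order_id), :order_id, :event_ts, 'order-events'
WHERE  NOT EXISTS (SELECT 1 FROM hub_order h WHERE h.order_id = :order_id);

-- Insert the descriptive data of the event into the satellite, if it changed.
INSERT INTO sat_order (hub_order_hk, load_dts, hash_diff, order_status, amount)
SELECT MD5(:order_id), :event_ts, MD5(:order_status || :amount),
       :order_status, :amount
WHERE  NOT EXISTS (SELECT 1
                   FROM   sat_order s
                   WHERE  s.hub_order_hk = MD5(:order_id)
                     AND  s.hash_diff    = MD5(:order_status || :amount));
```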
Keeping the data in parallel in the Stage can be useful when using a PSA. In that case there is a second loading process running in parallel to loading the data into the core warehouse.
Export needs to be done by the source system
Sometimes there is no direct access to the data of the source system, whether for legal reasons or because the data is not persisted. This often occurs with SaaS solutions. For most SaaS systems there are ready-made tools for extracting data, or there are interfaces through which the data can be accessed.
What is needed here is a clear interface agreement and a service level agreement. It is important to find a long-term sustainable solution that will also support future developments. Unfortunately, this is often an individual solution of the respective provider of the SaaS solution. A standard would make maintenance and servicing easier.
Non-normalized data
Not all source systems store normalized data. Standardized interfaces are often not relational, and data formats such as XML or JSON are usually denormalized. While in one document the data of a repeating element (in JSON: an array or object) is in a 1:n relationship to the key of the document, there may well be a second document in which the subordination is reversed. For example, one document may contain a customer with multiple contracts, while a second document contains a contract with multiple customers.
This can only be resolved with additional information, such as the logical data model. Based on the logical data model, decisions can then be made as to how the data is to be stored in the stage in order to ensure a simple and – ideally – automated transfer of the data.
Modern relational databases can store JSON and XML. If one uses these elements to load data into the Raw Vault, the loading patterns must contain a filter to load the correct parts of the document. It might be simpler to just normalize the data in the Stage. The normalization can be done virtually via views on the JSON or XML objects stored in the database.
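For example, a view in the stage could flatten the contract array of a customer document. The following sketch uses PostgreSQL's JSON functions and assumed names:

```sql
-- Virtual normalization in the stage: one row per contract, extracted from
-- the customer document that is stored as JSON in the stage table.
CREATE VIEW stage_customer_contract AS
SELECT s.doc ->> 'customer_id'  AS customer_id,
       c     ->> 'contract_id'  AS contract_id,
       c     ->> 'valid_from'   AS valid_from
FROM   stage_customer_json s
CROSS JOIN LATERAL jsonb_array_elements(s.doc -> 'contracts') AS c;
```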
A deviation from this rule is useful when dealing with highly volatile data structures that are delivered for an event or business object. This data can be stored as a JSON structure in a satellite. The decision about the actual evaluation structure is then made at the time of the evaluation, when it is clear which data sets are selected and what information is available or desired.
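A sketch of such a satellite, again with assumed names and PostgreSQL JSON syntax: the payload is stored as delivered, and the structure is only interpreted at query time.

```sql
-- Satellite that keeps a highly volatile event payload as JSON.
CREATE TABLE sat_device_event (
    hub_device_hk  CHAR(32)   NOT NULL,
    load_dts       TIMESTAMP  NOT NULL,
    payload        JSONB      NOT NULL,   -- volatile structure, stored as delivered
    PRIMARY KEY (hub_device_hk, load_dts)
);

-- The evaluation decides at query time which attributes to extract.
SELECT hub_device_hk,
       load_dts,
       payload ->> 'temperature' AS temperature,
       payload ->> 'firmware'    AS firmware
FROM   sat_device_event
WHERE  payload ? 'temperature';   -- only events that carry this attribute
```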
Non-persisted data in the source system
It happens that a system does not store relevant data but only calculates it at run-time. The results of the calculation are not in the database; the data is simply not available.
In this case, the source system must implement a data export using its own algorithms. If the data warehouse recalculates this data itself on snapshot data, the logic is developed twice and, whenever it changes, maintained twice. This also increases the susceptibility to errors due to deviations between the two implementations. Therefore, the operational system must either program a data export or persist the data. Everything else is just too expensive.
A complete extract is not always the solution
The various requirements result in several different solutions. Mixed forms often occur, resulting in even more solution variants. For example, when exporting through the source, it may be necessary to normalize the data or to save only the delta. In the data from a CDC tool, there are supertype tables such as partners, which may also need to be split up by normalization. To keep it simple, only some common requirements and solutions are described here. These solutions already form 7 groups.
For a simple interface, the treatment of the data should be as uniform as possible. If deviations are necessary, they should be locally limited. In this way, subsequent processing steps remain independent of the solution in the interface. If an employee changes from one data source to another, the employee should only have to deal with the differences in content. The handling of the data should be the same. These measures sustainably reduce complexity in the data warehouse.
So where are the differences in the process between these 7 groups?
One clear, obvious difference is the direct loading from the message queue into the Raw Vault. There is no way around this difference. The advantage in the speed of loading must be realized in order to make near-real-time evaluations possible.
The central loading patterns for hub, link and satellite for all 7 groups are the same. But there are differences in the status satellites. For snapshot data, the delivery history is kept in the Record Tracking Satellite. For CDC processing, on the other hand, the CDC indicator with Insert, Update or Delete is stored in a Status Tracking Satellite.
This is a big difference when accessing the satellites. Depending on the scope of delivery, the data in the satellite is accessed differently. Fortunately, the question of how to access a satellite answers itself: whether the delivery was CDC or snapshot is indicated by the type of tracking satellite.
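The difference shows up directly in the delete detection. Assuming a status tracking satellite sts_customer that stores the CDC operation, a delete can be read explicitly, in contrast to the deduction via the record tracking satellite shown earlier:

```sql
-- CDC delivery with a status tracking satellite: deletes are stored explicitly
-- together with the load date timestamp at which they were received.
SELECT s.hub_customer_hk,
       s.load_dts
FROM   sts_customer s
WHERE  s.cdc_operation = 'D';
```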
Nevertheless, the question remains how diverse the interfaces and the stage may become before they are perceived as too complex. At the same time, the data should be loaded quickly and easily, without creating additional work for the operational system.
For solutions, an active design of data interfaces is needed. You can read more about this soon here in another blog article.