Actively designing data interfaces pays off

by Michael Müller

For interfaces, too, standards reduce complexity and lower the effort. The fewer exceptions there are, the easier maintenance and servicing become, and the more quickly new employees find their way around. Standardisation is the foundation on which automation is built.

The blog post ‘Simple and fast interfaces for BI’ explained when the simplest and fastest form of a data interface, the snapshot, is not sufficient. The following picture emerged:

Some of the interfaces described in this picture affect data retrieval, because either the CDC indicator or the delivery history is stored in the tracking satellite. Normalisation of the data can take place on the way to the stage or when loading into the Raw Vault. A structure like the one described above has many peculiarities and is suspiciously reminiscent of the organically grown structures of a data warehouse that has been running for many years. If complexity is to remain low here, more uniformity is needed. The fewer exceptions an employee has to keep in mind, the easier it is to train new employees or to have other team members take over work packages during peak loads.

How can the interface be simplified and the processes standardised?

Not always a simple solution: The source system provides the changed data

Why not delegate the whole problem and hand it over to the source systems?

Especially against the background of data model changes and the resulting changes to the deliveries, it makes sense to anchor this responsibility entirely with the source system. While changes or new features are implemented in the source system, their effect on the interface can be checked and, if necessary, the interface adapted.

Thus, the source system determines which data is passed on to the data warehouse. And this is precisely where a major conflict lies. The development work for the data export competes with the development of functionality for the source system. If the commissioning stakeholder is not also responsible for the data warehouse, the data export sometimes gets down-prioritised. The data is then not available in time for evaluations.

In addition, with modern architectures the need for batch processing is disappearing, and fewer and fewer developers have experience with batch processing or batch interfaces. There is no tool that helps the source system create these exports. For determining the CDC delta, on the other hand, tools do exist. They typically belong in the realm of BI, as they require no information about the operational system other than the database structure. This way, one team with the tool know-how can take care of all exports.

For these reasons, having the source systems provide the changed data is generally not a recommended solution.


CDC as a workhorse

A CDC tool solves the performance problems that a database dump may cause for the operational database. With automatic delta formation, all changes are transferred, and there is a clear indicator when data has been deleted.

If a CDC tool has already been purchased or can be obtained cheaply, it should be used for all sources whose database is supported by that tool. This limits the database dump to systems for which no CDC tool is available.


The Stage as a Level of Abstraction

To reduce complexity, the number of interface types should be kept as low as possible. The seven groups can be reduced to four interface types.

If the delta in the stage is made mandatory, the loading patterns in the Raw Vault will be identical. The Record Tracking Satellite is no longer required, and the developer does not need to know anything about the interface when accessing the satellite data.
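A rough sketch of what such a uniform load pattern could look like, assuming a delta stage table that already carries a CDC indicator per record (all table and column names here are made up for illustration, and hashing and load metadata are simplified):

  -- Illustrative, simplified satellite load from a delta stage.
  -- Because the stage contains only changed records with a CDC indicator,
  -- the same statement works regardless of how the source delivered the data.
  INSERT INTO sat_customer (customer_hk, load_date, cdc_indicator, name, address)
  SELECT stg.customer_hk,
         stg.load_date,
         stg.cdc_indicator,
         stg.name,
         stg.address
  FROM   stage_customer_delta stg
  WHERE  NOT EXISTS (                      -- keeps the load re-runnable
           SELECT 1
           FROM   sat_customer sat
           WHERE  sat.customer_hk = stg.customer_hk
           AND    sat.load_date   = stg.load_date);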

This simplifies the input picture. Before delta formation, the data must be normalised to the primary key so that an identifier can be determined for each record. Subordinate 1:n groups must be separated out and receive their own CDC indicator. Normalisation is added as an optional clean-up step.
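As an illustration, assume a delivery in which each record repeats the order header for every order position (all names are hypothetical). Before the delta formation, it is split into two stage tables, each with its own primary key, so that each can later receive its own CDC indicator:

  -- Header level: one record per order, identified by order_id
  INSERT INTO stage_order (order_id, customer_id, order_date)
  SELECT DISTINCT order_id, customer_id, order_date
  FROM   delivery_order_raw;

  -- Subordinate 1:n group: one record per position, identified by order_id + position_no
  INSERT INTO stage_order_position (order_id, position_no, article_id, quantity)
  SELECT order_id, position_no, article_id, quantity
  FROM   delivery_order_raw;

  -- The delta formation described further below then determines an
  -- insert/update/delete indicator separately for each of the two primary keys.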

This leads to the following picture of Data Interface and Stage:

Almost all data is loaded in the same way, and the stage acts as an abstraction layer. Unfortunately, this is not possible for deliveries via a message system. The message system must still be able to load data directly into the Raw Vault, as this is the best way to support near-real-time evaluations. Storing messages as individual records in the stage can lead to delays.

When using a PSA (Persistent Staging Area), the data from the message queue would for this reason also be loaded into the stage in parallel.

This allows the seven groups and their hybrid forms to be mapped onto this smaller picture. The stage becomes an abstraction layer: the data is stored as insert, update and delete relative to the respective primary key. Depending on individual requirements, these four groups can be reduced further, and it is clear how new requirements are to be implemented.

The abstraction is bought with an additional work step, as the comparison of the complete data happens twice. How much this slows down the entire processing chain depends on the individual case. If a stage table fills three satellites and the delta comparison reduces the number of records to 25%, then instead of comparing 3 × 100% of the records, only 1 × 100% and 3 × 25% of the records have to be compared. The frequency of change and the total number of records to be loaded are the determining factors here. Hubs are loaded faster because only inserts are compared; for links, the same applies as for satellites. Hubs and links matter less here anyway, because the complete comparison of all attributes in the satellites is the biggest effort.


Persistent Staging Area (PSA)

If you save all delivered data in the stage, you have implemented a Persistent Staging Area (PSA). Thanks to the PSA, all data can be reloaded at any time. The core warehouse can thus be quickly adapted to new requirements and findings. This is especially helpful in the first weeks and months after connecting a new data source. New solutions can be tried out quickly on the PSA, virtualised or persisted. Later adaptations to the architecture can be implemented by reloading the data, without the need for transformation scripts. More about the PSA can be found at http://roelantvos.com/blog/why-you-really-want-a-persistent-staging-area-in-your-data-vault-architecture.
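As a sketch of what this could look like, a PSA table simply keeps every delivered state per primary key together with the load timestamp and the CDC indicator (table, column names and data types are illustrative):

  -- Illustrative PSA table: the complete delivery history per primary key.
  CREATE TABLE psa_customer (
      customer_id    VARCHAR(50)  NOT NULL,   -- primary key of the source record
      load_date      TIMESTAMP    NOT NULL,   -- when this state was delivered
      cdc_indicator  CHAR(1)      NOT NULL,   -- I / U / D from the delta formation
      name           VARCHAR(200),
      address        VARCHAR(500),
      PRIMARY KEY (customer_id, load_date)
  );

  -- Reloading the Raw Vault later only means replaying the history in load order:
  SELECT * FROM psa_customer ORDER BY load_date, customer_id;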

However, this also creates another place, besides the core warehouse, where all data is available. One can argue at length about the advantages and disadvantages; it is ultimately a question of the requirements and the available resources. But even if you do not keep the data fully duplicated and retain it only for a certain period of time or only for new data sources, you already gain many of the advantages of a PSA.

Only the fast availability of data counts

Nevertheless, the only thing that counts in BI is the fast availability of data. Does this abstraction layer cost time? Will it require a high implementation effort?

The described normalisation must take place on the way into the Raw Vault in any case. Additional implementation effort is only required for the delta formation. This can be implemented as a simple SQL set operation between the current and the last delivery (a sketch follows after the list):

  • only in the current delivery: Insert
  • only in the last delivery: Delete
  • in the intersection and changed: Update
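A minimal sketch of this set operation, assuming two staged copies of the delivery, delivery_current and delivery_previous, keyed by a column pk and with the payload condensed into a hash column attrib_hash (all names are illustrative; using a hash or a NULL-safe comparison avoids problems with NULL attributes):

  -- Inserts: records that exist only in the current delivery
  SELECT cur.pk, cur.attrib_hash, 'I' AS cdc_indicator
  FROM   delivery_current cur
  LEFT JOIN delivery_previous prv ON prv.pk = cur.pk
  WHERE  prv.pk IS NULL

  UNION ALL

  -- Deletes: records that exist only in the last delivery
  SELECT prv.pk, prv.attrib_hash, 'D' AS cdc_indicator
  FROM   delivery_previous prv
  LEFT JOIN delivery_current cur ON cur.pk = prv.pk
  WHERE  cur.pk IS NULL

  UNION ALL

  -- Updates: records in the intersection whose attributes have changed
  SELECT cur.pk, cur.attrib_hash, 'U' AS cdc_indicator
  FROM   delivery_current cur
  JOIN   delivery_previous prv ON prv.pk = cur.pk
  WHERE  cur.attrib_hash <> prv.attrib_hash;

In practice, the payload columns themselves would be selected as well; the hash only serves as a compact change detector.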

With some metadata and a template, this SQL can be generated, so the implementation effort remains low. The processing time depends on the number of satellites, links and hubs and can improve or worsen overall performance. Only in extreme cases will performance be an obstacle; if such a case occurs, it becomes a question of using a CDC tool to reduce the amount of data at the source.

The fast availability of data is also a criterion when connecting a new data source for the first time. In these cases, the complete database dump is unbeatable in terms of speed: it is set up quickly and gets data into the data warehouse quickly. A second database dump can be set up just as fast in order to load changes correctly into the data warehouse. In this way, a system that reliably delivers data on a regular basis is quickly in place. The delta formation in the stage makes it possible to quickly eliminate early errors by reloading the data. If, for example, a CDC tool becomes available later, the interface can simply be exchanged.

Abstraction in the stage does not cost much, creates order and facilitates refactoring. It creates sustainable added value and should always be included in the development guidelines.

Would you like to learn more? Just contact me!
