ETL data integration in practice: no unnecessary theory
The topic of low-code is quite broad, especially when it comes to ETL data integration solutions, but here we would like to talk about the practical application of low-code concepts.
Applied experience with low-code
The Big Data Solutions division of Neoflex largely specializes in the financial sector, building data warehouses and data lakes and automating various kinds of reporting. In this niche, the use of low code has long been the standard. Among low-code tools one can mention tools for organizing ETL processes, such as Informatica PowerCenter, IBM DataStage, and Pentaho Data Integration, or Oracle APEX, which serves as an environment for the rapid development of data access and editing interfaces. However, using low-code development tools does not always mean building narrowly targeted applications on a commercial technology stack with a pronounced vendor dependency.
Low-code platforms can also be used to orchestrate data streams, create data-science platforms, or, for example, modules for checking data quality.
One of the applied examples of the use of low-code development tools is Neoflex’s collaboration with Mediascope, one of the leaders in the Russian media research market. One of the tasks of this company’s business is the production of data, on the basis of which advertisers, Internet sites, TV channels, radio stations, advertising agencies, and brands make decisions about buying advertising and planning their marketing communications.
Media research is a technology-intensive area of business. Video recognition, collecting data from devices that analyze viewing, and measuring activity on web resources all require a large IT staff and tremendous experience in building analytical solutions. The exponential growth in the amount of information and in the number and variety of its sources keeps the data industry in constant motion. The easiest way to scale the already functioning Mediascope analytics platform might have been to increase the IT staff, but a much more effective solution is to speed up the development process itself. One of the steps in this direction could be the use of low-code platforms.
At the start of the project, the company already had a working product solution. However, the existing MSSQL-based implementation could not fully meet the expectations for scaling functionality while keeping the cost of changes acceptable.
The task we faced was truly ambitious: Neoflex and Mediascope had to create an industrial solution in less than a year, with an MVP available within the first quarter after the start.
The Hadoop technology stack was chosen as the foundation for building the new data platform around low-code computing. HDFS with Parquet files became the data storage standard. Data in the platform is accessed through Hive, in which all available data marts are represented as external tables. Data loading into the storage was implemented with Kafka and Apache NiFi.
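To make this stack layout more tangible, here is a minimal Spark Scala sketch of how a data mart could be published on it: Parquet files are written to HDFS and then exposed to Hive as an external table. All paths, database, table, and column names below are hypothetical, and the Kafka/NiFi ingestion is assumed to have already landed the staging data.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: publish a data mart as Parquet on HDFS and expose it to Hive.
// All paths, database, table, and column names are hypothetical.
object PublishMartSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("publish-mart-sketch")
      .enableHiveSupport() // so the external table is visible through Hive
      .getOrCreate()

    // Staging data assumed to have been landed by the Kafka / Apache NiFi ingestion
    val staged = spark.read.parquet("hdfs:///data/staging/tv_viewing")

    // Keep only the columns the mart needs, so the DDL below matches the files
    val mart = staged.selectExpr("household_id", "channel_id", "view_start", "view_end")

    val martPath = "hdfs:///data/marts/tv_viewing_daily"
    mart.write.mode("overwrite").parquet(martPath)

    // Register the mart as an external table so analysts can query it from Hive
    spark.sql(
      s"""CREATE EXTERNAL TABLE IF NOT EXISTS marts.tv_viewing_daily (
         |  household_id STRING,
         |  channel_id   STRING,
         |  view_start   TIMESTAMP,
         |  view_end     TIMESTAMP)
         |STORED AS PARQUET
         |LOCATION '$martPath'""".stripMargin)

    spark.stop()
  }
}
```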
The low-code tool in this setup was used to optimize the most labor-intensive task in building an analytics platform: the calculation of data.
The obvious advantage of this approach is the acceleration of the development process. However, in addition to speed, there are also the following advantages:
- Viewing the content and structure of sources/receivers;
- Tracing the origin of data stream objects to individual fields (lineage);
- Partial execution of transformations with a review of intermediate results;
- Previewing source code and correcting it before execution;
- Automatic validation of transformations;
- Automatic one-to-one data loading.
The entry threshold for low-code transformation-generation solutions is rather low: the developer needs to know SQL and have experience with ETL tools. Note, however, that code-driven transformation generators are not ETL tools in the broad sense of the word. Low-code tools may not have their own code execution environment; that is, Visual Flow assumes that the generated code will be executed in the environment that was already present on the cluster before the low-code solution was installed. And this is probably another plus in the karma of low-code, because a “classic” team that implements functionality, for example, in pure Scala code can work in parallel with the low-code team, and the results of both teams can be integrated into the product easily and “seamlessly”.
Perhaps it is also worth noting that, in addition to low-code, there are no-code solutions, and in essence these are different things. Low code is more about allowing the developer to intervene in the generated code: in the case of Datagram, you can view and edit the generated Scala code, whereas a no-code tool may not provide this possibility. This difference is very significant not only in terms of the flexibility of the solution but also in terms of the comfort and motivation of data engineers.
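For illustration only, here is a small sketch of the kind of Spark Scala code such a transformation generator might emit. The actual output of Datagram/Visual Flow will look different, and every name, path, and column below is an assumption.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical example of generator-style Spark code: a readable, editable job
// that runs on whatever Spark environment already exists on the cluster.
object ViewingSessionsJob {

  // The transformation itself: aggregate raw viewing intervals into a daily mart
  def transform(events: DataFrame): DataFrame =
    events
      .filter(col("duration_sec") > 0)                     // drop zero-length intervals
      .withColumn("view_date", to_date(col("view_start"))) // derive a calendar date
      .groupBy("household_id", "channel_id", "view_date")
      .agg(sum("duration_sec").as("total_viewing_sec"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("viewing-sessions-sketch").getOrCreate()

    val events = spark.read.parquet("hdfs:///data/staging/viewing_events") // hypothetical input
    transform(events)
      .write.mode("overwrite")
      .parquet("hdfs:///data/marts/viewing_sessions")      // hypothetical output mart

    spark.stop()
  }
}
```

The point of the sketch is only that the generated artifact is ordinary Spark code, which is what makes parallel work by a low-code team and a “classic” Scala team realistic.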
Solution architecture
Let’s try to understand how exactly the low-code tool helps speed up the development of data calculation functionality. First, let’s analyze the functional architecture of the system. In this case, our example is the data production model for media research.
- Peoplemeters (TV meters) are hardware and software devices that register the behavior of TV panel respondents: who watched which TV channel, and when, in a household participating in the study. The delivered information is a stream of viewing intervals tied to a media package and a media product. At the stage of loading into the Data Lake, the data can be enriched with demographic attributes, geographic reference, time zone, and other information needed to analyze the TV viewing of a particular media product (a schematic record layout is sketched after this list). The measurements taken can be used to analyze and plan advertising campaigns, assess audience activity and preferences, and compile the broadcasting grid;
- Data can come from systems that monitor streaming TV and measure the viewing of video content on the Internet;
- Measuring tools in the web environment, including both site-centric and user-centric counters. Data providers for the Data Lake here can be a research-bar browser add-on and a mobile application with a built-in VPN;
- Data can also come from sites that consolidate the results of online questionnaires and of telephone interviews conducted as part of company surveys;
- Additional enrichment of the data lake can take place by downloading information from partner companies’ logs.
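As promised above, here is a purely illustrative Scala sketch of how one enriched viewing interval might be modeled when it lands in the Data Lake. The real Mediascope schema is not described in this article, so every field below is an assumption.

```scala
import java.sql.Timestamp

// Hypothetical record layout for one enriched TV viewing interval in the Data Lake.
final case class ViewingInterval(
  householdId: String,              // household participating in the panel study
  respondentId: String,             // panel respondent whose viewing was registered
  mediaProductId: String,           // reference to the media product
  mediaPackageId: String,           // reference to the media package
  channelId: String,                // TV channel that was watched
  viewStart: Timestamp,             // beginning of the viewing interval
  viewEnd: Timestamp,               // end of the viewing interval
  timeZone: String,                 // time zone used for enrichment
  demographics: Map[String, String] // demographic attributes added at load time
)
```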
