ETL data integration in practice: hands-on experience, no unnecessary theory

The topic of low-code is quite broad, especially in the context of ETL data integration solutions, but here we would like to focus on the practical application of low-code concepts.

Applied experience with low-code

The Big Data Solutions division of Neoflex largely specializes in the financial sector, building data warehouses and data lakes and automating various kinds of reporting. In this niche, the use of low-code has long been the standard. Among low-code tools, one can mention tools for organizing ETL processes: Informatica PowerCenter, IBM DataStage, and Pentaho Data Integration. Or Oracle APEX, which serves as an environment for rapid development of data access and editing interfaces. However, the use of low-code development tools does not always imply building narrowly targeted applications on a commercial technology stack with a clear vendor dependency.

Low-code platforms can also be used to orchestrate data streams, create data-science platforms, or, for example, modules for checking data quality.

One of the applied examples of the use of low-code development tools is Neoflex’s collaboration with Mediascope, one of the leaders in the Russian media research market. One of the tasks of this company’s business is the production of data, on the basis of which advertisers, Internet sites, TV channels, radio stations, advertising agencies, and brands make decisions about buying advertising and planning their marketing communications.

Media research is a technology-intensive area of business. Video recognition, collecting data from devices that analyze viewing, and measuring activity on web resources – all this requires a large IT staff and tremendous experience in building analytical solutions. But the exponential growth in the amount of information, and in the number and variety of its sources, forces the data industry to progress constantly. The simplest way to scale the already functioning Mediascope analytics platform might have been to increase the IT staff. But a much more effective solution is to speed up the development process itself. One of the steps in this direction could be the use of low-code platforms.

At the start of the project, the company already had a functioning product solution. However, the implementation of the solution in MSSQL could not fully meet the expectations of scaling functionality while maintaining an acceptable cost of refinement.

The task we faced was truly ambitious – Neoflex and Mediascope had to create an industrial solution in less than a year, with an MVP expected within the first quarter after the start date.

The Hadoop technology stack was chosen as the foundation for building the new low-code data platform. The data storage standard was HDFS with Parquet-format files. To access the data in the platform, Hive was used, in which all available data marts (storefronts) are represented as external tables. Data loading into the repository was implemented with Kafka and Apache NiFi.
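To make this concrete, the sketch below shows how a Parquet data mart sitting in HDFS might be exposed through Hive as an external table, as described above. The table name, columns, and HDFS path are hypothetical examples, not taken from the Mediascope platform:

```python
# Sketch: exposing a Parquet data mart in HDFS as a Hive external table.
# Table, columns, and HDFS path are hypothetical illustrations.

def external_table_ddl(table, columns, hdfs_path):
    """Build a Hive DDL statement for an external table over Parquet files."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{hdfs_path}';"
    )

ddl = external_table_ddl(
    "tv_viewing_mart",
    [("household_id", "BIGINT"), ("channel", "STRING"), ("watched_at", "TIMESTAMP")],
    "/data/lake/marts/tv_viewing",
)
print(ddl)
```

Because the tables are external, dropping them in Hive leaves the Parquet files in HDFS untouched – the storage layer stays decoupled from the access layer.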

The low-code tool in this concept was used to optimize the most time-consuming task in building an analytics platform – data calculation.

The obvious advantage of this approach is the acceleration of the development process. However, in addition to speed, there are also the following advantages:

  • Viewing the content and structure of sources/receivers;
  • Tracing the origin of data stream objects to individual fields (lineage);
  • Partial execution of transformations with a review of intermediate results;
  • Previewing source code and correcting it before execution;
  • Automatic validation of transformations;
  • Automatic one-to-one data loading.
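Field-level lineage, mentioned in the list above, is worth a quick illustration. A low-code tool records, for every target field, which source fields it was derived from, so the origin of any value can be traced. The sketch below is a minimal model of that metadata; the table and field names are invented for illustration:

```python
# Sketch: field-level lineage metadata of the kind a low-code ETL tool
# records automatically for each generated transformation.
# All table and field names are hypothetical.

lineage = {}  # target field -> set of (source table, source field) pairs

def record(target, sources):
    """Register that a target field is derived from the given source fields."""
    lineage.setdefault(target, set()).update(sources)

# Register two transformation steps.
record("mart.view_minutes", {("raw.tv_events", "start_ts"), ("raw.tv_events", "end_ts")})
record("mart.region", {("ref.households", "region_code")})

def trace(target):
    """Return the source fields a target field is derived from, sorted."""
    return sorted(lineage.get(target, set()))

print(trace("mart.view_minutes"))
```

In a real platform this metadata is persisted and queried through the tool's UI, which is what makes impact analysis ("which marts break if this source field changes?") cheap.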

The entry threshold for low-code solutions that generate transformations is rather low: the developer should know SQL and have experience working with ETL tools. Note, however, that code-generating transformation tools are not ETL tools in the broad sense of the word. Low-code tools may not have their own code execution environment. That is, in Visual Flow the generated code is executed in the environment that was already present on the cluster before the low-code solution was installed. And this is probably another plus in the karma of low-code: a "classic" team, implementing functionality in, for example, pure Scala code, can work in parallel with the low-code team, and it is easy to integrate the output of both teams into the product seamlessly.
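The core idea of such a transformation generator – turning a declarative mapping into executable code – can be sketched briefly. The mapping format below is invented for illustration, and real tools like Datagram emit Scala/Spark code rather than plain SQL, but the principle is the same:

```python
# Sketch: how a low-code transformation generator might render a declarative
# {target_field: source_expression} mapping into executable SQL.
# The mapping format and names are hypothetical.

def generate_select(source, mapping, filter_expr=None):
    """Render a SQL SELECT statement from a declarative field mapping."""
    select_list = ",\n  ".join(f"{expr} AS {target}" for target, expr in mapping.items())
    sql = f"SELECT\n  {select_list}\nFROM {source}"
    if filter_expr:
        sql += f"\nWHERE {filter_expr}"
    return sql

sql = generate_select(
    "tv_viewing_mart",
    {"household": "household_id", "day": "CAST(watched_at AS DATE)"},
    filter_expr="channel = 'News24'",
)
print(sql)
```

The generated code is ordinary SQL (or Scala, in Datagram's case), which is precisely why a developer who knows SQL can review and, if needed, hand-correct it before execution.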

Perhaps it is also worth noting that, in addition to low-code, there are no-code solutions – and in essence, these are different things. Low-code is more about allowing the developer to intervene in the generated code. In the case of Datagram, you can view and edit the generated Scala code, while a no-code tool may not provide this possibility. This difference is very significant not only in terms of the flexibility of the solution but also in terms of the comfort and motivation of data engineers in their work.

Solution architecture

Let’s try to understand how exactly the low-code tool helps speed up the development of data calculation functionality. First, let’s analyze the functional architecture of the system. In this case, our example is a model of data production for media research.

  • People meters (TV meters) are hardware-software devices that record the viewing behavior of TV panel respondents – who watched which TV channel, and when, in a household participating in the study. The delivered information is a stream of viewing intervals, with references to the media package and media product. At the stage of loading into the Data Lake, the data can be enriched with demographic attributes, geographic reference, time zone, and other information needed to analyze TV viewing of a particular media product. The measurements taken can be used to analyze or plan advertising campaigns, assess audience activity and preferences, and compile a broadcast grid;
  • Data can come from TV streaming monitoring systems and from systems measuring the viewing of video content on the Internet;
  • Measuring tools in the web environment, including both site-centric and user-centric counters. Data providers for the Data Lake can include a research browser add-on (toolbar) and a mobile application with a built-in VPN;
  • Data can also come from sites that consolidate the results of online questionnaires and of telephone interviews conducted in company surveys;
  • Additional enrichment of the data lake can take place by loading information from partner companies’ logs.