How to process Big Data efficiently?


Prof. Robert Wrembel, holder of a NAWA scholarship, attempts to answer this question in his research.

Prof. Robert Wrembel from Poznan University of Technology, a winner of the Bekker Programme, used his NAWA scholarship to work on a project developing methods for the efficient processing of large volumes of heterogeneous data (so-called big data). This work aims to make the integration and analysis of large data sets easier and faster. The researcher held an academic internship at Universitat Politècnica de Catalunya (UPC) - BarcelonaTech.


What is the idea of Data Science, which is within your scientific interests?

Prof. Robert Wrembel, Poznan University of Technology: Almost 30 years ago, science and business introduced the term On-Line Analytical Processing (OLAP) to refer to the basic techniques for analysing the data collected by companies. These techniques included, for instance, analysing past sales trends and predicting future ones using simple mathematical methods. Over time, more advanced data analysis techniques have been developed, including data mining, time series analysis, and sliding window analysis. As a result, a set of data analysis technologies called Business Intelligence (BI) emerged.
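To make this concrete, here is a toy sketch in Python of the kind of OLAP-style analysis described above: aggregating past sales and extrapolating a simple linear trend. It assumes pandas and NumPy as the toolkit, and the quarterly figures are invented for illustration.

```python
# Toy OLAP-style analysis: aggregate past sales, then predict the next
# period with a simple linear trend. All figures are hypothetical.
import numpy as np
import pandas as pd

sales = pd.DataFrame({"quarter": [1, 2, 3, 4],
                      "revenue": [100.0, 110.0, 125.0, 135.0]})

# Analysis of past trends: a basic aggregation.
print("mean revenue:", sales["revenue"].mean())

# Prediction with a simple mathematical method: fit a line, extrapolate to Q5.
slope, intercept = np.polyfit(sales["quarter"], sales["revenue"], deg=1)
print("forecast for Q5:", slope * 5 + intercept)
```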

Further extensions concern the use of machine learning (ML) algorithms in data analysis. These algorithms are currently applied not only to simple data (structured as records in a table), but also to much more complex data (e.g., texts, graphs). ML algorithms make it possible to build complex models that represent and predict trends, or models that enable the understanding of a text (e.g., building summaries, assessing sentiment). Complex data need to be pre-processed before they can be fed into ML algorithms. This pre-processing, and the subsequent analysis of the data, are usually complex steps (depending on the application field and the type of data) and are implemented in special architectures involving software and hardware.

In the Data Science world, these processing architectures are referred to as Data Processing Pipelines (DPPs). The term Data Science, introduced here, refers to both techniques for preparing data for analysis and the data analysis techniques themselves (e.g., the aforementioned algorithms for building prediction models). To sum up, Data Science is a set of technologies (software, data processing architectures, algorithms) that enable advanced analysis of heterogeneous data (in terms of models and structures) to discover non-obvious patterns and relationships in the data.
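As an illustration of a very small DPP, the following sketch chains a pre-processing step that turns raw text into numeric features with a simple sentiment-style prediction model. It assumes scikit-learn as the toolkit; the texts, labels, and step names are invented, not from the project.

```python
# A minimal Data Processing Pipeline: pre-processing (text -> numeric
# vectors) chained with an ML model. Data and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["great product, works well",
         "terrible, broke after a day",
         "very satisfied with the quality",
         "awful experience, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

pipeline = Pipeline([
    ("preprocess", TfidfVectorizer()),   # complex data -> numeric features
    ("model", LogisticRegression()),     # analysis step: prediction model
])
pipeline.fit(texts, labels)
print(pipeline.predict(["the quality is great"]))
```

A real DPP adds many more stages (ingestion, cleaning, feature engineering, validation), but the chaining principle is the same.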


What kind of research did you conduct as part of your NAWA scholarship?

The aim of the project was to develop methods for the efficient processing of large volumes of heterogeneous data - so-called big data. The research group at UPC that I joined has been working on this topic for years and has gained international recognition.

As part of the project, I was particularly involved in developing: techniques for discovering so-called metadata in sources that do not provide such data directly; algorithms for optimizing data pre-processing; and a generalized data processing architecture for Data Science.

Each of the aforementioned tasks was completed and the results were published. In addition, my fellow researchers and I published two review articles on the current state of technology in the area of data engineering, indicating open research problems.


How can the application of data engineering technologies, i.e. advanced database technologies and data management techniques, improve data processing?

There are two main challenges in data processing. First, at present we process data of multiple structures - from simple tabular data to complex graphs; these are called heterogeneous data. In order to analyse such data, it is necessary to integrate them, i.e. to unify their structures and remove errors and duplicates. These are the basic tasks of Data Science techniques. Unfortunately, they often cannot be done automatically, due to the complexity of the problem. Therefore, data integration issues are still among the most significant ones in the field of data engineering. Developing techniques to integrate data of non-uniform structures fully automatically would significantly reduce the time data scientists spend on integration. It should be noted (based on available studies) that between roughly 50% and 80% of the time in a Data Science project is spent on data pre-processing.
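The basic integration tasks mentioned above (unifying structures, removing errors and duplicates) can be sketched as follows. This is a minimal illustration assuming pandas, with hypothetical column names and records; it is not the project's actual techniques.

```python
# Minimal illustration of data integration: unify schemas, normalise
# values, drop incomplete records, remove duplicates. Data is invented.
import pandas as pd

# Two sources describing the same entities with different structures.
src_a = pd.DataFrame({"CustomerName": ["Anna Kowalska", "Jan Nowak"],
                      "City": ["Poznan", "Warszawa"]})
src_b = pd.DataFrame({"name": ["anna kowalska", "Ewa Wisniewska"],
                      "city": ["Poznan", None]})

# Unify structures: map both sources to a common schema.
src_a = src_a.rename(columns={"CustomerName": "name", "City": "city"})
merged = pd.concat([src_a, src_b], ignore_index=True)

# Normalise values, drop incomplete records, de-duplicate.
merged["name"] = merged["name"].str.title()
merged = merged.dropna(subset=["city"]).drop_duplicates(subset=["name"])
print(merged)
```

Even this toy case hides judgment calls (is "anna kowalska" the same person as "Anna Kowalska"?), which is exactly why full automation remains an open problem.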

The process of data integration, pre-processing, and analysis should be supported by database technologies. This guarantees: (1) data sharing among teams of data scientists, (2) higher processing performance, thanks to the optimization mechanisms in the database itself, (3) data security, thanks to the archiving and disaster recovery mechanisms available in every database management system, and (4) data access authorisation. Unfortunately, data scientists often fail to take advantage of the available data engineering technologies, which reduces the efficiency of their work. We addressed this issue during the project by developing a new data processing architecture for Data Science.
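A small sketch of what such database support can mean in practice, using SQLite from Python's standard library (the table and file names are hypothetical): the cleaned data lands in a shared store that the engine can index, optimise, and recover.

```python
# Persist cleaned data in a database so teams can share it and the
# engine's optimiser, indexes, and recovery mechanisms can be used.
import sqlite3
import pandas as pd

cleaned = pd.DataFrame({"name": ["Anna Kowalska", "Jan Nowak"],
                        "city": ["Poznan", "Warszawa"]})

conn = sqlite3.connect("shared_data.db")       # shared, recoverable store
cleaned.to_sql("customers", conn, if_exists="replace", index=False)

# An index lets the database's own optimiser speed up analysts' queries.
conn.execute("CREATE INDEX IF NOT EXISTS idx_city ON customers(city)")
for row in conn.execute("SELECT name FROM customers WHERE city = ?",
                        ("Poznan",)):
    print(row)
conn.close()
```

Access authorisation is not something SQLite itself provides; in a client-server DBMS it would be handled with GRANT-style privileges.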

The second issue in the field of data processing is the efficiency of the data integration and analysis processes. Despite more than 30 years of work on optimizing data integration processes, the issue still remains unsolved. Ensuring the efficiency of these processes is key in both traditional (data warehouse) and big data (data lake) integration architectures. These processes must be executed within a strictly defined time window, of about 8 hours in typical applications. Otherwise, data will either not be available at all or will be too old. In this project, we proposed techniques that can reduce the run-time of data integration processes.


What was the impact of the NAWA scholarship on your development as a scientist?

The measurable scientific output includes seven conference and journal publications altogether, co-authored with eight scientists from abroad. The following two benefits of the scholarship are most important to me. First, closer cooperation with the ESSI team at UPC. Second, establishing topics for joint research with the ESSI team. If it were not for the pandemic, I would be carrying out this work right now during my next stay in Barcelona; the work will continue after the pandemic.

Moreover, an additional effect of my stay in Barcelona was improving my Spanish to level C1 (confirmed by an exam and a certificate from Universitat de Barcelona).


If you were to encourage other scientists to take part in the Bekker Programme, what is the greatest value of a NAWA scholarship?

In my opinion, NAWA grants, and the Bekker Programme in particular, are an excellent addition to the grants offered by the National Science Centre (NCN). The scholarship I received allowed me to complete a research project with an internationally recognised group of scientists and, most importantly, a group that carries out exactly the same kind of scientific research as I do. The application procedure is reasonable and fairly standard (similar to other national grants); it took me two weeks to prepare the application. A great value of NAWA grants under the Bekker Programme is the opportunity to have your stay abroad funded. The rates make it possible to live a normal life abroad (i.e. like the people I worked with). In addition, the grant settlement procedure is very clear and not overly complex. Completing the research according to schedule, supported by several good publications, is enough to settle the grant. Finally, the entire NAWA team is extremely helpful and friendly. Any questions or concerns I had were answered immediately.

Thank you for your time.



Robert Wrembel is an associate professor at the Faculty of Computing and Telecommunications, Poznan University of Technology. He served as deputy dean for the 2008-2012 and 2012-2016 terms. He has been involved in 8 research and 8 industrial projects in the area of data processing technologies, was a consultant at the software house Rodan Systems (2002-2003), and a lecturer at Oracle Poland (1998-2005). He was a visiting scholar at: Universitat Politècnica de Catalunya (Spain), Université Lyon 2 (France), Targit (USA), Universidad de Costa Rica (Costa Rica), Klagenfurt University (Austria), and Loyola University (USA). He is a graduate of a two-month innovation and entrepreneurship programme at Stanford University. He received the IBM Shared University Research Award (2019), the Service Award from the International Federation for Information Processing - IFIP (2019), and the IBM Faculty Award (2010). He is a country representative in the IFIP Technical Committee - Software: Theory and Practice, and chair of the IFIP Working Group - Database. He is a Steering Committee member of: (1) the European Conference on Advances in Databases and Information Systems (ADBIS) and (2) the International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), and a regular editorial board member of the Data & Knowledge Engineering journal. Research interests: data integration, data warehouses and data lakes, big data, multiversion databases.

How to receive a NAWA scholarship under the Bekker Programme? 

Detailed information about the NAWA programme can be found HERE
