Big Data Lineage – from legacy to the latest
July 1, 2018

Companies are taking the data they have collected over many years and moving it into data lakes. There it is analyzed using big data platforms from companies like MapR, Cloudera and Hortonworks. The underlying technologies used to ingest this data and run analytics include Sqoop, Oozie, Falcon, Hive and Spark (with jobs written in Java, Python or Scala).

As processing power has grown and the cost of storage has declined, there has been an explosion of derived data. This accentuates the standard governance challenges: understanding what is going on inside the big data systems, and understanding where the information in the data lakes is being pulled from. Without this insight, the results of the analytics are questionable at best. Another major concern for the CIOs and CROs of these companies is the risk exposure caused by this explosion of information.

We recently ingested information for a global telecommunications organization. They had thousands of Java programs, Python scripts and ETL jobs processing information. The complexity was such that there were Python scripts calling functions in Java classes. It is critical that users in the enterprise share information on whether the Java functions called by those Python scripts are trustworthy. The more visibility users have into what is being used and how trustworthy it is, the higher the chances that the results will be trusted. The success of agile programming is a clear sign that companies embrace peer review to get better value for their money.

None of this is possible without horizontal lineage, i.e. inter-system lineage that connects these different technologies together in a scalable and automated manner.
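To make the idea of horizontal lineage concrete, here is a minimal Python sketch of a lineage graph that spans several technologies. It is an illustration only, not how any particular product works: the `LineageGraph` class, the `system:object` naming scheme, and every node name below (the Oracle table, the Sqoop job, the Python script, the Java function) are hypothetical and chosen to mirror the telecom scenario described above.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy cross-system lineage graph (hypothetical, for illustration)."""

    def __init__(self):
        self._downstream = defaultdict(set)  # node -> nodes it feeds
        self._upstream = defaultdict(set)    # node -> nodes it depends on

    def add_edge(self, source, target):
        """Record that `source` feeds into (or is called by) `target`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal returning every node reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # a cycle should not make a node its own ancestor
        return seen

    def upstream(self, node):
        """Everything `node` was derived from, across all systems."""
        return self._walk(node, self._upstream)

    def impact(self, node):
        """Everything that would be affected if `node` changed."""
        return self._walk(node, self._downstream)


# Hypothetical pipeline: Oracle -> Sqoop -> Hive -> Python (calling Java) -> Hive
graph = LineageGraph()
graph.add_edge("oracle:crm.customers", "sqoop:import_customers")
graph.add_edge("sqoop:import_customers", "hive:raw.customers")
graph.add_edge("hive:raw.customers", "python:score_customers.py")
graph.add_edge("java:com.example.Scoring.score", "python:score_customers.py")
graph.add_edge("python:score_customers.py", "hive:analytics.customer_scores")

sources = graph.upstream("hive:analytics.customer_scores")
affected = graph.impact("java:com.example.Scoring.score")
print(sorted(sources))
print(sorted(affected))
```

In this sketch, asking for the upstream of the final Hive table surfaces the Java function the Python script depends on, and asking for the impact of that Java function surfaces the script and the table built from it. In practice the edges would have to be harvested automatically from each system rather than entered by hand, which is the hard part.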

Seeing is Believing

Let us show you how you can find the right information and its impact in minutes!