Our Sales and Engineering teams frequently share their expertise online via articles and technical pieces about the industry and the new trends in technologies. Here is an interesting article about the very nature of Data.
Is Data an introvert or an extrovert?
Was the angel only testing the woodcutter’s morals?
And does the human brain store more data, especially after marriage?
A Date with Data
By Head of International Projects Omer bin Abdul Aziz
..Of Data lakes and warehouses
If you are a Star Trek fan, you know how important Data is! If you are not, suffice it to say that the survival of the automation engineer species is linked with developing an awareness of data: where to find it, how to carry out a dialogue with it (make it speak, express itself), and how to encourage one set of data to have a conversation with another set of data. Help them find a spark for each other! If we cannot learn how to do it, our species is bound to go extinct. Trust me, it won’t need an asteroid hit!
Data is, by its very definition, introverted in nature. You have to master the art of communication if you want to get anywhere close to having a date with data! It is not possible to give a master class on data in a single blog, nor am I the master to do that. I am sure universities around the world are right now creating curricula on data at the bachelor’s and master’s level. Something for the next generation to watch out for! However, we can start ‘bit by bit’ .. and first develop an understanding of where to find data, and where it exists in its natural habitat.
One place where you can find data is in lakes. Data Lakes. Lakes are reservoirs of data. Sometimes formed naturally .. and other times created for the very purpose. Data stored in lakes does not have to follow a strict format for storage and retrieval. This is something we know from early childhood. Remember the story of the woodcutter and his ax? The angel had to dive many times into the lake to find his rightfully owned ax. If angels have problems, we are mere mortals! (And we were told the angel was testing the woodcutter’s morals .. how naïve we were!)
In a way, data can just be thrown into the lake as we get it, without any worries about its form. It is only when we have to read the data that we define the rules on what to extract from the lake. This is called “Schema on Read”. So there are no gate-keeping rules in place for incomers – anyone can crash the party!
You can have weblogs, JSON, XML, streaming data, data from ERP, MES – all just thrown into the data lake. There is usually no limit on file size. You can store as big a file as you want.
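To make “Schema on Read” a little more tangible, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the `lake` list stands in for a real data lake, and the `meter_id` and `reading` fields are hypothetical. The point is only that nothing is checked when data goes in; the reader decides what to fish out.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical "lake": raw payloads stored exactly as they arrived,
# in whatever format the sender happened to use.
lake = [
    '{"meter_id": "M-17", "reading": 42.5}',
    '<event><meter_id>M-18</meter_id><reading>13.0</reading></event>',
    '10.0.0.7 - - "GET /status HTTP/1.1" 200',
]


def read_meter_readings(raw_items):
    """Apply a schema at read time: keep only (meter_id, reading) pairs."""
    rows = []
    for raw in raw_items:
        record = None
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            try:
                root = ET.fromstring(raw)
                record = {child.tag: child.text for child in root}
            except ET.ParseError:
                continue  # neither JSON nor XML: this reader ignores it
        if isinstance(record, dict) and "meter_id" in record and "reading" in record:
            rows.append((record["meter_id"], float(record["reading"])))
    return rows


readings = read_meter_readings(lake)
print(readings)
```

The weblog line stays in the lake untouched; a different reader with a different schema could pick it up later.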
Another place where you can find data is a warehouse. It’s called a Data Warehouse (DW), or Enterprise Data Warehouse (EDW). Warehouses are always created for the specific purpose of storing data. Therefore, they are very well organized. Huge volumes can be stored and retrieved with ease. When it comes to storing data in the warehouse, we need to establish the protocol in advance. This is to say that we define “Schema on Write”. For example, if you want to write to a database, you need to know the structure of the database and what type of data it will accept. If it is expecting an “8 digit numeric input” and you send it a string of letters, it just won’t accept it.
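Here is a small sketch of “Schema on Write”, using Python’s built-in `sqlite3` module as a stand-in for a real warehouse. The `customers` table and `account_no` column are hypothetical, and I am assuming “8 digit numeric” means an integer between 10000000 and 99999999. The rule is declared before any data arrives, and the database enforces it at the door.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        account_no INTEGER NOT NULL
            CHECK (typeof(account_no) = 'integer'
                   AND account_no BETWEEN 10000000 AND 99999999)
    )
    """
)


def try_insert(value):
    """Return True if the warehouse accepts the value, False if it refuses."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO customers (account_no) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(12345678)    # fits the schema
rejected = try_insert("ABCDEFGH")  # a string of letters: refused on write
print(accepted, rejected)
```

Unlike the lake, the bad record never gets in, which is exactly why later queries can trust what they find.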
Why have lakes .. and warehouses?
Now if we can store data so neatly in warehouses, what do we need the lakes for? It’s because data stored in a warehouse has to conform to the warehouse’s storage requirements. It should be possible to re-form it so that it can be stacked, stored, and retrieved with ease. If the data that you have cannot follow these somewhat stringent requirements, then it goes to the lake, where you can just throw it in the water for later retrieval.
When you first define the form of data and then store it, you also limit the extent of analytics and machine learning you can do on that data. So one reason for keeping data in lakes is to retain the ability to carry out in-depth machine learning.
Then why do you have warehouses? Well, in a warehouse, querying data is much faster. So it is the best way to store data for regular reporting and dashboards.
For example, let’s say that a utility company wants to improve the customer experience and reduce the time it takes to respond to customer complaints. A conventional data warehouse will provide top-notch analytics and reports on where most complaints are being generated, where complaints are expected in a particular season based on history, and what the turnaround time of complaints is.
Is this enough in this age of social media? Do customers register a complaint on the utility’s helpline first, or post their dismay at the breakdown on Twitter before that? Chances are that before someone registers the complaint by phone, the hashtags related to the issue will already have gone viral on social media. So maybe we need more insight into the data than just a real-time report of registered complaints. Data lakes can come to the rescue. You can throw newsfeeds from Twitter into the data lake and write code that keeps analyzing them for any service breakdown issues. Once your code picks up something important, it can create a customer complaint on its own!
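As a rough illustration of that last idea, here is a toy Python sketch. It does not talk to Twitter at all: the `posts` list, the keyword list, and the `Complaint` record are all hypothetical, and a real system would need far smarter text analysis than simple keyword matching.

```python
from dataclasses import dataclass

# Hypothetical phrases that hint at a service breakdown.
OUTAGE_KEYWORDS = ("power cut", "outage", "no electricity", "blackout")


@dataclass
class Complaint:
    source: str
    area: str
    text: str


# Made-up posts standing in for a newsfeed that was thrown into the lake.
posts = [
    {"user": "@sara", "area": "North", "text": "Blackout again in our street!"},
    {"user": "@ali", "area": "East", "text": "Lovely weather today"},
    {"user": "@zed", "area": "North", "text": "No electricity since 6 pm"},
]


def complaints_from_posts(feed):
    """Raise a complaint for every post that mentions an outage keyword."""
    complaints = []
    for post in feed:
        text = post["text"].lower()
        if any(keyword in text for keyword in OUTAGE_KEYWORDS):
            complaints.append(Complaint(post["user"], post["area"], post["text"]))
    return complaints


complaints = complaints_from_posts(posts)
for complaint in complaints:
    print(complaint)
```

Two of the three posts end up as complaints, both from the same area, which is already a hint about where to send the repair crew.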
And now .. some examples.
Enough of the definitions – let’s get some real-life examples of each now. First the lakes. I would argue that the most advanced and exciting naturally found data lake is the human brain. You can argue it is the dolphin or sperm whale brain instead, but then, I would argue that you are not married yet!
An example of an artificially created data lake is Hadoop. To write data to Hadoop, you do not need to define a schema. You can simply throw the data in. It breaks down your data and stores multiple copies of it on different servers. When you query this data, you define the schema you want in the code that does the querying.
A good example of a data warehouse would be SAP HANA. It supports up to a petabyte of data in memory with a query time of less than a second!
If you are now all excited about the topic, you can learn more about various data lakes and warehouses through YouTube videos. I will save my breath and let you explore it on your own!
Data does not just fall into the lakes and warehouses out of the sky. Someone (and yes .. that means us resourceful engineers) has to extract that data. We have to go on quests no less daunting and perilous than those of Apollo and Hercules to get our hands on the precious bytes (how is that for a restaurant name!). But this is a story for another time.
Do you know of any fun facts about data? Do share on our Social Media Pages! Humor me .. or should we say .. data me!