Welcome to the first post of the Learn Apache Spark platform.
This post will walk you through where big data came from and why it exists.
But, first things first: what is Not Big Data?
Basically, anything that fits on one computer/machine. How did we work with data before Big Data? Relational databases (RDBs) were extremely popular, and are still heavily in use.
What are Relational DBs?
" A relational database is a digital database based on the relational model of data. It is a set of formally described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables." - Wikipedia.
Examples of RDBs are MySQL and SQL Server.
In an RDB, the data is modeled using relations. The goal is to have as little data duplication as possible, at the price of creating many tables and reference links. This is data normalization. Data normalization defines a set of rules for how to structure the data.
The rules are the normal forms: First Normal Form (1NF), 2NF, 3NF, 4NF and BCNF (Boyce-Codd Normal Form). To learn more about data normalization, click here.
Although SQL is a well-structured query language, it's not the only tool to handle data today. We will go over why, but first, let's hold an example in mind:
We are building a social media site, like Twitter. Our site holds data that translates into many relations (tables). Some of our main screens are built from the following questions:
-> How many tweets were liked by a specific person?
-> How can I see my personal Twitter feed?
These questions are the main building blocks of our business requirements: our users would like to have access to tweets from their friends or the people they follow. They want to see the count of likes, count of shares and comments. How can we design an SQL query to get that data? We can take the list of people each user follows, do a join operation on recent tweets, a join operation on likes, another join operation on retweets, a join operation on comments and so on.
- join - a SQL operator that combines columns from two or more tables in an RDB according to a specific key.
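To make the join chain concrete, here is a minimal sketch using SQLite from Python. The schema and all table/column names are invented for illustration; a real Twitter-scale schema would be far larger:

```python
import sqlite3

# In-memory database with a hypothetical normalized Twitter-like schema:
# separate tables for users, follows, tweets and likes, linked by keys.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE tweets  (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT);
    CREATE TABLE likes   (user_id INTEGER, tweet_id INTEGER);
""")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])
db.execute("INSERT INTO follows VALUES (1, 2)")           # ada follows lin
db.execute("INSERT INTO tweets VALUES (10, 2, 'hello')")  # lin tweets
db.execute("INSERT INTO likes VALUES (1, 10)")            # ada likes it

# Building ada's feed requires chaining joins: follows -> tweets -> likes.
feed = db.execute("""
    SELECT t.body, COUNT(l.user_id) AS like_count
    FROM follows f
    JOIN tweets t ON t.author_id = f.followee_id
    LEFT JOIN likes l ON l.tweet_id = t.id
    WHERE f.follower_id = 1
    GROUP BY t.id
""").fetchall()
print(feed)  # [('hello', 1)]
```

Notice that even this toy feed already needs two joins and an aggregation; each extra feature (retweets, comments) adds another join.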
This is only one option on how we can query the data to answer the main business question.
Now, to a more focused business question:
-> How many likes did one specific user generate?
How do we answer this question? We can leverage SQL and write a query for that as well. To sum it up, we could develop a whole social media platform using SQL.
But, wait, we didn't talk about scale and the size of the queried data. According to analysts, we are facing hundreds of millions of tweets a day, meaning the data won't fit on one machine. On top of that, we now face a new requirement: provide results super fast. What does super fast mean? A matter of a few milliseconds. Whoops. Now a query with 5-10 joins that takes more than 1, 2 or 5 seconds to finish is not good enough (according to today's business needs).
Now that we understand that an RDB isn't always the answer, let's look at what Big Data is.
So, what is Big Data?
It's a large set of data that can't fit into one machine.
Two main challenges:
1. More data to process (than can fit on one machine)
2. Faster answer/query response requirements
One technique for working with large data sets is NoSQL, where in most scenarios there are no relations at all. This approach can help us get results faster, as well as store more data. NoSQL is one approach to solving the challenges of today's data-driven systems.
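To see what "no relations" means in practice, here is a hypothetical document-style record for a single tweet. Instead of spreading the data across users, tweets and likes tables, everything a feed screen needs is embedded in one record (the field names are made up for illustration):

```python
# Hypothetical document-store representation of a tweet: everything the
# feed screen needs is embedded in one record, so a read is a single
# key lookup instead of a chain of joins.
tweet_doc = {
    "id": 10,
    "author": "lin",
    "body": "hello",
    "like_count": 1,          # precomputed at write time
    "recent_comments": [
        {"user": "ada", "text": "nice!"},
    ],
}

# Rendering the feed item is one dictionary access, no joins required.
print(tweet_doc["body"], tweet_doc["like_count"])  # hello 1
```

The trade-off is duplication: the author's name now lives in every tweet document, so renaming a user means touching many records.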
How do we work with data under Big Data constraints? We first think about how we are going to use the data. Which business questions would we like to answer? How will we store and later query the data? There is a price for working with Big Data and being fast; one of the costs might be giving up query flexibility.
We should know in advance which queries we need to serve to answer the business question. We prepare for those queries in advance, duplicating the data in various ways to gain faster reads at the price of slower writes. This process is called denormalization.
" Denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data in advance." -Wikipedia.
That brings us to a new challenge - data design challenge:
How to organize the data so it will be easy to query it fast?
Let's go back to our example: how many tweets are liked by a specific user? To solve that, we will keep a counter and update it on every 'like' as part of the write operation. That saves us the cost of calculating the count at read time.
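The counter-on-write idea fits in a few lines. This is a single-process sketch (the function names are invented; a real system would persist the counter in a store and make the update atomic):

```python
# Denormalization sketch: keep a per-author like counter that is updated
# as part of every 'like' write, so reads become O(1) lookups.
like_counts = {}  # author_id -> number of likes their tweets received

def record_like(author_id: int) -> None:
    """Write path: besides storing the like event, bump the counter."""
    like_counts[author_id] = like_counts.get(author_id, 0) + 1

def likes_for(author_id: int) -> int:
    """Read path: no aggregation at query time, just a lookup."""
    return like_counts.get(author_id, 0)

for _ in range(3):
    record_like(author_id=2)
print(likes_for(2))  # 3
```

Writes now do a little more work (two updates instead of one), which is exactly the read-for-write trade denormalization makes.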
But, once we go with that approach, we need to build a data version for every specific question/business case. We become fast but lose flexibility. Another approach is to write a row for every 'like' operation and calculate the count once every 5 minutes, keeping the result '5 minutes fresh'. That brings us to new storage solutions: NoSQL, document DBs, graph DBs and more.
All the above solutions need to support a heavy query load at high scale without getting stuck in transactions with read/write locks.
Read/write locks are synchronization mechanisms that solve the readers-writers problem: multiple threads can read the data in parallel, but only the thread holding the exclusive lock can update it.
This is why read/write locks have the potential of getting us stuck, locking part of the data for an unknown time or until a TTL expires (Time To Live - also used as the maximum time a resource will stay locked).
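For intuition, here is a minimal readers-writer lock sketched with Python's `threading.Condition`. It is a toy (reader-preferring, no starvation handling, no TTL), just enough to show "many readers in parallel, one exclusive writer":

```python
import threading

class ReadWriteLock:
    """Toy readers-writer lock: many concurrent readers, writers exclusive.
    Writer starvation and timeouts are deliberately not handled."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:          # briefly take the mutex to count ourselves in
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake a waiting writer

    def acquire_write(self):
        self._cond.acquire()              # hold the mutex for the whole write
        while self._readers > 0:
            self._cond.wait()             # block until all readers are done

    def release_write(self):
        self._cond.release()

lock = ReadWriteLock()
data = {"likes": 0}
results = []

def writer():
    lock.acquire_write()
    data["likes"] += 1                    # exclusive: no reader sees a torn update
    lock.release_write()

def reader():
    lock.acquire_read()
    results.append(data["likes"])         # readers may run in parallel
    lock.release_read()

threads = [threading.Thread(target=writer) for _ in range(5)]
threads += [threading.Thread(target=reader) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(data["likes"])  # 5
```

Note how `acquire_write` simply waits until the reader count drops to zero; that open-ended wait is exactly the "stuck for an unknown time" risk the text describes.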
This concludes the essence of the Big Data challenges, which brings us to Big Data frameworks. Today, many companies try to store the data in a less processed manner - raw data - without losing or filtering anything. Of course, while addressing information privacy laws such as GDPR, HIPAA and more, according to requirements.
This is why we want to leverage various frameworks to process data at scale, fast. Let's look at an approach built on the idea that we can capture snapshots of the data and read them when necessary.
Building blocks for designing software architecture to support scale:
1. Data Lake
2. Data Warehouse
3. Processing Layer
4. UI / API
A data lake is where we save all the data in raw form, without processing it. We save it that way so that later, when new business cases or products arise, the data will be relatively available and in raw format. This storage should be cheap, since we would like to store everything we can (with data compliance in mind). The cloud allows us to do exactly that: on Azure we can store everything in Blob Storage, on AWS we can use S3, using them as an endless file system. The data we store in the data lake will probably have no meaning/purpose at the start; we store it only so we don't lose data, or a business opportunity revolving around it, later.
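As a sketch of what "store everything raw" can look like, here is a hypothetical data-lake layout in plain Python. A local directory stands in for a blob container or S3 bucket, and events are appended as-is under date-partitioned paths (a common, but not the only, convention):

```python
import json
import tempfile
from pathlib import Path

# A temp directory stands in for a blob container / S3 bucket.
lake_root = Path(tempfile.mkdtemp())

def store_raw_event(event: dict, day: str) -> Path:
    """Append one raw, unprocessed event under raw/tweets/<day>/."""
    partition = lake_root / "raw" / "tweets" / day
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{event['id']}.json"
    path.write_text(json.dumps(event))  # no transformation, no filtering
    return path

p = store_raw_event({"id": 10, "author": "lin", "body": "hello"}, day="2020-01-01")
print(p.exists())  # True
```

The key property is that writes are dumb and cheap; all interpretation of the data is deferred until a business case actually needs it.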
Data Warehouse (DW)
A data warehouse holds more processed data, often with a specific structure for fast querying or fast processing. The DW is heavily used by data scientists, business analysts and developers. It has indexes for fast access, and the data is saved in a unique/optimized manner. Think of it like a library with a catalog, where you don't need to go over all the books to find the one you want.
Up until now, we discussed the data lake and the data warehouse. Data that reaches the DW already has a purpose and is used for answering business questions. Meaning, the data will have at least two copies: one in raw format, and one or more processed copies saved in the DW.
When working with data, we want tools that help us analyze it, systematically extract information from it, create machine learning models and more. This is the responsibility of the processing layer, and we do it in a distributed manner.
There is a whole ecosystem for distributed processing; one of the best-known frameworks is Apache Spark. It ships with built-in modules for batch, streaming, SQL, machine learning and graph processing.
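To give a feel for what the processing layer does, here is the like-counting aggregation from earlier as a single-machine sketch in plain Python. An engine like Spark runs this same map-and-aggregate pattern, but partitioned across many machines (the event shape here is invented for illustration):

```python
from collections import Counter

# Toy "processing layer" job: aggregate likes per author over raw like
# events. On one machine this is a loop; a distributed engine such as
# Apache Spark applies the same map/aggregate pattern across a cluster.
like_events = [
    {"user": "ada", "tweet_author": "lin"},
    {"user": "bob", "tweet_author": "lin"},
    {"user": "ada", "tweet_author": "bob"},
]

likes_per_author = Counter(e["tweet_author"] for e in like_events)
print(likes_per_author.most_common())  # [('lin', 2), ('bob', 1)]
```

The reason this parallelizes well is that per-author counts from different partitions can simply be summed, so no single machine ever needs to see all the events.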
UI / API
Many times we will want to build an interface to serve the business. It can be a UI (User Interface) or an API (Application Programming Interface) that serves our clients. Our clients usually want insights into the data, for example in the shape of a Twitter feed. This is the last layer of our basic generic architecture, delivering the outcome to our clients.
💡 Summary 💡
Let's wrap it up. These building blocks open a whole new world of workload management, various file formats (JSON, binary, Parquet, TDMS, MDF4 and more), log systems, new architectures specifically designed to handle scale, and more.
Today there are many ways to describe big data. But at the end of the day, we need to know and understand the tools in order to optimize our work, and to design and implement the solution that best fits our business requirements.
On this site we will publish content - blog posts and videos - to help you learn and succeed at developing big data solutions, leveraging Apache Spark and the big data ecosystem.