A Simple Big Data Pipeline with MySQL
- anydataflow
- Oct 19, 2019
- 1 min read
Updated: Jan 22, 2020
An efficiently implemented data pipeline increases the performance of the data warehouse and makes it easier to generate high-quality KPIs.
As you already know, big data is very important these days for analysing huge amounts of data in less time and with less effort. Here we discuss a simple use case you can see all around in real life: connecting an RDBMS to the Hadoop/Spark ecosystem and presenting an analytical dashboard in a BI tool.
Architecture Diagram

This architecture looks simple at first glance, but once you go deep into the implementation you need to take care of many things: connectors, code quality, data type conversion, serialization, Spark optimization, Sqoop optimization, Hive storage optimization, and more.
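The Sqoop step alone exposes several of these knobs. Below is a minimal sketch (not our production job) that drives a sqoop import from Python; the host, database, table, and column names are placeholders, and the flags shown are where connector choice, parallelism, and type conversion get decided.

import subprocess

# Placeholder connection details; substitute your own MySQL host, DB, and table.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.mysql-password",  # keep credentials off the command line
    "--table", "orders",
    "--split-by", "order_id",              # key Sqoop uses to parallelise the import
    "--num-mappers", "4",                  # Sqoop optimization: degree of parallelism
    "--as-parquetfile",                    # columnar files that Spark and Hive both read well
    "--map-column-java", "amount=Double",  # explicit RDBMS-to-Java type conversion
    "--target-dir", "/data/staging/orders",
]

subprocess.run(sqoop_import, check=True)   # fail the pipeline if the import fails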
Components Used
RDBMS
Hadoop cluster
Spark
Sqoop
Hive
Business Intelligence Tool
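To show how the Spark and Hive pieces fit together, here is a minimal PySpark sketch (the table, column, and path names are made up for illustration): it reads the data Sqoop landed, computes a small KPI, and writes it as a partitioned ORC Hive table so downstream dashboard queries scan only the partitions they need.

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("orders-daily-kpi")
         .enableHiveSupport()        # lets Spark read and write Hive tables
         .getOrCreate())

# Read what the Sqoop import landed on HDFS (Parquet in this sketch).
orders = spark.read.parquet("/data/staging/orders")

# Example KPI: daily revenue and distinct buyers per country.
daily_kpi = (orders
             .groupBy("order_date", "country")
             .agg(F.sum("amount").alias("revenue"),
                  F.countDistinct("customer_id").alias("buyers")))

# Hive storage optimization: columnar ORC, partitioned by date,
# so BI queries prune partitions instead of scanning everything.
(daily_kpi.write
          .mode("overwrite")
          .format("orc")
          .partitionBy("order_date")
          .saveAsTable("analytics.daily_kpi"))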
The BI layer can be any tool that connects to Hive/Spark over a JDBC/Thrift connection. You can also attach an MPP engine on top of Hive to answer queries in milliseconds; we used PrestoDB and got queries roughly 10x faster than Spark.
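As a concrete illustration of that connection path, the snippet below runs the same query over Thrift twice, once through HiveServer2 and once through a Presto coordinator, using the PyHive client; the hosts, ports, and table name are assumptions carried over from the sketch above.

from pyhive import hive, presto

QUERY = "SELECT country, revenue FROM analytics.daily_kpi WHERE order_date = '2019-10-01'"

# Thrift connection to HiveServer2 (default port 10000).
hive_conn = hive.connect(host="hive-server", port=10000, username="bi_user")
hive_cur = hive_conn.cursor()
hive_cur.execute(QUERY)
print(hive_cur.fetchall())

# Same query through the Presto coordinator (default port 8080), reading
# the same Hive table; this is the path that gives the fast responses.
presto_conn = presto.connect(host="presto-coordinator", port=8080,
                             username="bi_user", catalog="hive")
presto_cur = presto_conn.cursor()
presto_cur.execute(QUERY)
print(presto_cur.fetchall())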
Thanks for visiting...