Bob is an enthusiastic data engineer. He has been hired by a company specializing in data analysis to build modern, robust data pipelines. His first task is to design a data pipeline that reads raw data, normalizes it, and then computes some aggregations on it.
Data files will be received in an immutable, persistent, read-only staging area.
A first Spark job will normalize the data from the source, apply technical controls to it, and then store the results in a columnar storage format (e.g., Parquet).
A second Spark job will read the normalized data, perform aggregations on it, and expose the results to end users.
The problems are stacking up, and Bob has no idea how to solve them. Each time he gets close to a solution, he runs into the same obstacles: a vicious cycle that could cost Bob his success. Sadly, Bob is now at a dead end, and his superior's schedule and expectations are putting pressure on him.