
“Fundamentals of Data Engineering” book review
There was a time, a couple of years ago, when big data was the most hyped term. It was like teenage sex – everyone was talking about it, nobody really knew how to do it, everyone thought everyone else was doing it, so everyone claimed they were doing it. What is left of that today?
Shift in the naming
The above quote is taken from Dan Ariely’s Facebook post that then went viral.
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone… http://t.co/tREI1mRQ
— Dan Ariely (@danariely) January 6, 2013
And there’s so much truth to it. I remember going through a Udemy course back then that covered big data and Hadoop and Spark solutions. It seemed that even the smallest companies should have their own cluster and perform all kinds of God-knows-what analytics and computations. As with every hype and new idea, the overall outcome was predictable. The tools and patterns emerged, the technology matured, and big data became a regular part of the software engineering process. What is more – most use cases do not even qualify as “big” when it comes to size. Just read this article to get some actual numbers. So what’s left? Ladies and gentlemen, here’s data engineering.
Wait. Only data engineering?
Yeah, I know, I just wanted some drama. Obviously there’s more to the disciplines that grew out of big data than just data engineering – we have a whole range of data-related areas, like analytics, ML, BI and, to put a cherry on top, AI. However, it is data engineering that lies at the bottom of the whole stack: the actual ways of getting the data, transforming it and loading it into the target systems. It is an ever-changing landscape, with lots of technologies, libraries and platforms, and territory completely unknown to me. Therefore, for my holidays, I chose to read “Fundamentals of Data Engineering”, a well-praised book by Joe Reis and Matt Housley.
So what are we talking about?
The book’s aim (as the authors clearly state right at the beginning) is not to provide very detailed information about data engineering techniques. Instead, they try to paint a very broad picture of the data-related field, especially when it comes to the – well – fundamentals of it. The book is divided into three separate parts. The first one presents a short history of what we call data engineering today, with an emphasis on the “big data burst” of the mid-2010s, followed by a depiction of the architectural approaches used in the area today. This first part lays the foundation for the most important part of the book – an in-depth description of the data lifecycle.
This lifecycle revolves around the most important concepts in data engineering – extracting, transforming and loading/storing the data. Several additional areas enrich every step listed above: security, programming, the people involved in the process, and so on. The authors are seasoned practitioners who know the material they’re writing about inside out. We get detailed descriptions of data generation, storage, ingestion, modelling and serving. I admit – after reading the book I felt that I knew what this whole data engineering thing is and what it is used for.
The last part of the book is quite short. It concentrates on security and identity, and closes with a final chapter in which the authors try to predict the future of data engineering in general.
Is there anything missing?
I am not qualified enough to answer that question. From my limited perspective, I feel like the whole topic of data engineering was presented fully. What is more – it was presented in the right way, concentrating on the core concepts, which will still be relevant in the years to come. However, I think that the book could be made better. I understand the authors’ idea not to go too deep into the specific technologies used today. The landscape is changing very fast, and that part of the book could become out of date quite soon. Despite that, I think that adding one separate chapter presenting the typical technology stack used today would be very beneficial for the reader.
Throughout the book we see numerous libraries, frameworks and off-the-shelf solutions. Unfortunately, they’re never cataloged, nor are they presented in the context of a typical data pipeline from its very beginning to its very end. In my opinion, adding that kind of solutions walkthrough could make the book even better.
Should I read “Fundamentals of Data Engineering”?
If you are interested in getting some knowledge about a vast area of the current IT world (which data engineering is) – then it’s a big YES. However, be prepared for a view from above only, without any nitty-gritty technological details. If the only thing you’re interested in is the specific technologies used in data engineering, I think you should look for more specialized sources of knowledge.