Essential resources for data engineers
This is a curated recommended read and watch list for scalable data processing. It is primarily aimed towards software architects and developers, but there is material also for people in leadership position as well as data scientists, in particular in the first section. The content has been chosen with a bias towards material that conveys a good understanding of the field as a whole and is relevant for building practical applications.
If you wish to discuss the contents or report any broken links, please do so on the announcement post on Google+ or via email to email@example.com. Also feel free to send any material you really think should be in here.
The big picture
End to end
Architecture and patterns
Business perspective, leadership
Head to tail
Applications in the emerging world of stream processing (slides). Good summary of how to build stream processing applications.
Stream processing, event sourcing, reactive, CEP… and making sense of it all This blog post is also part of Martin Kleppman’s free book Making sense out of stream processing, which also encompasses his two entries in the Data collection section.
Dataflow: A unified model for batch and streaming data processing. Google still dominates data processing technology, and looking at them is usually a glimpse into the future for open source technology. Apache Beam is still a bit young, but the semantics described in the presentation are likely to be the next architectural step, as an alternative to the lambda and kappa architectural patterns.
Data product serving, NoSQL
Cassandra data modeling best practices. This is one of the first documents on data modelling for Cassandra. It predates the Cassandra Query Language (CQL), but in order to use CQL efficiently, it is necessary to understand and adapt data models to the underlying structures.
Components and comparisons
Real-time stream processing at InMobi (Storm & Spark Streaming comparison)
Data science, machine learning
Producticity, test, quality, monitoring
Goods: organizing Google’s datasets. Google has higher amibitions on dataset structure than most companies need. The paper, however, gives insight into the kind of entropy that creeps into data processing systems, and examples of structure and tools necessary to keep the chaos under control.
Schema & semantics
Why is there a section here on Scala? Because Scala is rising as the preferred language for scalable data processing. The primary reason for this is not technical, but cultural; successful data-driven products rise out of a collaboration between data scientists and software engineers. The day to day activities of the former group involve model tinkering and experimentation, and the rituals and boilerplate involved in backend languages such as Java prohibits quick experimentation. The latter group, however, is concerned with operational stability, and languages frequently used for experimental purposes, such as Python and R, tend to be perceived as insufficiently rigid and lacking ecosystem support for quality assurance and operations.
Scala is the middle ground where these two worlds meet. It is succinct and expressive enough for experimental purposes, but also statically typed and standing on the JVM platform, providing the quality and operations ecosystems. The lion share of innovation in data processing is therefore expressed in Scala, and it is a matter of time before most data-driven companies adopt it. It is possible to stick to Java or Python for data-driven products, but such decisions come at the cost of deselecting to utilise most of the innovation that happens in the data processing open source world.
Scala is powerful and well suited for data processing, but it comes with risks; it comes with ammunition to shoot yourself in the foot, and an opinionated community that can be both elitistic and cryptic. It is wise to collect input and advice from several different types of sources when adopting Scala.
Transitioning to Scala. Pragmatic advice for teams adopting Scala.
Moving a team from Scala to Golang. This is a cautionary tale of Scala adoption gone wrong; it provides insight into cultural risks that have to be managed. These risks exist in all teams, but Scala provides soil for them to bloom.
Privacy by design. My bag of tricks and patterns for protecting users’ privacy and complying with GDPR in a big data environment.
Papers we love (on distributed systems)