Bringing value to clients
There are three distinct generations of data-centric web products: monoliths, service-oriented architectures (SOA) with data warehouses (DW), and unified log architectures. The last is also referred to as a “data lake” or “big data” architecture.
Monolithic architectures consist of a user interface, an application backend, and a database. The business logic is implemented in the backend and presented in the user interface, while persistent state is stored in a single database.
The value that can be extracted from data is restricted to aggregated information that can be queried from the current state of the database.
SOA / data warehouse
In order to scale out monolithic architectures to larger products and organisations, service-oriented architectures were adopted, subsequently evolving to micro-service architectures. Each service has dedicated database storage, distributing the application state. In order to combine information from services, extract-transform-load (ETL) jobs run on a daily basis, extracting data from the service databases and aggregating it in a data warehouse.
The data warehouse keeps a history of the aggregated state of the application, and value for analytical purposes can be extracted from the data warehouse. The value is limited to information available in the aggregations, however, and any changes to the data logging and aggregation process typically involve changes by multiple teams to different technical components, making data-driven product development impractical.
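The daily ETL flow described above can be sketched as follows. This is a minimal illustration, not a production design: the two in-memory SQLite databases stand in for hypothetical user and order services, and the table and column names (`users`, `orders`, `daily_revenue`) are assumptions made for the example.

```python
import sqlite3


def make_service_db(schema, insert_sql, rows):
    """Create an in-memory database standing in for one service's storage."""
    db = sqlite3.connect(":memory:")
    db.execute(schema)
    db.executemany(insert_sql, rows)
    return db


def run_daily_etl(users_db, orders_db, warehouse, day):
    """Extract from two service databases, aggregate, load into the warehouse."""
    # Extract: read the current state of each service database.
    country_of = dict(users_db.execute("SELECT user_id, country FROM users"))
    orders = orders_db.execute(
        "SELECT user_id, amount FROM orders WHERE day = ?", (day,)
    ).fetchall()

    # Transform: aggregate revenue per country for the given day.
    revenue = {}
    for user_id, amount in orders:
        country = country_of.get(user_id, "unknown")
        revenue[country] = revenue.get(country, 0) + amount

    # Load: only the aggregate reaches the warehouse, not the raw orders.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue "
        "(day TEXT, country TEXT, revenue INTEGER)"
    )
    warehouse.executemany(
        "INSERT INTO daily_revenue VALUES (?, ?, ?)",
        [(day, country, total) for country, total in sorted(revenue.items())],
    )
    warehouse.commit()


users_db = make_service_db(
    "CREATE TABLE users (user_id INTEGER, country TEXT)",
    "INSERT INTO users VALUES (?, ?)",
    [(1, "SE"), (2, "DE")],
)
orders_db = make_service_db(
    "CREATE TABLE orders (user_id INTEGER, amount INTEGER, day TEXT)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, "2024-01-01"), (2, 50, "2024-01-01"),
     (1, 25, "2024-01-01"), (2, 70, "2024-01-02")],
)
warehouse = sqlite3.connect(":memory:")
run_daily_etl(users_db, orders_db, warehouse, "2024-01-01")
rows = warehouse.execute(
    "SELECT day, country, revenue FROM daily_revenue ORDER BY country"
).fetchall()
```

Because the warehouse holds only `daily_revenue`, a later question such as revenue per user cannot be answered without changing both the ETL job and the warehouse schema, which is the limitation described above.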
Unified log architectures
As hardware costs dropped, and a scalable open source storage solution, Hadoop, was introduced, it became feasible to store the full user activity history of large web applications. In a unified log architecture, all human-generated actions are logged and stored in raw format in a data lake for a long time, privacy regulations permitting.
Value is extracted from data through the construction of pipelines: sequences of batch or streaming jobs that gradually refine data to produce data artifacts of value, e.g. analytics OLAP cubes or recommendation indexes.
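As a rough sketch of such a pipeline, the example below chains three small batch jobs that refine raw events into a simple "top items" index. The event fields (`user`, `action`, `item`) and the job names are hypothetical, and in practice each job would read and write datasets in the data lake rather than in-memory lists.

```python
from collections import Counter

# Raw events as they might be logged; the last one is malformed (no user).
RAW_EVENTS = [
    {"user": "u1", "action": "play", "item": "a"},
    {"user": "u2", "action": "play", "item": "a"},
    {"user": "u2", "action": "play", "item": "b"},
    {"user": "u3", "action": "pause", "item": "b"},
    {"user": None, "action": "play", "item": "c"},
]


def clean(events):
    """Job 1: drop events that lack a user or an item."""
    return [e for e in events if e.get("user") and e.get("item")]


def plays(events):
    """Job 2: keep only the events relevant to this use case."""
    return [e for e in events if e["action"] == "play"]


def top_items(events, n):
    """Job 3: build the data artifact, a list of the n most played items."""
    counts = Counter(e["item"] for e in events)
    return [item for item, _ in counts.most_common(n)]


def run_pipeline(raw_events, n=2):
    """Run the jobs in sequence, each refining the output of the previous one."""
    return top_items(plays(clean(raw_events)), n)


index = run_pipeline(RAW_EVENTS)
```

Each job takes a dataset and returns a new one, so intermediate results can be inspected or reused by other pipelines.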
The main advantages of a data lake over legacy data environments are:
It is possible to build features based on machine learning, which requires detailed event data for training. Machine learning innovation in open source components, e.g. Spark, and by cloud vendors, effectively assumes a data lake for model training.
The data lake is an open resource, and any team can build pipelines and data-driven features, without coordinating multiple teams in different parts of the organisation.
A carefully built architecture allows rapid iteration on product modification, data collection, and evaluation, enabling product innovation feedback loops of days rather than months.
The data lake philosophy of immutable datasets refined by purely functional data pipelines, as opposed to mutable state in databases, makes it easy to create efficient developer code-test-debug cycles, and also to recover from human or machine errors. This allows for more efficient development cycles than in database-centric environments, resulting in quicker product iterations.
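The last point, about immutable datasets and recovery from errors, can be illustrated with a small sketch. The dataset and both job versions are invented for the example: the raw data is an immutable tuple, the job is a pure function of its input, and a bug is repaired by fixing the function and recomputing the output from the untouched raw data.

```python
# Immutable raw dataset: (user, country) signup events. Never modified in place.
RAW_SIGNUPS = (
    ("u1", "SE"),
    ("u2", "se"),
    ("u3", "DE"),
)


def signups_per_country_buggy(events):
    """First version of the job: forgets to normalise country codes."""
    counts = {}
    for _user, country in events:
        counts[country] = counts.get(country, 0) + 1
    return tuple(sorted(counts.items()))


def signups_per_country(events):
    """Fixed version: still a pure function, output depends only on the input."""
    counts = {}
    for _user, country in events:
        key = country.upper()
        counts[key] = counts.get(key, 0) + 1
    return tuple(sorted(counts.items()))


# The buggy job produced a wrong derived dataset...
bad_output = signups_per_country_buggy(RAW_SIGNUPS)
# ...which is discarded and regenerated by re-running the fixed job
# on the same raw input.
good_output = signups_per_country(RAW_SIGNUPS)
```

The same property helps with testing: a job can be run on a small fixture such as `RAW_SIGNUPS` and its output compared with an expected dataset, with no database state to set up or tear down.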
How do I provide value to clients?
I help companies create end user value from data by adopting data processing technology and methodology from unified log / data lake ecosystems. I work with open source components, e.g. Hadoop, Spark, Cassandra, and Kafka, as well as their counterparts in Google Cloud Platform and Amazon Web Services.
My clients fall into five main categories, described below, along with a description of how I typically work with clients in each category. Each client is unique, and may require adaptations or different approaches to the recipes described, but they serve as suggestions and a starting point for discussions.
The first category consists of new-born companies with little or no data, who know that the data they collect will be a core part of their product, and want to start out with state-of-the-art components and workflows. These companies usually request assistance with a data processing architectural vision, as well as guidance on the practical steps to creating the first data-driven features.
The first step in a project with a startup company is a brief technical culture interview over email, in order to understand the company, the use cases, and the applicable constraints. I use this interview form as a template, with adaptations depending on the client.
The second step is typically an architecture proposal, with suggested components, and a description of how the components are integrated, and how pipelines can be built on the architecture to extract value from data. Architectures that I propose are normally based on the batch processing architecture described in this presentation, the stream processing architecture described in this presentation (video), or a combination of the two.
The proposal includes strategies for avoiding risks in important areas where failure to plan properly can have large subsequent costs, e.g. privacy regulation compliance.
From this point, I propose a plan of iterative steps to build a data-driven feature, going via skeleton and proof-of-concept to a production-quality feature. The implementation proposed is in accordance with the proposed architecture, but built in an agile manner, focusing on a single end-to-end use case at first. See this project proposal template document for more detail. The client can then choose to employ my services to get technical leadership for implementation and deployment, or to get architectural advice and implementation mentorship.
Adolescent data processing companies
The second client category consists of yesterday’s startups: companies that are operational with modern data processing technologies or methodologies to some degree, but have not realised their full potential.
When I engage with such companies, there are two main ways to start. One is to join as developer or architect on a product where a need for improvement has been identified. The other is a technical culture interview, as described above, followed by a workshop to identify parts of the data processing architecture where there is room for improvement. As a second step, I propose how to improve the client’s architecture, given the information learnt in the workshop or while working on the product.
I also alert the client to any data-related risks found in the process, and create a plan for how to address them.
Mature companies with legacy data processing
Most mature companies today use data warehouse architectures. I help companies that recognise the limitations of their existing technology and data organisation, and would like to gradually migrate to a data lake architecture. The driving force is usually a desire to build machine learning features, or to enable data-driven product development.
The approach with a mature company client is naturally highly dependent on the company’s organisation and culture. My engagement usually starts with technical education, describing how data-driven innovation companies build data processing architectures, how to transition to such architectures, how to organise teams for data-centric products, etc. After the technical introduction phase, I help to identify suitable candidates for proof-of-concept implementations, and provide implementation guidance through advice or hands-on work. Such demonstrators are typically features where the existing technical data structure is insufficient, e.g. machine learning use cases.
Mature companies with modern data processing
I also help companies that are both mature and working near state-of-the-art in terms of data processing. Such companies are already operating Hadoop or Spark clusters, running production data pipelines serving data to user features. These companies tend to know well what they want to achieve and improve, but may be short of senior engineers to do the work. I join teams as architect or lead developer to implement new data pipeline features, or to improve team workflows and processes.
I also provide advice to organisations that are aware that a technical revolution is taking place, and seek information on how they can benefit from technical advancement in data analytics or artificial intelligence. These engagements are typically short. I occasionally provide such advice for free, if the time spent is limited and it is for a good cause.