By Yaron Haviv, Founder and CTO,iguazio
I spent much of my time recently in conferences talking to customers and analysts and realized they all were saying many of the same things about the challenges of productized, modern analytics solutions.
Fig. 1: Common analytics architecture
Prevalent architecture also has data copied to a data lake periodically, using batch ETL tasks.With this process, data is compressed to immutable columnar formats such as Parquet to accelerate query performance. Events and logs are pushed through streaming storage and processed immediately with a copy stored in the data lake – they are loosely coupled with the data lake to avoid a speed mismatch (Lambda or Kappa architectures).
Data in the lake is usually unstructured, forcing long data “cleaning” and wrangling. This all leads to inaccurate results and consumes many computational resources every time data is processed, not to mention that without proper metadata it is impossible to find the relevant datasets or secure the data.
While there’s a great deal of chatter about real-time streaming and machine learning, the dominant method for data exploration is querying data through batch or interactive queries and applying statistical modeling to the results. Stream processing is still mostly used to address faster data ingestion and transformation, or for simple tasks of evaluating events against an (old) data model to drive some immediate actions.
Machine learning usually is applied on old data sets extracted from the data warehouse or in data lakes – it is simply too difficult to learn from fresh data. AI decisions and prediction need to work at the speed of the event, limiting the decision and feature vectors to smaller datasets or short history that fit in-memory databases. This is part of the reason many recommendation engines provide irrelevant results, mistargeted ads or false detections of fraud.
What I noticed is that once data scientists finish the exploration and modeling phase, different teams come in to redesign for production, starting the project from scratch and using new dev tools and languages that address better error handling or higher performance.
With so many of us accustomed to running interactive queries on old data, the key challenge now is in initiating sub-second actions driven from fresh data models.
Designing continuous analytics from scratch
With continuous analytics, actions and insights are delivered from fresh data in real-time, for production use, as opposed to just work on interactive queries for data exploration and reports. This means eliminating complex and slow data wrangling and parsing as soon as possible, so data entering the system is machine optimized, clean and curated. No more schema on read, dubious data and data swamps.
Instead of periodic ETLs from external databases, this method streams data using continuous data capture (CDC) tools. That means no more time gaps and unknowns, because data is synchronized between operational databases and the analytics system in real-time.
Fig. 2: Continuous Analytics Architecture
When all of this happens, the focus can shift from traditional BI and DW queries to parallelism and continuous operations that address both real-time and interactive responses while streamlining operations.
With continuous analytics solution, tasks are broken into independent micro-services or serverless functions which run concurrently, such as:
Enrichment and Denormalization: Data entering the system often requires additional context. The system may accept sensor information keyed by the sensor ID but also wants the manufacturer, model or other related historical information. The same goes for correlating a user profile with his or her ID to serve custom content. Real-time analytics uses in-memory caching and fast random access to a real-time database to address real-time requirements. This is a huge improvement over current enrichment processes and SQL JOIN queries which consume much more time and resources.
Merge and Aggregate: Many analytics processes refer to a historical state, such as how many times a user performed a certain operation, or the average temperature of a sensor, or the number of cars in a specific geo-location. With continuous analytics, some statistics can now get aggregated and stored with the existing data, ready for immediate use or validation against pre-defined thresholds instead of running an expensive “GROUP BY” database query.
Prediction (inferencing): The event data along with previous state, aggregate results and enriched data together form a wide feature vector. This vector is input to an AI prediction logic using algorithms for regression, classification, etc. And the prediction output results can drive an action or are updated in a real-time database which serves dashboards.
Real-Time Event Processing: Once detected, a threshold, anomaly, or an event can trigger micro-services that address it by alerting a user, conducting repair operations, modifying a data model or other action.
Analytics and Machine Learning: Multiple analytics and ML processes such as data exploration, model training and reinforced learning can be fed with fresh data in the form of streams, images or structured tables and generate new AI models frequently or continuously, those new models are immediately shared with the real-time inferencing services. Detected features can update/tag individual data records and images immediately. They can also trigger an event. Real-time analytics tasks consume the enriched and aggregated results directly and don’t need to wait for query processing.
Real-Time Dashboards: With continuous analytics, since data and its derivatives are always up to date dashboards only need to read and present relevant datasets. API services map complex dashboard elements to several real-time data requests and serve them to clients through ready-to-use JSON responses.
When application microservices and analytics engines are stateless and containerized, they allow for greater elasticity, simpler deployment and continuous development and operation. Tools such as Spark handle many of those tasks, but it is also possible to leverage emerging stream processing tools, new ML and AI libraries (such as TensorFlow), or implement event-driven tasks using “serverless” functions such as Amazon Lambda or a real-time open-source serverless platform called nuclio.
A key attribute of the new architecture is that it uses a real-time unified data store to update, manipulate and query a variety of data objects simultaneously. Rather than working with multiple store classes and complicating application logic and operations, a unified store provides tiering across memory, flash and cloud storage, balancing cost vs performance requirements.
A preferred approach is to use structured/semi-structured records or streaming where possible and avoid unstructured data. This approach eliminates errors, provides better performance and enables more granular security. Using transactional or atomic data update semantics allows for a consistent state across failures and software version upgrades.
Using modern orchestration platforms such as Mesos or Kubernetes, coupled with a CI/CD pipeline or an alternative public cloud offering, allows for continuous development and operations, drives agility and minimizes gaps between research, development and production.
What’s needed are continuous applications which respond in real-time, using modern agile and cloud methodologies. Modern solutions are the ones that move away from the notion of long-running batch and interactive queries. They’re also restructuring the organizational separation between research, dev and ops.
Bio: Yaron Haviv is a serial entrepreneur with deep technological knowledge in the fields of big data, cloud, storage, networking and high-performance.
- The BI & data Analysis Conundrum: 8 Reasons Why Many Big data Analytics Solutions Fail to Deliver Value
- Standardization and Specialization in Analytics, data Science, and BI
- Big data Architecture: A Complete and Detailed Overview