Tips from the Shelves: Proven Methods for Boosting Data Warehouse Speeds

In this special guest feature, Jason Harris, Evangelist at Panoply, discusses how data collection and analysis are further enhanced when including methods for disseminating, analyzing, and distributing data. Jason is Panoply’s Evangelist and works to serve the data community within Panoply. He’s a former DBA, report developer and code monkey; and now manages Panoply’s social media and events strategy so if you haven’t seen him in person yet, you will likely soon.

Modern business intelligence captures data from everywhere – databases, cloud services, various business intelligence (BI) tools, business, marketing, and analytics apps, email, surveys, and social media platforms. With so many data sources, organizations struggle to find efficient ways to store and manage structured, semi-structured and unstructured data. To combat this problem, newer and more innovative data analytics platforms have emerged to keep pace with the complexity and volume of captured data.

Along with overcoming the data storage hurdle, organizations must also find ways to improve rendering performance and visualization speed to allow faster time to insights for greater business efficiency. Finding data management tools that can reduce the time to visualization is the next significant challenge to BI’s agility and responsiveness, and one that can be solved with the right approach.

The most efficient way to improve dashboard run-times and increase the overall performance of a data warehouse is to optimize distribution and sort keys, compression, query, and view materialization optimizations. In a way, it’s like borrowing best practices from a well-run physical warehouse for how they move massive inventory to save time and money – learn ordering patterns, put the best-selling items close to the loading dock, and move items around as product popularity changes.

Distribution and Sort Keys

The distribution key is essential to optimization because it directly impacts performance and defines how data is spread across nodes. An automated process designates a distribution key based on “learned data” or anticipated data query patterns over time. Sort keys determine the distribution of data within a node, or make database decisions based on that learned data. A user-defined filter can also enhance performance significantly.

Some platforms can add constraints on both the primary and foreign keys through internal processing when other tools utilize it in their query building process. This can be achieved by the user via the user interface or by the algorithm to identify fields with id/id_pattern values. Based on the results, the system can define the field(s) as primary keys/foreign keys (PK/FK).

In a physical warehouse, the warehouse manager looks at what shipped today, last week, and last month, and compares it to the previous year. Those products that sold well and expect to continue to sell well are moved (during down time) closer to the front of the factory. The cost is manpower – time multiplied by salary. Distribution and sort keys perform a virtual move of “product” (information) resulting in decreased run-time.

Compression

Compression is an effective method for reducing the size of stored data. By reducing the data size, you may increase query performance due to reductions in disk I/O. The more data used, the more critical compression becomes. Some platforms handle compression on the initial copy, continue to monitor the data over time, and adjust the compression configuration as needed, particularly after changes or introductions of new data segments. By periodically optimizing via compression, overall responsiveness of the data set to rendering and visualization requests improves.

In a warehouse, compression can be compared with removing dead space from a shelf. Over time, product sales fluctuate, but dead space on a shelf is money lost. Condensing shelves allows more storage without having to increase the amount of shelving. Data storage has a similar cost – the smaller the footprint, the lower the cost.

Queries

Another method for improving performance is to keep track of users’ queries, specifically their use of tables in joins and fields in ON clauses. Based on this information, some platforms define the distribution key and style appropriate for each table and the database. This process allows future queries to better leverage the distribution keys and more easily access data from different nodes. For example, a field that is often used when joining a particular table would be a statistically strong candidate for a distribution key. Queries can also be dynamically defined to cache. Every query executed on top of some platforms goes through a proxy server. The query is analyzed for metadata purposes including execution time, row count, number of executions, etc., and a learning algorithm that evaluates the aggregated statistics and makes the decision to cache the query or remove it from the cache. In this case, layers of memory are utilized throughout its infrastructure which allows it to dynamically allocate the cached results between database storage and the proxy external memory.

Queries are just like “the old guy” in the warehouse. The one you ask where things are located, and he knows because he has been asked the same question so many times. If he is asked a new question, he may have to think about it for a minute, but once he has the answer, he remembers for the next time. Queries are like the virtual old guy; you can ask the new kid where something is located, and he will find it, but the old guy will find it in a fraction of the time.

Materialized Views

To optimize query materialization, query information should be combined with an analysis of the query patterns the system observes during testing. The system could then take advantage of both cached results and materialized views, and decide to cache all or a portion of the queries in memory based on user statistics gathered from the queries. By analyzing usage, run-time, and frequency, the application could decide which queries or views should be materialized or cached.

There are some platforms that can decide if queries are entirely cached in memory, segregated, or if an aggregated query is saved. In the case of a segregated query, the system could enable the database to pull it from memory instead of retrieving the data entirely from the data warehouse. As a result, users experience a significantly faster return on queries. This decision can make a notable difference in query optimization. Aggregate can be materialized using the following methods:

  • Materialized views – The results of the view are calculated and saved as tables (these are refreshed every time new data enters the tables the view depends upon).
  • Materialized queries (cached results) – The application re-calculates frequently used queries and saves the result as a table. By using a proxy, whenever users utilize the same query, they can identify it and query the materialized result automatically.

As data sets continue to grow and segments of data are repeated, optimization becomes much more critical. If optimization does not occur, growing data sets may cause performance slowdowns and impede data visualizations. Platform-based optimizations help users avoid this issue.

Most modem warehouses have items grouped as “People who bought this product also bought these products.” If you are buying a TV, for example, you most likely will need a wall mount which is stocked in the same part of the warehouse along with speakers and Blu-ray players. Materialized Views is a virtual implementation of this. It’s a one-stop for almost everything you asked for.

With limited time and increased responsibilities across today’s enterprises, data scientists must increasingly rely on integrated tools to auto-generate database queries, rather than on their ability to write or understand SQL. Boosting data warehouse speed is not only possible, it is inevitable when using a platform that begins to learn the data and act upon information derived from the queries. As these processes run, and the platform continues to learn, the result is continuous, significantly improved performance.