Sampling and charts engines
DSS features a powerful aggregation engine called the DSS Charts Engine. It uses a highly optimized, column-based, compressed storage format, which enables it to perform very fast aggregations and other visual analytics queries. The Charts Engine takes full advantage of modern CPU caches.
Unlike other analytics engines, the DSS Charts Engine does not require the chart data to be loaded in memory. Instead, it streams data efficiently from disk and runs queries on the fly. This lets you perform visual analytics on very large data extracts that would not fit in RAM, using commodity hardware.
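To illustrate the general idea of aggregating while streaming rather than loading everything in memory, here is a minimal Python sketch. It is not the DSS Charts Engine implementation, whose internals are not described here. The function name `streaming_group_sum` and the sample columns `country` and `amount` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def streaming_group_sum(rows, key, value):
    """Compute (sum, count) per group, reading one row at a time.

    Memory use grows with the number of distinct groups,
    not with the number of rows.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: (totals[group], counts[group]) for group in totals}


# Hypothetical data; in practice this would be a file opened from disk.
data = io.StringIO("country,amount\nFR,10\nUS,5\nFR,2.5\n")
result = streaming_group_sum(csv.DictReader(data), "country", "amount")
print(result)
```

Because `csv.DictReader` yields rows lazily, the same function works unchanged on a file handle for an extract much larger than available RAM.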
The DSS Charts Engine extracts data from your data source, transforms it into its optimized format, and then runs all queries against that pre-optimized data. Once data has been loaded into the Charts Engine, it does not need to access your data source again unless the source data changes.
The DSS Charts Engine can therefore perform visual analytic queries on all data sources that DSS supports, even data sources that are not at all suited for analytics, like CSV files.
In addition to the DSS Charts Engine, DSS can perform visual analytic queries directly in the database, using database-specific SQL queries. You can switch between engines with a single click. For example, you can prepare your charts on multi-gigabyte samples using the DSS Charts Engine, then switch to your native database for full-dataset analytics.
In-database processing is available for the following datasets:
- HDFS - Using Cloudera Impala, if it is installed and the HDFS data source is compatible with Impala.
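In-database processing amounts to pushing the aggregation down to the database as a SQL query, so that only the aggregated rows come back. The sketch below shows that pattern with Python's built-in `sqlite3` module as a stand-in database. The SQL that DSS actually generates is not shown in this page, and the `chart_query` helper, the `orders` table, and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("FR", 10.0), ("US", 5.0), ("FR", 2.5)],
)


def chart_query(conn, table, dimension, measure):
    """Run a GROUP BY in the database and return one row per group.

    Identifiers are interpolated into the SQL text, so this hypothetical
    helper must only be called with trusted table and column names.
    """
    sql = (
        f'SELECT "{dimension}", SUM("{measure}") '
        f'FROM "{table}" GROUP BY "{dimension}" ORDER BY "{dimension}"'
    )
    return conn.execute(sql).fetchall()


rows = chart_query(conn, "orders", "country", "amount")
print(rows)
```

The client only receives one row per group, which is why this approach can scale to the full dataset when the database itself is suited to analytics.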
By default, when you open the Visualize tab of a dataset or data preparation script, charts are computed on the same sample as the one used in the Table view. You can change the data sample used for charts by clicking on the Sample tab.
Note that you can also compute the charts on the whole dataset (no sampling).
For example, here, the chart will be computed on 1M records, evenly sampled from the whole dataset.
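This page does not say which algorithm DSS uses to draw an evenly distributed sample. As one possible illustration, reservoir sampling (Algorithm R) picks a fixed number of records uniformly at random in a single pass, without knowing the dataset size in advance. The function name `reservoir_sample` and its parameters are assumptions made for this sketch.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = record
    return sample


# Hypothetical usage: 100 records drawn from a stream of 10,000.
sample = reservoir_sample(range(10_000), k=100, seed=0)
print(len(sample))
```

If the input holds fewer than `k` records, the function simply returns all of them, which matches computing the chart on the whole dataset.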
Depending on the dataset, the sampling settings, and the preparation script, DSS automatically suggests switching between the DSS Charts Engine and Live in-database processing.
The DSS Charts Engine does not require data to fit in memory. However, it stores its optimized format on the disk on which DSS resides.
Therefore, for large samples, you need to make sure that you have enough space on this disk to store your data extracts.
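A quick way to check free space before building a large extract is Python's standard `shutil.disk_usage`. The sketch below is only an illustration: the `has_room` helper and its `safety_margin` parameter are hypothetical, and the path `"."` is a placeholder for whichever directory holds your DSS data, which this page does not specify.

```python
import shutil


def has_room(path, required_bytes, safety_margin=0.2):
    """Return True if the disk holding `path` has enough free space.

    `safety_margin` adds headroom on top of the estimated extract size.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + safety_margin)


# Placeholder path and a hypothetical 5 GB estimated extract size.
print(has_room(".", 5 * 1024**3))
```

The size of the optimized extract depends on the data and its compression, so the required size passed in is an estimate you have to supply yourself.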