Usage notes per dataset type

DSS can read and write all datasets using Spark.

If you have a Spark engine (which does not need a cluster installation), using Spark on local datasets can still bring performance improvements, for example for the Grouping and Join recipes. Additionally, using Spark gives you the ability to run SparkSQL, even if you don’t have a Hadoop cluster.

However, only HDFS and S3 datasets fully benefit from Spark's distributed nature out of the box. This is because Spark has built-in support for reading data from these backends and for splitting it into multiple partitions.

For other kinds of datasets, since Spark does not natively read and split them, DSS makes them available in Spark using a simplified reader. These datasets are read and written using a single Spark partition (not to be confused with DSS partitions). A single Spark partition is processed in a single thread (per Spark stage). Furthermore, in some operations, a single Spark partition is restricted to 2 GB of data. Therefore, if your dataset is large, you will need to repartition it.

  • In PySpark and SparkR recipes, you need to use the SparkSQL API to repartition a dataframe (generally df.repartition(X), where X is the number of partitions).
  • In SparkSQL, Visual preparation, MLlib and VisualSQL recipes, repartitioning is automatic (into 10 partitions by default). You can configure the repartitioning and the target number of partitions in the various Advanced tabs.
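
For the PySpark case, the manual repartitioning step can be sketched as follows. This is a minimal illustration, not DSS code: the helper name repartition_if_needed is hypothetical, and it assumes df is a regular PySpark dataframe, using only its standard rdd.getNumPartitions() and repartition() methods.

```python
def repartition_if_needed(df, target_partitions):
    """Return df spread over at least target_partitions Spark partitions.

    df is assumed to be a pyspark.sql.DataFrame; only its
    rdd.getNumPartitions() and repartition() methods are used.
    This helper is hypothetical and not part of DSS or Spark.
    """
    if target_partitions < 1:
        raise ValueError("target_partitions must be >= 1")

    current = df.rdd.getNumPartitions()
    if current >= target_partitions:
        # Already split finely enough (typical for HDFS or S3 inputs),
        # so skip the shuffle that repartition() would trigger.
        return df

    # repartition() performs a full shuffle and returns a new dataframe.
    return df.repartition(target_partitions)
```

In a PySpark recipe, such a helper would typically be called on the dataframe right after reading the input dataset, for example df = repartition_if_needed(df, 50), and before any expensive transformation.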

A good rule of thumb is to ensure that each partition corresponds to 100-200 MB of data. Therefore, if your input dataset (on a non-HDFS, non-S3 dataset) is 10 GB, you might want to repartition it into 50-100 partitions (remember that for HDFS or S3 datasets, partitioning is automatically done at the source).
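
The arithmetic behind this rule of thumb can be written down as a small sketch. The function name suggested_partition_range is hypothetical, and the conversion 1 GB = 1000 MB is an assumption chosen to match the 10 GB example above.

```python
import math


def suggested_partition_range(size_gb, min_mb=100, max_mb=200):
    """Return (fewest, most) partitions so that each holds min_mb..max_mb of data.

    Assumes 1 GB = 1000 MB. Hypothetical helper, for illustration only.
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    size_mb = size_gb * 1000
    # Larger partitions (max_mb each) give the fewest partitions,
    # smaller partitions (min_mb each) give the most.
    fewest = max(1, math.ceil(size_mb / max_mb))
    most = max(1, math.ceil(size_mb / min_mb))
    return fewest, most


print(suggested_partition_range(10))  # (50, 100)
```

Any value in the returned range can then be used as the number of partitions when repartitioning, either as X in df.repartition(X) or as the target number of partitions in a recipe's Advanced tab.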