Limitations and attention points¶
Spark is a fairly complex execution engine; tuning and troubleshooting Spark jobs require some experience.
Spark’s additional possibilities come with a few limitations:
Sampling with filter is not supported for input datasets; prefer a filtering recipe instead.
HDFS datasets perform much better on Spark than other datasets, for both reading and writing.
Sampling an HDFS dataset (except with a fixed ratio) can be slower than loading it unsampled.
Warning
Spark’s overhead is non-negligible and its support has some limitations (see above and Usage of Spark in DSS). If your data fits in the memory of a single machine, other execution engines might be faster and easier to tune. It is recommended that you only use Spark for data that does not fit in the memory of a single machine.