Limitations and attention points¶
Spark is a fairly complex execution engine, tuning and troubleshooting Spark jobs require some experience.
Spark’s additional possibilities come with a few limitations:
- Sampling with filter is not supported for input datasets, prefer a filtering recipe instead.
- HDFS datasets perform much better on Spark than other datasets, for both reading and writing.
- Sampling an HDFS dataset (except with a fixed ratio) can be slower than loading it unsampled.
Spark’s overhead is non-negligible and its support has some limitations (see above and Usage of Spark in DSS). If your data fits in the memory of a single machine, other execution engines might be faster and easier to tune. It is recommended that you only use Spark for data that does not fit in the memory of a single machine.