Troubleshooting

Jobs fail to run

requests.exceptions.ConnectionError

This error means that the running container cannot connect to the DSS backend. There are many possible reasons, including:

  • The container tries to connect to a name that cannot be resolved to an IP address from the container.
  • The network is not routing traffic out of the cluster towards the machine hosting DSS.
  • A firewall is blocking access to the machine hosting DSS. This can be cloud network rules or a local firewall.

This list is not exhaustive. However, the most common cause is that the container cannot resolve the host name of the DSS machine as is. To fix this, add the following variable to DATADIR/bin/env-site.sh:

export DKU_BACKEND_EXT_HOST="xxx.xxx.xxx.xxx" # DNS name or IP address of DSS backend, reachable from the containers

Restart DSS when you are done. To check that networking works as expected, click the Test button at the top right corner of each configuration in Administration > Settings > Containerized execution.
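
To narrow down which of the causes listed above applies, it can help to separate name resolution from connectivity. The following is a minimal sketch, not part of DSS: the function name diagnose_backend and its return values are hypothetical, and the host and port you pass in are assumptions that must match your own DSS installation. It uses only the Python standard library, so it could be run from any environment that has the same network view as the container.

```python
import socket


def diagnose_backend(host, port, timeout=5.0):
    """Classify a connection attempt to host:port.

    Hypothetical helper, not a DSS API. Returns one of:
      "dns_failure"     - the name could not be resolved to an IP address
      "connect_failure" - the name resolved, but no TCP connection succeeded
                          (routing problem, firewall, or nothing listening)
      "ok"              - a TCP connection was established
    """
    try:
        # Step 1: name resolution, as seen from where this code runs.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: try each resolved address until one accepts a connection.
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            return "ok"
        except OSError:
            # Covers timeouts, refused connections and unreachable networks.
            continue
    return "connect_failure"
```

A "dns_failure" result points at the first cause in the list, which is the case that DKU_BACKEND_EXT_HOST is meant to address. A "connect_failure" result points at routing or firewall rules instead. The sketch cannot tell those last two apart.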

Kubernetes job failed, exitCode=1, reason=Error

This message means that the process inside the container exited with a non-zero return code. The preceding log lines will usually contain a Python stack trace giving more information about what happened. The most common cause is the requests.exceptions.ConnectionError described above.
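
When the job log is long, locating that stack trace by hand can be tedious. The sketch below is a hypothetical helper, not a DSS feature. It assumes you have already saved the job log as plain text, and that the trace begins with the standard Python header line "Traceback (most recent call last):". It returns everything from the last such header to the end of the log.

```python
TRACEBACK_HEADER = "Traceback (most recent call last):"


def last_traceback(log_text):
    """Return the log from the last Python traceback header onward.

    Hypothetical helper. Returns None if the log contains no traceback.
    Lines are matched by substring, so a log prefix (timestamp, level)
    in front of the header does not prevent detection.
    """
    lines = log_text.splitlines()
    start = None
    for index, line in enumerate(lines):
        if TRACEBACK_HEADER in line:
            start = index  # keep overwriting so the last match wins
    if start is None:
        return None
    return "\n".join(lines[start:])
```

The final line of the trace names the exception type. If it is requests.exceptions.ConnectionError, the previous section applies.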