Working with proxies

There are several configurations in which Data Science Studio (DSS) may need to work with web proxies:

  • If your users need to go through a (direct) proxy to reach the DSS interface, the only real requirement is that the proxy support the WebSocket protocol and allow long-lived WebSocket connections.
  • If you want to expose DSS to your users on a different host and/or port than its native installation, you need to configure a reverse proxy in front of DSS. This is the case in particular if you want to expose DSS on the standard HTTP/80 or HTTPS/443 ports, as DSS should not run with superuser privileges. Section Configuring a reverse proxy in front of Data Science Studio shows configuration examples to this effect for nginx and Apache implementations.
  • If Data Science Studio is installed on a server without direct outgoing Internet access, it may need to go through a proxy to reach external resources. Section Configuring a proxy for DSS to access external resources describes the configuration steps required for this.

Note about WebSockets

Data Science Studio uses the WebSocket protocol for parts of its user interface. This web protocol is fairly recent, and not yet supported by all HTTP proxies.

Make sure any direct or reverse proxy configured between Data Science Studio and its users correctly supports WebSocket, and is configured accordingly.

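At the time of writing, this includes nginx version 1.4 or above and Apache version 2.4.5 or above, which are the versions required by the configuration snippets in the following sections.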

See /troubleshooting/websockets for related details and troubleshooting advice.
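One way to check a proxy is to send the WebSocket opening handshake through it by hand and confirm that the reply is a 101 status carrying the expected Sec-WebSocket-Accept value. The sketch below does this with the Python standard library only. The function names are illustrative, and the path argument is an assumption: it must point at an endpoint that the backend actually serves over WebSocket, which this page does not specify.

```python
import base64
import hashlib
import os
import socket

# Constant GUID defined by RFC 6455 for computing Sec-WebSocket-Accept
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key: str) -> str:
    """Return the Sec-WebSocket-Accept value a server must send for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_handshake(host: str, path: str, key: str) -> bytes:
    """Build the raw HTTP/1.1 request that opens a WebSocket connection."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def proxy_upgrades(host: str, port: int, path: str, timeout: float = 10.0) -> bool:
    """Return True if the handshake sent to host:port is answered with a valid 101."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_handshake(host, path, key))
        # A single read is enough for a sketch; the reply headers are short
        response = sock.recv(4096).decode("latin-1")
    status_line = response.split("\r\n", 1)[0]
    return status_line.split()[1:2] == ["101"] and expected_accept(key) in response


# Hypothetical usage, with a placeholder path:
# proxy_upgrades("dss.example.com", 80, "/some/websocket/endpoint")
```

A False result only shows that this particular handshake did not complete; it does not by itself tell whether the proxy or the chosen path is at fault.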

Configuring a reverse proxy in front of Data Science Studio

The following configuration snippets can be adapted to forward the Data Science Studio interface through an external nginx or Apache web server, to accommodate deployments where users should access it through a different base URL than that of its native host and port (for example, to expose Data Science Studio on the standard HTTP port 80, or under a different host name).

Warning

Data Science Studio does not currently support being remapped to a base URL with a non-empty path prefix (that is, to http://HOST:PORT/PREFIX/ where PREFIX is not empty).
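As a quick sanity check before writing the proxy configuration, the planned public base URL can be tested for a path prefix. The helper below is a minimal sketch using the Python standard library; its name is illustrative and it is not part of Data Science Studio.

```python
from urllib.parse import urlsplit


def has_empty_path_prefix(base_url: str) -> bool:
    """Return True if base_url has the form scheme://HOST[:PORT]/ with no PREFIX."""
    parts = urlsplit(base_url)
    # Require an absolute URL, then accept only an empty path or a bare "/"
    return bool(parts.scheme and parts.netloc) and parts.path in ("", "/")


if __name__ == "__main__":
    for url in ("http://dss.example.com/", "http://dss.example.com/dss/"):
        print(url, has_empty_path_prefix(url))
```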

HTTP deployment behind an nginx reverse proxy

# nginx reverse proxy configuration for Dataiku Data Science Studio
# requires nginx version 1.4 or above
server {
    # Host/port on which to expose Data Science Studio to users
    listen 80;
    server_name dss.example.com;
    location / {
        # Base url of the Data Science Studio installation
        proxy_pass http://DSS_HOST:DSS_PORT/;
        proxy_redirect off;
        # Allow long queries
        proxy_read_timeout 3600;
        proxy_send_timeout 600;
        # Allow large uploads
        client_max_body_size 0;
        # Allow protocol upgrade to websocket
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

HTTPS deployment behind an nginx reverse proxy

DSS can also be accessed using secure HTTPS connections, provided you have a valid certificate for the host name on which it should be visible (some browsers do not accept secure WebSocket connections using untrusted certificates).

You can configure this by deploying an nginx reverse proxy server, on the same host as Data Science Studio or on another one, using a variant of the following configuration snippet:

# nginx SSL reverse proxy configuration for Dataiku Data Science Studio
# requires nginx version 1.4 or above
server {
    # Host/port on which to expose Data Science Studio to users
    listen 443 ssl;
    server_name dss.example.com;
    ssl_certificate /etc/nginx/ssl/dss_server_cert.pem;
    ssl_certificate_key /etc/nginx/ssl/dss_server.key;
    location / {
        # Base url of the Data Science Studio installation
        proxy_pass http://DSS_HOST:DSS_PORT/;
        proxy_redirect off;
        # Allow long queries
        proxy_read_timeout 3600;
        proxy_send_timeout 600;
        # Allow large uploads
        client_max_body_size 0;
        # Allow protocol upgrade to websocket
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Note

If all DSS users access it over HTTPS, you can enforce session cookies security as described in /advanced/security_options.

HTTP deployment behind an Apache reverse proxy

The following configuration snippet can be used to forward DSS through an Apache HTTP server:

# Apache reverse proxy configuration for Dataiku Data Science Studio
# requires Apache version 2.4.5 or above
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule proxy_wstunnel_module modules/mod_proxy_wstunnel.so

<VirtualHost *:80>
    ServerName dss.example.com
    ProxyPass / ws://DSS_HOST:DSS_PORT/
    ProxyPassReverse / ws://DSS_HOST:DSS_PORT/
    ProxyPreserveHost on
    ProxyTimeout 3600
</VirtualHost>

Configuring a proxy for DSS to access external resources

Data Science Studio proxy configuration for remote datasets

If Data Science Studio runs inside your private network, you may need to configure an outgoing proxy for it to be able to access external HTTP- or FTP-based network resources.

This applies in particular to HTTP, HTTPS and FTP remote datasets, Amazon S3 datasets and Twitter streams.

You can define a global proxy configuration for DSS in the “Settings” tab of the Administration page. Choose Proxy, fill in the fields, and save.

Every HTTP(S)- and FTP-based connection will now have an additional “Use global proxy” checkbox. Uncheck it if that connection should not go through the proxy (e.g. for services that are inside your private network). This also applies to Amazon EC2/S3, Elasticsearch, and Twitter connections.

Note

SOCKS proxies are not supported in Data Science Studio.

Warning

A note on FTP through HTTP Proxy

Connecting to an FTP server through an HTTP proxy requires passive mode, and requires the proxy to allow and support the HTTP CONNECT method on ports 20, 21 and all unprivileged ports (1024-65535).

Below is a sample Apache 2.4 configuration for this:

Listen 3128
<VirtualHost *:3128>
  ProxyRequests On
  ProxyVia On
  AllowConnect 20 21 443 1024-65535
  <Proxy *>
    # Only allow requests from the internal network (Apache 2.4 syntax)
    Require ip 1.2.3.4
  </Proxy>
</VirtualHost>
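To confirm that a proxy accepts CONNECT on the FTP control port, a CONNECT request can be sent to it directly and the status line of the reply inspected. The following is a minimal sketch with the Python standard library; the function names and the host ftp.example.com are hypothetical, and port 3128 is simply the one used in the sample configuration above.

```python
import socket


def build_connect_request(target_host: str, target_port: int) -> bytes:
    """Build the raw HTTP CONNECT request asking the proxy to open a tunnel."""
    authority = f"{target_host}:{target_port}"
    return f"CONNECT {authority} HTTP/1.1\r\nHost: {authority}\r\n\r\n".encode("ascii")


def connect_allowed(status_line: str) -> bool:
    """Return True when the proxy answered the CONNECT request with a 2xx status."""
    fields = status_line.split()
    return len(fields) >= 2 and fields[1].isdigit() and 200 <= int(fields[1]) < 300


def probe(proxy_host: str, proxy_port: int, target_host: str, target_port: int,
          timeout: float = 10.0) -> bool:
    """Send one CONNECT request through the proxy and report whether it was accepted."""
    with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as sock:
        sock.sendall(build_connect_request(target_host, target_port))
        reply = sock.recv(4096).decode("latin-1")
    return connect_allowed(reply.split("\r\n", 1)[0])


# Hypothetical usage:
# probe("MYPROXY", 3128, "ftp.example.com", 21)
```

Because passive-mode data connections use an unprivileged port chosen by the FTP server, a successful probe on port 21 does not guarantee that the data channel will also be allowed.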

Proxy configuration for Python and R processes

The above global proxy configuration applies only to native connections made from the Data Science Studio backend. If network connections made from Python or R code (in a recipe or notebook) need to go through a proxy, configure it using the standard configuration directives for these environments. This includes adding explicit proxy parameters to the network calls, e.g. for Python requests:

requests.get(URL, proxies={'http': 'http://MYPROXY:MYPROXYPORT'})

and/or globally configuring proxy directives through the standard http_proxy (https_proxy, ftp_proxy …) environment variables, e.g.:

# Add the following directive to DATADIR/bin/env-site.sh
# or to the session initialization file of the DSS Unix user (.profile or equivalent)
export http_proxy="http://MYPROXY:MYPROXYPORT"

Refer to the Python and R reference manuals for details.
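To verify from a recipe or notebook that the environment variables are visible to the Python process, the standard library can report the proxies it would pick up. This is a minimal sketch; the helper name is illustrative. The requests library also reads these variables by default, so an explicit proxies argument is only needed to override them.

```python
import urllib.request


def effective_proxies(schemes=("http", "https", "ftp")):
    """Return the proxy URL found in the environment for each scheme, or None."""
    env = urllib.request.getproxies_environment()
    return {scheme: env.get(scheme) for scheme in schemes}


if __name__ == "__main__":
    for scheme, proxy in effective_proxies().items():
        print(f"{scheme}: {proxy or 'no proxy configured'}")
```

If a scheme shows no proxy here although it was exported in the shell, the variable was most likely set after the process was started, or in a file that the DSS Unix user does not load.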

Note

This also applies to network accesses needed to download and install additional Python or R packages.