Advanced FM setup

HTTPS and reverse proxies

Depending on how you want to expose FM to your users, you can choose between two configurations:

  • If you want to expose FM to your users on a different host and/or port than its native installation, you need to configure a reverse proxy in front of FM. This is the case in particular if you want to expose FM on the standard HTTP/80 or HTTPS/443 ports, as FM should not run with superuser privileges. Section Configuring a reverse proxy in front of Data Science Studio shows configuration examples to this effect for nginx and Apache implementations.

  • Alternatively, FM can natively serve HTTPS. See the “Configuring HTTPS” section.

Configuring a reverse proxy in front of Data Science Studio

The following configuration snippets can be adapted to forward the Data Science Studio interface through an external nginx or Apache web server, to accommodate deployments where users should access it through a different base URL than that of its native host and port installation (for example to expose Data Science Studio on the standard HTTP port 80, or on a different host name).

Warning

Data Science Studio does not currently support being remapped to a base URL with a non-empty path prefix (that is, to http://HOST:PORT/PREFIX/ where PREFIX is not empty).
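
As an illustration of the restriction above, the following minimal Python sketch checks whether a candidate public base URL has an empty path prefix. The function name has_empty_path_prefix is hypothetical and is not part of FM; the check relies only on the standard library URL parser.

```python
from urllib.parse import urlsplit


def has_empty_path_prefix(base_url: str) -> bool:
    """Return True if base_url has no path prefix (its path is empty or "/")."""
    path = urlsplit(base_url).path
    return path in ("", "/")
```

With this helper, http://fm.example.com/ is accepted, while http://fm.example.com/fm/ is reported as carrying a non-empty prefix and would therefore not be a usable base URL.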

HTTP deployment behind a nginx reverse proxy

# nginx reverse proxy configuration for Dataiku Data Science Studio
# requires nginx version 1.4 or above
server {
    # Host/port on which to expose Data Science Studio to users
    listen 80;
    server_name fm.example.com;
    location / {
        # Base url of the Data Science Studio installation
        proxy_pass http://FM_HOST:FM_PORT/;
        proxy_redirect off;
        # Allow long queries
        proxy_read_timeout 3600;
        proxy_send_timeout 600;
        # Allow large uploads
        client_max_body_size 0;
        # Allow protocol upgrade to websocket
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

HTTPS deployment behind a nginx reverse proxy

FM can also be accessed using secure HTTPS connections, provided you have a valid certificate for the host name on which it should be visible (some browsers do not accept secure WebSocket connections using untrusted certificates).

You can configure this by deploying an nginx reverse proxy server, on the same host as Data Science Studio or on another one, using a variant of the following configuration snippet:

# nginx SSL reverse proxy configuration for Dataiku Data Science Studio
# requires nginx version 1.4 or above
server {
    # Host/port on which to expose Data Science Studio to users
    listen 443 ssl;
    server_name fm.example.com;
    ssl_certificate /etc/nginx/ssl/fm_server_cert.pem;
    ssl_certificate_key /etc/nginx/ssl/fm_server.key;
    location / {
        # Base url of the Data Science Studio installation
        proxy_pass http://FM_HOST:FM_PORT/;
        proxy_redirect off;
        # Allow long queries
        proxy_read_timeout 3600;
        proxy_send_timeout 600;
        # Allow large uploads
        client_max_body_size 0;
        # Allow protocol upgrade to websocket
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Note

If all FM users access it over HTTPS, you can enforce session cookies security as described in Advanced security options.

HTTP deployment behind an Apache reverse proxy

The following configuration snippet can be used to forward FM through an Apache HTTP server:

# Apache reverse proxy configuration for Dataiku Data Science Studio
# requires Apache version 2.4.5 or above
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule proxy_wstunnel_module modules/mod_proxy_wstunnel.so
LoadModule rewrite_module modules/mod_rewrite.so

<VirtualHost *:80>
    ServerName fm.example.com
    RewriteEngine On
    RewriteCond %{HTTP:Connection} Upgrade [NC]
    RewriteCond %{HTTP:Upgrade} WebSocket [NC]
    RewriteRule /(.*) ws://FM_HOST:FM_PORT/$1 [P]
    RewriteRule /(.*) http://FM_HOST:FM_PORT/$1 [P]
    ProxyPassReverse / http://FM_HOST:FM_PORT/
    ProxyPreserveHost on
    ProxyTimeout 3600
</VirtualHost>

Configuring HTTPS

By default, FM listens for HTTP connections on the given base port, i.e. it is accessible at address http://FM_HOST:FM_PORT. Using installation configuration directives, you can switch FM to accepting HTTPS connections instead, i.e. answering at https://FM_HOST:FM_PORT.

You will need to generate and provide an SSL server certificate and private key file matching the domain name used by end users to reach FM. You can then configure FM to switch to HTTPS by adding the following entries to the [server] section of the install.ini installation configuration file:

[server]
ssl = true
ssl_certificate = PATH_TO_CERTIFICATE_FILE
ssl_certificate_key = PATH_TO_PRIVATE_KEY_FILE
ssl_ciphers = recommended

You should then regenerate FM configuration and restart FM, as described in Installation configuration file.
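
If you prefer to script this edit, the entries above can be added with Python's standard configparser module. The following is a minimal sketch, not an FM tool: the function name enable_https is hypothetical, and it works on the text of install.ini rather than on the file itself. Note that configparser drops comments and lowercases option names when it rewrites the content, so keep a backup of the original file.

```python
import configparser
import io


def enable_https(ini_text: str, cert_path: str, key_path: str,
                 ciphers: str = "recommended") -> str:
    """Return ini_text with the [server] HTTPS entries added or updated."""
    # interpolation=None keeps "%" characters in values literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("server"):
        parser.add_section("server")
    parser.set("server", "ssl", "true")
    parser.set("server", "ssl_certificate", cert_path)
    parser.set("server", "ssl_certificate_key", key_path)
    parser.set("server", "ssl_ciphers", ciphers)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

The returned text would then be written back to DATADIR/install.ini before regenerating the configuration and restarting FM.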

Note

The optional ssl_ciphers = recommended configuration key restricts the set of SSL ciphers accepted by FM to a safe subset, for better protection against known attacks, while staying compatible with most recent browsers and FM-supported Linux platforms.

Setting this key to default (or omitting it altogether) does not configure any restriction on the accepted SSL ciphers, which then fall back to the default list built into the nginx server.

Note

You can also expose FM to users over HTTPS by interposing a reverse proxy. This option is mandatory if you want to use default HTTPS port 443, as FM cannot run with the superuser privileges necessary to listen on this port.

Note

If all FM users access it over HTTPS, you can enforce session cookies security as described in Advanced security options.

Installation configuration file

The installation process for FM can be customized through the DATADIR/install.ini configuration file.

This file is initialized with default values when the data directory is first created. It can be edited to specify a number of non-default installation options, which are then preserved upon upgrades.

Modifying this file requires running a post-installation command to propagate the changes, and restarting FM, as follows:

# Stop FM
DATADIR/bin/fm stop
# Edit installation options
vi DATADIR/install.ini
# Regenerate FM configuration according to the new settings
DATADIR/bin/fmadmin regenerate-config
# Restart FM
DATADIR/bin/fm start

The install.ini installation configuration file is a standard INI-style Python configuration file with [section] headers followed by key = value entries. The following entries are set up by the initial installation and are mandatory:

[general]
nodetype = fm

[server]
port = 10000

Additional installation options are described throughout this manual.
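
Because install.ini follows the INI format understood by Python's configparser module, a quick sanity check can be scripted before regenerating the configuration. The sketch below only verifies that the two mandatory entries listed above are present; the function name missing_mandatory_entries is hypothetical and the check is not an FM feature.

```python
import configparser
from typing import List

# Mandatory entries as listed in this section
MANDATORY = {"general": ("nodetype",), "server": ("port",)}


def missing_mandatory_entries(ini_text: str) -> List[str]:
    """Return the mandatory "section.key" entries absent from ini_text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    missing = []
    for section, keys in MANDATORY.items():
        for key in keys:
            # has_option is False when the section itself is absent
            if not parser.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing
```

An empty list means both mandatory entries were found; any other result names the entries to restore before running fmadmin regenerate-config.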

Configuring IPv6 support

By default, FM listens to IPv4 connections only. Using the following installation configuration directive, you can configure FM to listen to IPv6 connections to its base port, in addition to IPv4 connections.

[server]
ipv6 = true

You should then regenerate FM configuration and restart FM, as described in Installation configuration file.

Configuring log file rotation

Main FM processes log files

FM processes write their log files to directory DATADIR/run:

fmmain.log

Main FM process (backend)

nginx.log

HTTP server (nginx)

supervisord.log

Process control and supervision

By default, these log files are rotated when they reach a given size, and purged after a given number of rotations. The following installation configuration directives can be used to customize this behavior:

[logs]
# Maximum file size, default 50MB.
# Suffix multipliers "KB", "MB" and "GB" can be used in this value.
logfiles_maxbytes = SIZE
# Number of retained files, default 10.
logfiles_backups = NUMBER_OF_FILES

You should then regenerate FM configuration and restart FM, as described in Installation configuration file.
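
To make the SIZE syntax concrete, here is a minimal Python sketch that converts such a value into a number of bytes. It assumes that the suffixes are binary multiples (1 KB = 1024 bytes), as is the case in supervisord; this manual does not state the exact multiplier, so treat it as an assumption. The function name parse_size is hypothetical.

```python
import re

# Assumption: binary multiples, as used by supervisord
_MULTIPLIERS = {"": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a size such as "50MB" or "1048576" into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(KB|MB|GB)?\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"invalid size: {value!r}")
    number, suffix = match.groups()
    return int(number) * _MULTIPLIERS[(suffix or "").upper()]
```

Under this assumption, the default of 50MB corresponds to 52428800 bytes per log file before rotation.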

Additional log files

In addition to the main log files described above, FM generates one additional log file in directory DATADIR/run, which is handled differently:

  • nginx/access.log : This is the access log for FM HTTP server. Under normal utilization this file grows only slowly compared to the previous ones. It is not rotated automatically, but can be rotated manually through the standard nginx procedure, or using the manual log file rotation command described below.

Manual log file rotation

The following command forces FM to close and reopen its log files (main FM processes log files and nginx access log). Combined with standard tools like logrotate(8), and the possibility to disable automatic log rotation as described above, this lets you take full control over the FM log rotation process, and integrate it in your log file handling framework.

# Use standard Unix commands to rename FM current log files
...
# Force FM to reopen new log files
DATADIR/bin/fm reopenlogs
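
The rename step is left to your own tooling. As one possible illustration, the following Python sketch renames a set of log files by appending a suffix and then optionally runs a reopen command. The function name rotate_logs, the suffix scheme and the way the command is passed are all hypothetical choices; only the log file names and the fm reopenlogs command come from this manual.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Sequence


def rotate_logs(run_dir: str, names: Sequence[str], suffix: str,
                reopen_cmd: Optional[Sequence[str]] = None) -> List[Path]:
    """Rename each existing log file to NAME.SUFFIX, then run reopen_cmd if given."""
    renamed = []
    for name in names:
        src = Path(run_dir) / name
        if not src.exists():
            continue  # nothing to rotate for this name
        dst = src.with_name(f"{src.name}.{suffix}")
        if dst.exists():
            raise FileExistsError(f"refusing to overwrite {dst}")
        src.rename(dst)
        renamed.append(dst)
    if reopen_cmd and renamed:
        # For example ["DATADIR/bin/fm", "reopenlogs"] with the real data directory
        subprocess.run(list(reopen_cmd), check=True)
    return renamed
```

A typical call would pass the DATADIR/run directory, names such as fmmain.log and nginx/access.log, a date-based suffix, and the reopenlogs command shown above.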

Java requirements

FM is a Java application, and requires a compatible Java environment to run. Supported versions are OpenJDK and Oracle JDK, versions 8, 10, or 11.

Unless instructed otherwise (see below) the FM installer will automatically look for a suitable version of Java in standard locations. If none is found, it will install an appropriate OpenJDK package as part of its dependency installation phase.

Note

Java 9 is not supported.

While the Java Runtime Environment (JRE) is technically sufficient for FM to run, it is strongly recommended to install the full Java Development Kit (JDK) as this includes additional tools for diagnosing performance and other technical issues. Dataiku support may require you to install the full JDK to investigate some cases.
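
To check in advance whether a given Java installation falls in the supported list, you can inspect the output of java -version (which is printed on standard error). The Python sketch below extracts the major version from that output, handling both the legacy "1.8.0_292" scheme and the newer "11.0.11" scheme. The function names are hypothetical, and this is an illustration rather than the detection logic used by the FM installer.

```python
import re
from typing import Optional

# Supported major versions, as stated in this section
SUPPORTED_MAJOR_VERSIONS = {8, 10, 11}


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the Java major version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))  # legacy scheme: "1.8.0_292" means Java 8
    return first


def is_supported(version_output: str) -> bool:
    """Return True if the reported major version is one of 8, 10 or 11."""
    return java_major_version(version_output) in SUPPORTED_MAJOR_VERSIONS
```

The version string can be captured with subprocess.run(["java", "-version"], capture_output=True, text=True).stderr and passed to is_supported.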

Choosing the JVM

You can force FM to use a specific version of Java (for example, when there are several versions installed on the server, or when you manually installed Java in a non-standard place) by setting the DKUJAVABIN environment variable while running the FM installer script. This variable should point to the java binary to use. For example:

$ DKUJAVABIN=/usr/local/bin/java dataiku-fm-VERSION/installer.sh <INSTALLER_OPTIONS>

Note that the installer script stores this value in the file DATA_DIR/bin/env-default.sh, so this environment variable is only needed when running the installer. It must however be provided again for all subsequent FM updates, unless you want FM to revert to the automatically-detected version of Java.

Switching the JVM

You can switch an existing FM instance to a different version of Java by rerunning the installer in update mode with a new value for DKUJAVABIN, as follows:

# Stop FM
$ FM_DATADIR/bin/fm stop

# Switch this FM instance to a different Java runtime
$ DKUJAVABIN=/PATH/TO/NEW/java dataiku-fm-VERSION/installer.sh -d FM_DATADIR -u

# Restart FM
$ FM_DATADIR/bin/fm start

Customizing Java runtime options

The FM installer generates a default set of runtime options for the FM Java processes, based on the Java version in use and the memory size of the hosting server. These options can be customized if needed.

The different Java processes

FM is made up of a single kind of Java process:

  • The “fmmain” process is the main server

What can be customized

All Java options of this process can be customized.

For each of these, FM provides an easy way to:

  • configure the amount of memory allocated to each process (Java “-Xmx”)

  • add custom options

These customizations can be done by editing the install.ini file.

More advanced customization (taking precedence over default FM options) can be done via environment files.

Customizing maximum memory size (xmx)

Most often, you will want to customize the “xmx” setting, which is the maximum amount of memory allocated to the Java process.

Xmx is configured by setting the <processtype>.xmx setting in the javaopts section of the install.ini file (where <processtype> is one of backend, jek, fek or hproxy).

The installer sets Xmx to a default value between 2 and 6 GB, depending on the memory size of the host. This might not be enough for FM instances with a large number of users. If that amount of memory is not sufficient, the FM backend may crash, and all users would get disconnected until it automatically restarts.

Example: Set Xmx of backend to 8g

  • Go to the FM data directory

Note

On macOS, the DATA_DIR is always: $HOME/Library/DataScienceStudio/fm_home

  • Stop FM

    ./bin/fm stop
    
  • Edit the install.ini file

  • If it does not exist, add a [javaopts] section

  • Add a line: backend.xmx = 8g

  • Regenerate the runtime config:

    ./bin/fmadmin regenerate-config
    
  • Start FM

    ./bin/fm start
    

Example install.ini

Here is an example of an install.ini file that configures the Xmx for backend:

[javaopts]
backend.xmx = 8g

Memory amounts can be suffixed with “m” or “g” for megabytes and gigabytes.

Adding additional Java options

You can add arbitrary options to the FM Java processes. Use the same procedure as above, with <processtype>.additional.opts directives:

[javaopts]
fmmain.additional.opts = -Dmy.option=value

Advanced customization

The full Java runtime options can be configured by setting environment variables in the DATA_DIR/bin/env-site.sh file in the FM data directory.

Warning

You should only use this section if you could not obtain the desired set of options using the options above.

The default runtime options are stored in the following environment variable:

  • DKU_FMMAIN_JAVA_OPTS

The default values for these variables (computed from install.ini) are stored in the DATA_DIR/bin/env-default.sh file.

Warning

Do not modify DATA_DIR/bin/env-default.sh: it is overwritten at the next FM upgrade and after each call to ./bin/fmadmin regenerate-config.

To configure these options:

  • Stop FM

    ./bin/fm stop
    
  • Open the bin/env-default.sh file

  • Copy the line you want to change. They look like export DKU_BACKEND_JAVA_OPTS, export DKU_JEK_JAVA_OPTS, …

  • Open the DATA_DIR/bin/env-site.sh file

  • Paste the line and modify it to your needs

  • Start FM

    ./bin/fm start
    

Adding SSL certificates to the Java truststore

There are a number of configurations where FM needs to connect to external resources using secure network connections (SSL / TLS). This includes (but is not limited to):

  • connecting to a secure LDAP server

  • connecting to Hadoop components (Hive, Impala) over SSL-based connections

  • connecting to SQL databases, MongoDB, Cassandra, … over secure connections

In all these cases, the Java runtime used by FM needs to be able to verify the identity of the remote server, by checking that its certificate is derived from a trusted certification authority. The JVM comes with a default list of well-known Internet-based certification authorities, which normally covers all legitimate publicly-accessible Internet resources. However, resources internal to your organization are typically certified by private certification authorities, or by standalone (self-signed) certificates. It is then necessary to add additional certificates to the trusted list of the JVM used by FM (a.k.a. truststore).

You should refer to the documentation of your JVM and/or Linux distribution for the precise procedure for this. In most cases, you can use one of the following options:

Add a local certificate to the global JVM truststore

You will need write access to the Java installation for this (that would be root access for the typical case where the JVM has been installed through a package manager).

  • check which JVM is used by FM by looking for variable DKUJAVABIN in file DATADIR/bin/env-default.sh

  • locate the physical installation directory of this JVM with : readlink -f /PATH/TO/java. This should resolve to JAVA_HOME/jre/bin/java where JAVA_HOME is the installation directory for this JVM.

  • locate the default truststore file, at JAVA_HOME/jre/lib/security/cacerts

  • prepare the certificate(s) to add, in one of the supported file formats (binary- or base64-encoded DER, typically named .pem, .cer, .crt, or .der, or PKCS#7 certificate chain, typically named .p7b, or .p7c)

  • import your certificate into the JVM truststore with keytool (the certificate store management tool, shipped with the JVM). This command prompts for the truststore password, which by default is changeit on Oracle and OpenJDK distributions.

    keytool -import [-alias FRIENDLY_NAME] -keystore /PATH/TO/cacerts -file /PATH/TO/CERT_TO_IMPORT
    

    You may need to first make this file writable with chmod, if it is write-protected.

    You can check that the import was successful by listing the new truststore contents:

    keytool -list -keystore /PATH/TO/cacerts
    

You need to restart FM after this operation.
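
If several certificates need to be imported, the keytool invocation shown above can be driven from a script. The following Python sketch builds the same command line and runs it; the function names are hypothetical, and keytool still prompts interactively for the truststore password since no password option is passed.

```python
import subprocess
from typing import List, Optional


def keytool_import_command(cacerts: str, cert_file: str,
                           alias: Optional[str] = None) -> List[str]:
    """Build the keytool command that imports cert_file into the cacerts store."""
    cmd = ["keytool", "-import"]
    if alias:
        cmd += ["-alias", alias]  # optional friendly name for the entry
    cmd += ["-keystore", cacerts, "-file", cert_file]
    return cmd


def import_certificate(cacerts: str, cert_file: str,
                       alias: Optional[str] = None) -> None:
    """Run keytool; raises CalledProcessError if the import fails."""
    subprocess.run(keytool_import_command(cacerts, cert_file, alias), check=True)
```

Building the argument list separately from running it makes it easy to print and review the exact command before touching the truststore.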

Warning

This operation may need to be redone after an update of the JVM, or of the global system-wide certificate trust list.

Note

Instead of directly modifying the default truststore at JAVA_HOME/jre/lib/security/cacerts, you can duplicate it to a file named jssecacerts in the same directory, and update this file instead. When this file exists, it overrides the default one, which lets you preserve the original, distribution-provided version.

For a full reference on the management of SSL certificate trust stores, refer to the documentation of your Java runtime.

Add a local certificate to the system-wide certificate trust list

You need to be root for this operation.

Most Unix distributions maintain and distribute a system-wide trusted certificate list, which is in turn used by the various subsystems which need it, including all distribution-installed JVMs. Following distribution-specific procedures to add custom certificates to this list ensures that these additions are not lost upon system or JVM updates, and are available to other subsystems as well (e.g. command-line tools).

On RedHat / CentOS 7 systems, the global truststore is built with update-ca-trust(8) as follows (refer to the manpage for details):

  • (as root) add any local certificates to trust in directory /etc/pki/ca-trust/source/anchors/

  • (as root) run : update-ca-trust extract

  • optionally, check with: keytool -list -keystore JAVA_HOME/jre/lib/security/cacerts -storepass changeit

On Debian / Ubuntu systems, the global truststore is built with update-ca-certificates(8) as follows (refer to the manpage for details):

  • (as root) add any local certificates to trust in directory /usr/local/share/ca-certificates (or a subdirectory of it), as a file with extension “.crt”

  • (as root) run : update-ca-certificates

  • optionally, check with: keytool -list -keystore JAVA_HOME/jre/lib/security/cacerts -storepass changeit

You need to restart FM after this operation.

Run FM with a private truststore

If you lack the administrative access required to update the global truststore (system-wide, or JVM default), you can copy the global truststore to a private location, add your custom certificates to it, and direct FM to use it instead of the default truststore.

  • Using the same steps as the first solution above, locate the default JVM truststore at JAVA_HOME/jre/lib/security/cacerts

  • Copy this file to a private location, for instance $HOME/pki/cacerts, and make it writable

  • Using the same keytool command as the first solution above, add your custom certificates to this private truststore (the default password is again changeit)

  • In order to have FM use it for all Java processes, you need to add command-line option -Djavax.net.ssl.trustStore=/PATH/TO/PRIVATE/TRUSTSTORE to all Java processes, using the procedure documented at Adding additional Java options

    [javaopts]
    backend.additional.opts = -Djavax.net.ssl.trustStore=/PATH/TO/PRIVATE/TRUSTSTORE
    jek.additional.opts = -Djavax.net.ssl.trustStore=/PATH/TO/PRIVATE/TRUSTSTORE
    fek.additional.opts = -Djavax.net.ssl.trustStore=/PATH/TO/PRIVATE/TRUSTSTORE
    hproxy.additional.opts = -Djavax.net.ssl.trustStore=/PATH/TO/PRIVATE/TRUSTSTORE
    
  • Run fmadmin regenerate-config and restart FM to complete the operation

Python setup

FM requires Python 2.7 and comes with a default set of packages, suitable for this version of FM. These are set up by the FM installer and updated accordingly on FM upgrades.