Amazon SageMaker is a service for building, training, and deploying machine learning models. By integrating SageMaker with Dataiku DSS via the SageMaker Python SDK, you can prepare data using Dataiku visual recipes and then access the machine learning algorithms offered by SageMaker's optimized execution engine.
If not yet installed, be sure to install the AWS Command Line Interface (CLI):
$ pip install awscli
Once installed, the AWS CLI can be used to quickly configure your installation:
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
Further details on keys, regions, and output formats can be found in the AWS CLI documentation.
SageMaker's Python SDK can be installed in a dedicated Python code environment within a Dataiku DSS instance. See Code environments for more in-depth information on code environments in Dataiku DSS.
Within your instance, go to Administration > Code Envs > New Python Env.
The new environment should be deployed as Managed by DSS. SageMaker’s SDK is compatible with Python 2.7 and Python 3.6. Installation of Mandatory Packages and Jupyter support should remain checked.
Once the environment has been created, additional packages can be installed via pip by simply adding the package name to the list of requested packages. Clicking Save and Update triggers the installation of the requested packages, along with any dependencies required for SageMaker.
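For example, the requested packages list can contain just the SDK entry below; pip then resolves SageMaker's dependencies (such as boto3) on its own:

```
sagemaker
```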
Finally, once the packages have been installed in the code environment successfully, the code environment can be used for specific recipes, or as the default code environment for an entire project. See the following instructions on how to select a code environment for more details.
The initial data and training data for models created using SageMaker must be contained in an S3 bucket. Dataiku DSS already has the ability to connect to Amazon S3, import a dataset from an S3 bucket, and write back to S3. When creating recipes within the Flow, be sure to select aws in the Store into field. Settings within the recipe can be changed to specify the bucket and path to store the data when writing back to S3.
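As a sketch, the S3 locations that a notebook will read from and write to can be assembled from the bucket name and a path prefix. The bucket and prefix below are placeholders; substitute the bucket used by your S3 connection in Dataiku DSS:

```python
# Hypothetical bucket and prefix; substitute your own S3 connection's bucket
bucket = "my-dataiku-bucket"
prefix = "sagemaker/demo"

# S3 URIs that SageMaker will use for training input and model output
train_uri = "s3://{}/{}/train".format(bucket, prefix)
output_uri = "s3://{}/{}/output".format(bucket, prefix)
```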
Role information, paths, and imports will need to be included in the Jupyter notebook. From inside a SageMaker notebook, get_execution_role() returns the IAM role that was passed in as part of the notebook creation. Outside of these notebooks, that function will not work, and the IAM role ARN (found in your AWS Security Credentials page) will need to be pasted in directly. Once proper credentials and paths are set, training and deploying a model works exactly the same in Dataiku DSS as it would in a SageMaker Notebook Instance.
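A minimal sketch of that role setup; the account ID and role name below are illustrative placeholders, not real credentials:

```python
# Inside a SageMaker notebook instance you would normally call:
#   from sagemaker import get_execution_role
#   role = get_execution_role()
# From a Dataiku DSS-hosted notebook, paste the IAM role ARN directly instead:
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder ARN
```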
Once the role, bucket, and prefix variables are set, training a model is very straightforward in a Dataiku DSS-hosted Jupyter notebook. A custom algorithm can be created and used on SageMaker, or one of SageMaker's built-in algorithms can be used. Instructions on how to train a model in a Jupyter notebook can be found in the AWS documentation; the instructions for the Python SDK should be referenced.
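The training step might look like the following sketch, using XGBoost as an example of a built-in algorithm. The SDK calls are shown as comments because they require a live AWS connection; the hyperparameter values and variable names (role, container_image, train_uri, output_uri) are illustrative, not prescribed:

```python
# Illustrative hyperparameters for SageMaker's built-in XGBoost algorithm
hyperparameters = {
    "objective": "binary:logistic",
    "max_depth": 5,
    "eta": 0.2,
    "num_round": 100,
}

# With a live AWS connection, training would proceed roughly as:
#   import sagemaker
#   est = sagemaker.estimator.Estimator(
#       container_image, role,
#       train_instance_count=1, train_instance_type="ml.m4.xlarge",
#       output_path=output_uri, sagemaker_session=sagemaker.Session())
#   est.set_hyperparameters(**hyperparameters)
#   est.fit({"train": train_uri})
```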
Deployment of an already trained SageMaker model can be done from within Dataiku DSS using a batch transform deployment. This deployment can happen in the same Jupyter notebook in which the model was trained, or in a new Python recipe. A batch transform runs almost exactly as it does in a SageMaker-hosted Jupyter Notebook instance. The output of the transform job, once it completes successfully, will be the results of the model's performance on a test dataset.
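A batch transform sketch in the same spirit; again, the SDK calls are commented because they need a live AWS connection, and the bucket, paths, and estimator name `est` are placeholders:

```python
# Placeholder S3 location for the transform job's output
bucket = "my-dataiku-bucket"  # hypothetical bucket name
batch_output = "s3://{}/sagemaker/demo/batch-output".format(bucket)

# With a trained estimator `est`, a batch transform would run roughly as:
#   transformer = est.transformer(
#       instance_count=1, instance_type="ml.m4.xlarge",
#       output_path=batch_output)
#   transformer.transform("s3://{}/sagemaker/demo/test".format(bucket),
#                         content_type="text/csv", split_type="Line")
#   transformer.wait()
```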
Code errors or failures will be reported in the notebook as usual, but if the errors are unclear, more specific information can be found by logging into your AWS Management Console > Amazon SageMaker. From there, select the type of job that failed (e.g., Training, Batch Transform) and then the specific failed job. Similarly, successful jobs can be viewed for performance information, ARN details, and more from the Management Console. If jobs run in Dataiku DSS are not showing up in your Management Console, the connection has not been set up correctly.