Welcome to this DSS tutorial.
In this tutorial, we will learn
- how to enrich your data
- how to manage multiple datasets and combine their data
- how to create a new dataset based on script transformations
On our way through this hands-on scenario, we will go through the following concepts of Data Science Studio:
- advanced processors
- recipes (procedures that transform the data of one or more datasets into new datasets)
- updating datasets when your original source data has changed
Let’s get started!
We start by loading a dataset with customer information. It has three columns:
- user_id (the customer's identifier)
- birth (the customer's birth date)
- department (French district)
The goal of this section is to enrich the dataset by:
- adding information about departments
- computing the age of each customer.
In a project, click on the Datasets tab to reach the Datasets screen. Click on the New Dataset button, choose the FTP / HTTP / SSH option, and supply the URL where our file is hosted:
Proceed to the download, set the name of the dataset to customers and save it.
We are going to enrich the data, so click on the Create preparation script button.
Change the meaning of the department column to Text.
We are going to use Open Data to enrich the information about our customers' departments. Click on the Add button in the Script tab.
The list of all available data manipulation processors appears. Click on the Enrich category and choose Enrich information about a French department.
A new processor has appeared in the Script tab. Define the parameters of the processor by choosing the department column and checking the box to get demography data. New columns with a yellow background should be visible now.
We want to keep only the population of the department in year 2009. Let us remove all other columns. Click on the Data cleansing button and choose Mass remove columns.
Fill in the mass removal form to remove all columns whose names start with department, except department and department_population_2009, and click OK.
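To make the mass removal rule concrete, here is a minimal pure-Python sketch of the same operation (this is not DSS itself; the extra department_* column names and their values are made up for the example):

```python
# Illustrative sketch: drop every column starting with "department",
# except the two we explicitly want to keep.
row = {
    "department": "75",
    "department_name": "Paris",             # hypothetical enriched column
    "department_population_2009": 2234105,  # hypothetical value
    "department_area": 105.4,               # hypothetical enriched column
    "birth": "1985-03-12",
}

KEEP = {"department", "department_population_2009"}

def mass_remove(record):
    """Remove columns starting with 'department' unless they are in KEEP."""
    return {
        col: val
        for col, val in record.items()
        if not col.startswith("department") or col in KEEP
    }

cleaned = mass_remove(row)
# 'department_name' and 'department_area' are gone; 'birth' is untouched.
```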
We are now going to compute the age of each customer. We start by parsing the birth date of each customer. Click on the header of the column and select Parse date.
The studio automatically guesses which date format best fits the values in your dataset. Strangely enough, the format turns out to be an English one! Date processing is always full of surprises. Click OK.
A birth_parsed column has appeared with the standard timestamp format. We can now compute customers ages. Click on the column header and choose Compute time since.
Set that processor to compare date to now in years and set the output column to age. You’re done!
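The two steps above (Parse date, then Compute time since) can be sketched in plain Python; the MM/DD/YYYY input format and the sample date are assumptions for illustration, and the fixed "now" keeps the result reproducible:

```python
# Rough equivalent of "Parse date" followed by "Compute time since ... in years".
from datetime import date, datetime

birth_raw = "03/12/1985"  # assumed English-style MM/DD/YYYY input

# Parse date: produce a proper date from the raw string.
birth_parsed = datetime.strptime(birth_raw, "%m/%d/%Y").date()

def years_since(d, today):
    """Whole years elapsed between d and today (the processor's output in years)."""
    return today.year - d.year - ((today.month, today.day) < (d.month, d.day))

# Compute time since: compare the parsed date to "now" (fixed here).
age = years_since(birth_parsed, today=date(2014, 6, 1))
# age == 29
```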
In this section, we are going to
- compute the department population and age for all customers
- create a dataset with both customer information and sales information
We are pretty happy with our data enrichment, but it has only been computed for the customers in the displayed sample!
Working in RAM is crucial for instantaneous visual feedback in the Explore screen while we build up the data preparation script. For this reason, Data Science Studio only loads a sample of the dataset...
But don't worry: when we run our recipe for good, Data Science Studio will run the script on gigabytes of data in seconds, with a very small memory footprint.
By default, only the first 30,000 records of your dataset are loaded. This can be configured in the Sampling tab.
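Head-of-file sampling is easy to picture with a small sketch (the CSV content is invented, and the sample size is shrunk from DSS's default of 30,000 to 3 so the example stays tiny):

```python
# Sketch of head-of-file sampling: read only the first N records into memory,
# no matter how large the full file is.
import csv
import io
import itertools

SAMPLE_SIZE = 3  # DSS defaults to 30,000; kept tiny here

raw = io.StringIO(
    "user_id,department\n"
    "u1,75\nu2,13\nu3,69\nu4,33\nu5,59\n"
)

reader = csv.DictReader(raw)
sample = list(itertools.islice(reader, SAMPLE_SIZE))
# len(sample) == 3; rows u4 and u5 are never loaded
```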
Let us export the enrichment script we've just built as a Recipe. This Recipe will enable us to compute enriched values for all customers and save the results in a new dataset. Click on the Save as Recipe button.
Fill the form by giving the name customers_enriched to the new dataset and click the Add button.
Your preparation script is now exported as a Recipe. We land in the Recipes window, on the Settings / Build tab, which shows what your recipe handles.
As in cooking, you need ingredients (our input dataset, customers), and if you're a chef you'll turn them into a proper meal! The meal is the output dataset customers_enriched. Let's run our recipe by clicking on the Run button.
Your new dataset has been successfully built. You can explore it by clicking on the link Explore dataset customers_enriched.
Data has been enriched. For each customer, we now have the department population and their age. We are now able to use this data elsewhere (we will do that in a minute). For now, let us have a look at what happened under the hood. Click on the dropdown menu at the top left and choose Flow.
You see here a graph of the Datasets available in your current Project ("First project") within Data Science Studio. It lets you visualize how your data is processed, from input datasets to output datasets. This Flow gives you an overview of your whole project.
Click on the dataset haiku_shirt_sales and then on the Explore button on the right side.
You are back on the dataset from tutorial 101. Let us add the customer information to this dataset. Open the preparation script we previously built by clicking on the Edit last preparation script button.
Then click on the Add button in the script tab, filter to find all the Join processors and select the Join (memory-based) processor.
We are going to set up the processor so that for each sales line we get the customer information (their age and department) when the user_id value matches. Fill in the form:
- Join column (here) is user_id,
- Dataset to Join with is customers_enriched,
- Join column (in other dataset) is also user_id,
- Columns to retrieve are department_population_2009 and age,
- Prefix is customer_.
The columns department_population_2009 and age are added on the right.
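The memory-based join configured above can be sketched as a simple lookup; this is an illustrative equivalent, not DSS's implementation, and all the values are invented:

```python
# Sketch of a memory-based join: for each sales line, look up the matching
# enriched customer row by user_id and copy over the requested columns
# with a "customer_" prefix.
customers_enriched = {
    # user_id -> enriched customer record (values made up)
    "u1": {"department_population_2009": 2234105, "age": 29},
    "u2": {"department_population_2009": 1972018, "age": 41},
}

sales = [
    {"user_id": "u1", "amount": 19.9},
    {"user_id": "u2", "amount": 35.0},
]

RETRIEVE = ("department_population_2009", "age")  # columns to retrieve

def join(sales_rows, lookup, prefix="customer_"):
    """Left join: keep every sales line, adding prefixed customer columns."""
    out = []
    for row in sales_rows:
        joined = dict(row)
        match = lookup.get(row["user_id"], {})
        for col in RETRIEVE:
            joined[prefix + col] = match.get(col)
        out.append(joined)
    return out

result = join(sales, customers_enriched)
# result[0]["customer_age"] == 29
```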
Let us export this script as a Recipe to compute a dataset combining sales information and customer information. Click on the Save as recipe button:
... and set the name of the New dataset to sales_and_customers and click Add.
A new recipe has been added. Go to the Settings / Build tab and click on the Run button to build the dataset.
When the build job has completed, click on the Actions button and choose View in Flow to see how this is reflected in the data flow.
Notice that the dataset sales_and_customers is built from the two other datasets.
Data-driven Flow reconstruction
Data flow is a very handy way to see how data has been processed in Data Science Studio.
More specifically, the graph of the flow has another benefit. When data changes over time, Data Science Studio can update the datasets that would be affected by these changes. Because the flow reflects these dependencies, updates will only be run when necessary.
In other words, Data Science Studio is data-driven: depending on the settings, the Studio monitors data changes and dependencies to keep everything up to date.
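The dependency idea can be illustrated with a toy model of the flow graph; the dataset names match this tutorial, but the function itself is just a sketch of the principle, not DSS internals:

```python
# Toy sketch of dependency-driven rebuilds: given the flow as a graph mapping
# each output dataset to its inputs, only datasets downstream of a change
# need to be rebuilt.
flow = {
    # output dataset -> input datasets it is built from
    "customers_enriched": ["customers"],
    "sales_and_customers": ["haiku_shirt_sales", "customers_enriched"],
}

def needs_rebuild(changed, flow):
    """Return every dataset downstream of the changed source."""
    stale = set()
    progressed = True
    while progressed:  # propagate staleness until a fixed point is reached
        progressed = False
        for out, inputs in flow.items():
            if out not in stale and any(i == changed or i in stale for i in inputs):
                stale.add(out)
                progressed = True
    return stale

# If the source customers data changes, both derived datasets are stale;
# if only haiku_shirt_sales changes, customers_enriched is left alone.
stale = needs_rebuild("customers", flow)
```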