Tutorial 103

Welcome to this DSS tutorial.

In this tutorial, we will analyze the historical orders data of a fictional T-shirt making company called “Haiku T-Shirt”.

We will learn step by step how to predict the expected yearly income for new customers using their first interactions with the company and information gathered during their registration.

On our way through this hands-on scenario, we will go through the following concepts of the Data Science Studio:

  • How to build a predictive model,
  • How data enrichment can improve prediction quality.

Let’s get started!

Predict an expected income

Let us load the dataset containing past users’ interactions and the yearly income they generated. Make sure you are working within a Project, then click on the Datasets tab, click on New dataset and choose FTP / HTTP / SSH.

../../_images/112.png

Supply the URL from which DSS will download our file:

http://doc.dataiku.com/tutorials/data/103/first_interactions.csv

Proceed to download, set the name of the dataset to first_interactions and save it.

../../_images/212.png

Click on the Preview tab. The dataset has the following columns:

  • user_id
  • birth
  • country
  • page_visited (number of visited pages on the website during the first visit)
  • first_item (price of the first purchased item)
  • gender
  • campaign (whether the user came as part of a marketing campaign)
  • revenue (total income after a year)

Each line corresponds to a client.
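
For readers who like to see the data layout outside the studio, here is a minimal sketch that parses rows with the columns listed above using Python's standard csv module. The sample rows are made up for illustration, and the value formats (date layout, True/False flags, country names) are assumptions, not taken from the real first_interactions.csv file.

```python
import csv
import io

# Column names as listed in the tutorial.
COLUMNS = ["user_id", "birth", "country", "page_visited",
           "first_item", "gender", "campaign", "revenue"]


def read_interactions(text):
    """Parse CSV text into a list of dicts, one per customer."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up sample rows (hypothetical values, NOT from the real file).
SAMPLE = """user_id,birth,country,page_visited,first_item,gender,campaign,revenue
u1,1985-03-12,France,4,15.0,F,True,120.5
u2,1990-11-02,United States,7,25.0,M,False,310.0
"""

rows = read_interactions(SAMPLE)
print(len(rows), "customers;", "first country:", rows[0]["country"])
```

Every value comes back as a string here; deciding which columns are numbers and which are categories is a separate step.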

../../_images/32.png

Our goal is to predict (i.e. guess) the revenue value knowing the other columns. If we could predict this correctly, we could effectively assess the coming turnover for all newcomers one year in advance! Let’s go! Click on the Models tab.

../../_images/42.png

... and click on the New model button.

../../_images/52.png

Choose Prediction.

../../_images/62.png

Fill in the form for the New prediction model. It will be based on the first_interactions dataset and aims to predict the revenue variable. Click on the Create button.

../../_images/72.png

Fill in the Model name input with Predict revenue for newcomers.

Our prediction model now needs to learn from all past sales interactions and incomes in order to predict revenue from newcomers. This learning step for a prediction model is called Training the model. Click on the Train now button.

../../_images/82.png

The training may take a while. Wait until completion.
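
To give an intuition of what "training" means, here is a deliberately tiny sketch: a one-variable least-squares line that learns revenue from the price of the first item. This is not what the studio runs (the tutorial does not say which algorithms it trains), and the numbers are made up; the function names fit_line and predict are hypothetical.

```python
def fit_line(xs, ys):
    """Learn slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a trained (slope, intercept) model to a new value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up past customers: price of first item and yearly revenue.
first_item = [10.0, 20.0, 30.0, 40.0]
revenue = [50.0, 90.0, 130.0, 170.0]

model = fit_line(first_item, revenue)   # "training"
print(predict(model, 25.0))             # prediction for a newcomer
```

Training produces the model parameters from past customers; prediction then applies them to a customer whose revenue is not yet known.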

Prediction quality

On the left side of the screen, we see that we have built two different predictive models. Choose any one.

In the center of the screen, we see a plot with many dots that tells us about the prediction quality. The X axis shows the actual revenue values and the Y axis shows the predicted revenue values.

But wait! How could we know the prediction quality for customers we have never met? Actually, we can’t. For now, we only mimic this by checking whether our predictor does its job correctly on some randomly picked current customers (for whom we know the yearly revenue).
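
The idea of setting aside randomly picked customers can be sketched as a simple hold-out split. The 20% test ratio and the fixed seed below are assumptions for illustration; the tutorial does not state how the studio actually splits the data.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows and set aside a fraction of them for testing."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for ten known customers (hypothetical data).
customers = list(range(10))
train, test = train_test_split(customers)
print(len(train), "customers to learn from,", len(test), "kept aside to check predictions")
```

The model only learns from the first group; its predictions are then compared with the known revenue of the second group.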

If the model were able to predict the actual values perfectly, all dots would lie on the diagonal. That’s unfortunately not the case. Our model has a score of about 0.5 on a 0 to 1 scale, where 1 would be a perfect prediction and 0 a complete inability to predict revenue. Our model was only able to predict very vaguely what revenue could be. (Note: it is normal if you do not get exactly the same scores in your experiments as in this tutorial.)
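
The tutorial does not name the score it reports. A common score with this behaviour is the coefficient of determination (R²), so the sketch below assumes that metric: it equals 1 when every prediction matches the actual value and 0 when the model does no better than always predicting the average revenue.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - residual error / total variance."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up revenues for three test customers.
actual = [100.0, 200.0, 300.0]

perfect = r2_score(actual, [100.0, 200.0, 300.0])     # all dots on the diagonal
mean_only = r2_score(actual, [200.0, 200.0, 200.0])   # always predict the average
print(perfect, mean_only)
```

Note that R² can drop below 0 for a model that is worse than predicting the average, so the "0 to 1 scale" is a simplification under this assumption.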

../../_images/92.png

Our model is not perfect but can still be a little informative. Let us see how customers’ information impacts the predicted revenue. Click on the Variables importance tab.

../../_images/102.png

We see that for this model, the most important variable to predict revenue is the price of the first item people have purchased. Then come whether or not people came from a marketing campaign, the gender, and lastly the number of pages visited on the first visit.

Note that we do not see the country or the birth variables in the list. What happened to these variables? Click on the Information tab. There’s plenty of information about the model. For now, you don’t need to understand all of it. Just scroll down a little bit to reach the Features handling section.

../../_images/113.png

We see that three Features (or columns if you prefer) have been rejected during our model construction: birth, country and user_id. Let’s see what happened to these variables. Click on the Features tab at the model bench level.

../../_images/122.png

Prediction model settings

We see here the list of available features in our dataset and how they are handled by the model bench. Let us first analyse the left side of the screen. When we created the prediction model, Data Science Studio automatically:

  • detected the type of the columns used by the model: they are either numerical (indicated by the # sign at the left of the feature’s name) or categorical (indicated by the A sign at the left of the feature’s name);
  • preset the handling of the columns used by the model: a column is either taken into account to build the model or not. When a column is used, some preprocessing is applied to it. This is summarized in the Handling part.

../../_images/132.png

Knowing the type of the features is very important for the model construction:

  • If a column is numeric, the model checks how increases and decreases of the numerical values interact with actual revenue.
  • If a column is not numeric, things are a lot more complicated. We call each value a category, in the sense that each row belongs to the category corresponding to its value in that column. Gender is a good example of such a case. That’s fine when there are not too many distinct values. But when there are lots of distinct values, it is hard to consider them as categories. Think of the columns user_id or birth. They are not numerical and should not be treated as categorical variables. Because of this uncertainty, Data Science Studio decided to drop these variables from the prediction models.

The right side of the screen shows detailed information about the selected feature (you can click on a different one in the left panel). It states again whether the feature is handled and gives an explanation when Data Science Studio has decided to reject it.

For example, the birth feature has been rejected because of a high cardinality (i.e. the high number of distinct values).
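
The kind of decision described above can be sketched as a small heuristic: a column is numerical if all its values parse as numbers, categorical if it has few distinct values, and rejected otherwise. The function name guess_handling and the threshold of 50 distinct values are purely hypothetical; the tutorial does not document the studio's actual rule.

```python
def guess_handling(values, max_categories=50):
    """Return 'numerical', 'categorical' or 'rejected' for a column.

    max_categories is a made-up threshold, not the studio's real one.
    """
    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in values):
        return "numerical"
    if len(set(values)) <= max_categories:
        return "categorical"
    return "rejected"          # too many distinct values (high cardinality)


# Made-up column samples.
print(guess_handling(["4", "7", "12"]))                  # page_visited-like
print(guess_handling(["F", "M", "F"]))                   # gender-like
print(guess_handling([f"u{i}" for i in range(100)]))     # user_id-like
```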

../../_images/142.png

We know from school that even though there are many countries in the world, the number of countries isn’t so enormous (at least compared to the possible values of birthdate!). So let us decide to include this feature in the model bench. Click on the use checkbox corresponding to the country variable in the left side of the screen.

../../_images/152.png

We see that we have only 132 different values. Fine. Save your settings and let’s retrain our models on the bench by clicking on the Train button.

../../_images/161.png

Wait until completion... Our prediction model is a lot better than before: the dots are closer to the diagonal and the model score should now be around 0.78!

../../_images/172.png

Click on the Variables importance tab. You see the result of the dummification of the country variable: there are now as many variables as there were categories in country, each one indicating whether a customer belongs to that country. Living in the U.S. seems crucial in predicting the coming revenue. We can now compare which countries are the most influential. In addition, the original variables first_item, page_visited and gender are still visible, and we can compare their influence with that of the country variables.
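
Dummification (also called one-hot encoding) can be sketched in a few lines: one 0/1 column is created per distinct category. The naming scheme "country:France" and the sample rows are assumptions for illustration, not the studio's actual output format.

```python
def dummify(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}:{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up customers.
customers = [
    {"country": "France", "gender": "F"},
    {"country": "Japan", "gender": "M"},
    {"country": "France", "gender": "M"},
]

encoded = dummify(customers, "country")
print(encoded[0])
```

With 132 distinct countries, this produces 132 new columns, which is why each country can show up separately in the variable importance list.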

../../_images/182.png

Enrich data to improve prediction quality

We want to have a better prediction! Let us compute the birth year of the customers and include this info in our model. This task has to be done using a script and a recipe as we saw in Tutorial 101 and Tutorial 102. Click on the dropdown menu on the top left and choose Datasets.

../../_images/192.png

Click on the dataset first_interactions and choose Explore it

../../_images/202.png

... and create a Preparation script.

../../_images/213.png

Let us add a processor to extract the birth year (details are in Tutorial 102). We start by parsing the birth date of each customer. Click on the header of the column and select Parse date.

../../_images/222.png

The studio automatically guesses which date format best fits the values in your dataset. Click OK. A birth_parsed column has appeared, in the standard timestamp format. We can now compute the customers’ age. Click on the column header and choose Compute time since.

../../_images/232.png

Set that processor to compare date to now in years and set output column to age.
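
Outside the studio, the same enrichment can be sketched with Python's datetime module. The date format "%Y-%m-%d" is an assumption about how birth is stored, and this version returns whole completed years; the processor in the studio may round or represent the result differently.

```python
from datetime import date, datetime


def age_in_years(birth, today=None, fmt="%Y-%m-%d"):
    """Number of completed years between a birth date string and today."""
    today = today or date.today()
    born = datetime.strptime(birth, fmt).date()
    years = today.year - born.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (born.month, born.day):
        years -= 1
    return years


# Made-up birth date, evaluated at fixed reference dates for repeatability.
day_before = age_in_years("1985-03-12", today=date(2020, 3, 11))
birthday = age_in_years("1985-03-12", today=date(2020, 3, 12))
print(day_before, birthday)
```

Unlike the raw birth date, the resulting age is a plain number, so a model can treat it as a numerical feature.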

../../_images/242.png

Delete the birth_parsed column by clicking on its header.

../../_images/252.png

Now export the script as a recipe. Click on Actions and choose Create recipe (Add to Flow)

../../_images/262.png

... and fill the input New dataset name with first_interactions_enriched and click Add.

../../_images/272.png

You are now viewing the recipe, with the script transformations recorded. Move to the Settings / Build tab and Run the recipe

../../_images/282.png

... and when building completes, click on the dropdown menu on the top left and go back to Models.

../../_images/291.png

Click on the New button to create a new prediction model, choose the dataset first_interactions_enriched and select revenue as the variable to predict.

../../_images/301.png

Let us turn on the country dummification: click on Features tab

../../_images/311.png

... and set country to use.

../../_images/321.png

Save your settings and launch a Train session by clicking on Train.

../../_images/33.png

On training completion, click on View results. Our prediction is a little better with a prediction score close to 0.8!

../../_images/34.png

Wanna see more?

Do you want an even better model? Do you want to learn how to use the trained model? Visit Tutorial 104 to see how far you can go with the model bench!