Tutorial 101

Welcome to the first DSS tutorial.

In this tutorial, we will analyze the order book of a fictional T-shirt making company called “Haiku T-Shirt”.

We will learn step by step how to:

  • load a file,
  • clean it up,
  • visualize the data.

On our way through this hands-on scenario, we will go through the following concepts of Data Science Studio:

  • Dataset
  • Preparation Script to clean data
  • Chart to visualize the output data of the script.

Let’s get started!

Load the dataset

If you’re not logged in yet, you should be seeing a login page.

../../_images/1.png

Open the project “First project” or create a new one.

../../_images/2.png

Let’s load our first dataset inside the studio! Click on the Datasets tab (as shown in the picture) to reach the Datasets screen.

../../_images/3.png

You should find a New Dataset button here. As you click on it, a menu will offer you the different ways to import a dataset into Data Science Studio. For this tutorial, we will just ask Data Science Studio to download a CSV file from a URL. Click on the FTP / HTTP / SSH item in the menu.

../../_images/4.png

We need to supply the URL from which DSS will download our file. Copy-paste the following URL into the URL field:

http://doc.dataiku.com/tutorials/data/101/haiku_shirt_sales.csv

and click on the Download button below.

../../_images/5.png

The download went fine. Let’s now check if Data Science Studio detected our CSV format correctly by clicking on the Preview tab.

../../_images/6.png

The import is almost perfect. The CSV has been detected as using a Tab separator. One thing is wrong though: apparently the file contained a blank line between the header and the data. Let’s enter 1 in the Skip next lines field, and this should sort things out.
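For readers who like to see what the GUI is doing under the hood, here is a minimal pandas sketch of the same parsing step. The sample data below is hypothetical (a tiny excerpt imitating the file's shape: tab-separated, with a blank line after the header); the real file comes from the URL above.

```python
import io
import pandas as pd

# Hypothetical excerpt imitating the tutorial file: tab-separated,
# with one blank line between the header and the data.
raw = (
    "category\tnb_tshirts\tdepartment\n"
    "\n"
    "Black T-Shirt M\t2\t75\n"
    "Hoodie\t1\t2A\n"
)

# sep="\t" matches the detected Tab separator; skip_blank_lines drops
# the stray empty line, much like the GUI's "Skip next lines: 1".
df = pd.read_csv(io.StringIO(raw), sep="\t", skip_blank_lines=True)
print(df.shape)  # (2, 3)
```

This is only an illustration of the parsing options, not how DSS is implemented internally.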

../../_images/7.png

We can now give a name to our new dataset. Enter haiku_shirt_sales in the field on top of the screen. Finally, we need to save our work by either hitting the Save button or using the shortcut Ctrl-S.

../../_images/8.png

Note

Dataset

In Data Science Studio, we call a Dataset any piece of data that you have in tabular format. For instance, your SQL database is not a dataset, but a SQL table is. A CSV file like haiku_shirt_sales.csv is a dataset as well.

Here, our imported dataset has been copied by Data Science Studio into our instance directory. Most of the time, however, this is not practical.

You might be working with a billion rows from your SQL Tables, your gigantic HDFS-based CSV file, your MongoDB database, etc.

Most of the time, importing a dataset means that you merely inform Data Science Studio of how it can access the dataset.

An external dataset (like the one we have imported by giving an HTTP or FTP address as the external source) remembers the location of the original data.

Data Science Studio can be set up so that whenever the original data changes, the corresponding dataset in the Studio is automatically updated.

For more information about datasets, see DSS concepts.

Explore the data and clean your data with scripts

Let’s have a look at the data.

In each column, you should see that Data Science Studio detected a meaning (in our case Text, Number, or Date (needs parsing)). A gauge indicates the proportion of values in the column that do not seem to match the meaning. In our dataset, the department column has empty values, as well as invalid data.

../../_images/9.png

The Data Scientist’s Murphy’s law states that real-world data is never in the right format. Cleaning it up usually consists of a chain of scripts with a lot of business logic, which are always difficult to maintain. Sadly, a large part of a Data Scientist’s job is to clean up data. Data Science Studio has a dedicated tool to make this task more user-friendly.

Let’s get started with data manipulation: click on Create preparation script.

../../_images/10.png

You have just created an empty script on top of the dataset “haiku_shirt_sales”. We will now fill this script by adding steps.

../../_images/11.png

Let’s start by clearing the empty rows. Click on one of the empty cells in the department column. A contextual menu opens. Click on the option marked Remove rows with no values.
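To make the effect of this processor concrete, here is a small pandas sketch of the equivalent operation, on a hypothetical three-row sample (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

# Hypothetical sample: the middle row has no department value.
df = pd.DataFrame({
    "category": ["Black T-Shirt M", "Hoodie", "White T-Shirt F"],
    "department": ["75", None, "2A"],
})

# Equivalent of the "Remove rows with no values" processor applied
# to the department column: drop rows where department is missing.
cleaned = df.dropna(subset=["department"])
print(len(cleaned))  # 2
```

In DSS the original data is never modified; here too, `dropna` returns a new frame and leaves `df` untouched.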

../../_images/12.png

Notice that the removal action has appeared in the Script tab on the left. At this point, the rows are still visible but appear struck out. If you’d prefer to really see your dataset as it will appear after processing, you can click on the eye icon on the top left of your step.

../../_images/13.png

Some rows also have their department marked as invalid (red color in the gauge).

../../_images/14.png

Let’s find out which values are considered as invalid. Click on the header of the column department and in the Filter section, choose Invalid rows only.

../../_images/15.png

On the left side, in the Filter tab, we can now see the filtering on the department. Note that nothing has changed in the Script tab. The filter does not change the underlying data, it only filters which data is displayed in the tabular view. A red filter icon on the department column header reminds you that the displayed data is currently filtered on this column.

../../_images/16.png

A department is an administrative division of France. Its code is typically a 2-digit number, but Corsica’s two departments have the codes 2A and 2B. It is therefore expected that these values look invalid for a Number, and we will change the meaning to indicate that this column is really textual. Now that we have found out the nature of the invalid values, let us remove the filter on the department column by either clicking on the red filter icon in the header or on the remove cross in the element in the Filter tab.

Let’s force the meaning of the column as Text. Click on column header department, then Change and Text.
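In pandas terms, forcing the meaning to Text roughly corresponds to treating every value as a string rather than letting the 2-digit codes be parsed as numbers. A minimal sketch, on made-up department codes:

```python
import pandas as pd

# Hypothetical sample: mostly numeric codes, plus Corsica's 2A and 2B,
# which make the column textual rather than numeric.
df = pd.DataFrame({"department": ["75", "13", "2A", "2B"]})

# Equivalent of forcing the column meaning to Text:
# keep every value as a string.
df["department"] = df["department"].astype(str)
```

This is an analogy only; DSS meanings are richer than pandas dtypes (they drive the validity gauge you saw earlier).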

../../_images/17.png

Note that our action resulted in the addition of a new step in our script, on the left of the screen.

../../_images/18.png

Let’s now check the category column, which is the rightmost one. Click on its column header. A menu showing different actions related to the column appears. Select Analyse to get an overview of the distribution of the values in the column.

../../_images/19.png

A popup opens containing information about the proportion of all of the categories.

It seems that one of the categories has been misspelled many times. Surely Bk TShirt M stands for Black T-Shirt M. Let’s merge these two categories into one. Select the two categories to merge by clicking on the checkboxes on their left and click on the button marked Merge selected. The label offered for the newly merged category is Black T-Shirt M, which is perfect. Let’s confirm this operation by clicking on the Merge selected button.

../../_images/20.png

Validate the merge. The two categories have been merged together as expected! Let’s now close the analysis window, and come back to our Data Explorer. Our merge operation now appears as a processing step and processed values have a yellow background.
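The merge operation boils down to mapping the misspelled label onto the canonical one. A hedged pandas equivalent, on hypothetical rows:

```python
import pandas as pd

# Hypothetical sample containing the misspelled category.
df = pd.DataFrame({
    "category": ["Bk TShirt M", "Black T-Shirt M", "Hoodie"],
})

# Equivalent of the Merge selected operation: remap the misspelled
# label to the label chosen for the merged category.
df["category"] = df["category"].replace({"Bk TShirt M": "Black T-Shirt M"})
```

After the mapping, both former spellings count as the same category, which is exactly what the chart aggregations below rely on.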

../../_images/21.png

Note

Preparation script

While using the preparation script screen and cleaning the dataset, we have built a sequence of actions that are registered in the script. Each step in the sequence is called a processor and reflects a single data transformation. The original data is never modified.

This has many benefits:

  • First, it is like a Cancel menu on steroids: you can modify or delete a step that you added earlier.
  • Second, it is a history of actions that tells you how a dataset has been modified.
  • Last but not least, the script can also be used to create a new dataset with all the data transformations applied.

You will learn more about the power of processors in the Tutorial 102.

Visualize your data with charts

Because visualization is often the key to getting great insights from business data, we will now build charts on top of the data we have just cleaned (i.e., the output of the script). We want to compare the success of the t-shirt sales with their prices.

The visualization screen to build charts is available by clicking on the Charts tab.

../../_images/22.png

First, we’re going to plot a histogram of the number of items sold per category. Drag the box named category from the left pane to the X axis drop area, and drag the box named #nb_tshirts from the left pane to the Y axis drop area.

../../_images/23.png

Our first chart appears, but that’s not exactly what we aimed for. This plot shows the average number of items sold per purchase (indeed, the Y axis variable summary says Avg of nb_tshirts). We would like the total number of items sold per category. To correct this, click on the Average of nb_tshirts label and change the aggregation function from AVG to SUM.
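The AVG-versus-SUM distinction is easy to check in pandas on a hypothetical sample (two Black T-Shirt M purchases of 2 and 3 items):

```python
import pandas as pd

# Hypothetical cleaned sales data.
df = pd.DataFrame({
    "category": ["Black T-Shirt M", "Black T-Shirt M", "Hoodie"],
    "nb_tshirts": [2, 3, 1],
})

# AVG would give the mean items per purchase (2.5 for Black T-Shirt M);
# SUM gives the total items sold per category, as in the corrected chart.
totals = df.groupby("category")["nb_tshirts"].sum()
print(totals["Black T-Shirt M"])  # 5
```

Swapping the aggregation function in the chart editor is exactly this change of aggregate, applied per bar.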

../../_images/24.png

By default, categories are ordered by the number of sold t-shirts. You can change this setting by clicking on category in the X axis zone.

../../_images/25.png

Since we want to focus on t-shirts only, let us remove the hoodies from the chart. Drag the category variable from the left side to the Filter zone on the right side and uncheck the Hoodie checkbox.
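As with the earlier steps, the chart filter has a one-line pandas analogue (hypothetical sample values; the filter, like the one in the tabular view, leaves the underlying data untouched):

```python
import pandas as pd

# Hypothetical per-category totals.
df = pd.DataFrame({
    "category": ["Black T-Shirt M", "Hoodie", "White T-Shirt F"],
    "nb_tshirts": [5, 1, 3],
})

# Equivalent of unchecking Hoodie in the chart's Filter zone:
# keep every row whose category is not Hoodie.
tshirts_only = df[df["category"] != "Hoodie"]
```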

../../_images/26.png

We are going to split each bin depending on the t-shirt price during purchase. Drag the tshirt_price variable from the left side to the Color label just below the category variable.

../../_images/27.png

Our resulting plot isn’t really easy to interpret. Let us change the plot type. Click on the icon on the top right and choose Stacked bars.

../../_images/28.png

We’re almost done. Instead of using price ranges, let us use the raw prices of the t-shirts. Click on tshirt_price in the Color zone and set Binning to None, use raw values.

../../_images/29.png

We now have the graph we were aiming for! Let us give it a name and share it with colleagues. Click on Export this chart!

../../_images/30.png

Note

Charts

Charts help build and share quick visual analytics. They give an alternative view of the tabular output of a script.

Learn more

Congratulations, you have completed your first Data Science Studio tutorial.

Proceed to Tutorial 102 to learn more about the power of scripts and processors, or to Tutorial 103 to see how Data Science Studio can help you quickly build predictive analytics!

You can also continue exploring this user’s guide for a deeper dive.