Welcome to the first DSS tutorial.
In this tutorial, we will analyze the order book of a fictional T-shirt making company called “Haiku T-Shirt”.
We will learn step by step how to
- load a file,
- clean it up,
- visualize data
On our way through this hands-on scenario, we will go through the following concepts of Data Science Studio :
- Preparation Script to clean data
- Chart to visualize the output data of the script.
Table of contents
Let’s get started !
If you’re not logged in yet, you should be seeing a login page.
Open the project “First project” or create a new one.
Let’s load our first dataset inside the studio! Click on the Datasets tab (as shown in the picture), to reach the Datasets screens.
You should find a New Dataset button here. As you click on it, a menu will offer you the different ways to import a dataset into Data Science Studio. For this tutorial, we will just ask Data Science Studio to download a CSV file from an URL. Click on the FTP / HTTP / SSH item in the menu.
We need to supply the URL from which DSS will download our file. Copy-paste the following URL into the URL field :
and click on the Download button below.
The download went fine. Let’s now check if Data Science Studio detected our CSV format correctly by clicking on the Preview tab.
The import is almost perfect. The CSV has been detected as using a Tab separator. One thing is wrong though... Apparently the file contained a blank line between the header and the data. Let’s just input 1 in the Skip next lines and this should sort things out.
We can now give a name to our new dataset. Enter haiku_shirt_sales in the field on top of the screen. Finally, we need to save our work by either hitting the Save button or using the shortcut Ctrl-S.
In Data Science Studio, we call Dataset any piece of data that you have, and which is in tabular format. For instance, your SQL Database is not a dataset, but a SQL Table is. A CSV file like
haiku_shirt_sales.csv is a dataset as well.
Here, our imported dataset has been be copied by Data Science Studio in our instance directory. But most of the time, it is not practical.
You might be working with a billion rows from your SQL Tables, your gigantic HDFS-based CSV file, your MongoDB database, etc.
Most of the time, importing a dataset means that you merely inform Data Science Studio of how it can access the dataset.
An external Dataset (like the one we have imported by giving the external source as an HTTP or FTP address)**remembers** the location of the original data.
Data Studio Science can be set so that whenever the original data changes, the corresponding dataset in the Studio is automatically updated.
For more information about datasets, see DSS concepts
Let’s have a look at the data.
In each column, you should see that Dataiku Science Studio detected a meaning (in our case Text, Number, Date (needs parsing)). A gauge indicates the ratio of the column for which the values do not seem to match the meaning. In our dataset, the department has empty values, as well as invalid data.
The Data Scientists’ Murphy’s law states that real world data is never in the right format. Cleaning it up usually consists of a chain of scripts with a lot of business logic, that are always difficult to maintain. Sadly, a large part of the job of the Data Scientist is to clean up data. Data Science Studio has a dedicated tool to make this task more user-friendly.
Let’s get started with data manipulation : click on Create preparation script.
You have just created an empty script on top of the dataset “haiku_shirt_sales”. We will now fill this script by adding steps.
Let’s start by clearing the empty rows. Click on one of the empty cells in the department column. A contextual menu opens. Click on the option marked Remove rows with no values.
Notice that the removal action has appeared in the Script tab on the left. At this point, the rows are still visible but appear struck out. If you’d prefer to really see your dataset as it will appear after processing, you can click on the eye icon on the top left of your step.
Some rows also have their department marked as invalid (red color in the gauge)
Let’s find out which values are considered as invalid. Click on the header of the column department and in the Filter section, choose Invalid rows only.
On the left side, in the Filter tab, we can now see the filtering on the department. Note that nothing has changed in the Script tab. The filter does not change the underlying data, it only filters which data is displayed in the tabular view. A red filter icon on the department column header reminds you that the displayed data is currently filtered on this column.
A department is an administrative division of France. This area code is typically a 2-digit number, but Corsica’s two departments have for respective code 2A and 2B. Therefore, it is expected that these values look invalid for a Number and we will change the meaning to indicate that this column is really textual. Now that we have found out the nature of invalid values, let us remove the filter on the department column by either clicking on the header red filter icon or on the remove cross in the element in the Filter tab.
Let’s force the meaning of the column as Text. Click on column header department, then Change and Text.
Note that our action resulted in the addition of a new step in our script, on the left of the screen.
Let’s now check the category column which is the rightmost one. Click on its column header. A menu showing different actions related to the column appears. Select Analyse, to get an overview of the distribution of the values of the column.
A popup opens containing information about the proportion of all of the categories.
It seems that one of the categories has been wrongly spelled many times. Surely Bk TShirt M is standing for Black T-Shirt M. Let’s merge these two categories into one. Select the two categories to merge together by clicking on the checkboxes on their left and click on the button marked Merge selected. The label offered for newly merged category is Black T-Shirt M which is perfect. Let’s just confirm this operation by clicking on the Merge selected button.
Validate the merge. The two categories have been merged together as expected! Let’s now close the analysis window, and come back to our Data Explorer. Our merge operation now appears as a processing step and processed values have a yellow background.
While using the preparation script screen and cleaning the dataset, we have built a sequence of actions that are registered in the script. Each step in the sequence is called a processor and reflects a single data transformation. The original data is never modified.
This has many benefits:
- First, it is like a Cancel menu on steroids. You can modify / delete a step that you added earlier.
- Second, it is an history of actions that tells you how a dataset is been modified.
- Last but not least, the script can also be used to create a new dataset with all the data transformations.
You will learn more about the power of processors in the Tutorial 102.
Because Visualization is often the key to get great insights about business data, we will now construct charts on top the data we have just cleaned (ie, the output of the script). We want to compare the success of the t-shirts sales with their prices.
The visualization screen to build charts is available by clicking on the Charts tab.
First, we’re going to plot an histogram of the number of items sold per category. Drag the box named category from the left pane, to the X axis drop area and drag the box named #nb_tshirts from the left pane to the Y axis drop area.
Our first chart appears, but that’s not exactly what we aimed for. This plot is showing the average number of items sold per purchase (Indeed, the Y axis variable summary says Avg of nb_tshirt). We would like to have a total number of items sold per category. To correct this, click on the Average of nb_tshirts label and change the setting of the aggregation function from AVG to SUM.
By default, categories are ordered by the number of sold t-shirts. You can change this setting by clicking on category in the X axis zone.
Since we want to focus on t-shirts only, let us remove the hoodies from the chart. Drag the category variable from the left side to the Filter zone on the right side and unclick the Hoodie checkbox.
We are going to split each bin depending on the t-shirt price during purchase. Drag the tshirt_price variable from the left side to the Color label just below the category variable.
Our resulting plot isn’t really easy to interpret. Let us change the plot type. Click on the icon on top right and choose Stacked bars.
We’re almost done. Instead of using ranges of price, let us use the raw prices of the t-shirts. Click on the tshirt_price in the Color and set Binning to None. Use raw values.
We now have the graph we were aiming for! Let us give it a name and share it with colleagues. Click on Export this chart!
Charts help building and sharing quick visual analytics. They give an alternative vision of the tabular view of a script’s output.
Congratulations, you have completed your first Data Science Studio tutorial.
You can also continue exploring this user’s guide for a deeper dive.