Partitioned SQL datasets

Data Science Studio provides partitioning support both on SQL databases that have native partitioning support and on those that do not.

Performance will generally be better when the SQL database has native partitioning support.

Partitioning SQL datasets is achieved by checking the “Partitioned” checkbox in the dataset edition screen. Here is how to choose the settings.

Partitioning SQL table datasets

Datasets based on SQL tables support at most one dimension of partitioning, which is based on the values of a single column. The partitioning column must be part of the schema of the table.

External SQL table datasets

For an external SQL table dataset, you must configure:

  • Which column provides the partitioning values. For example “country” or “day”.

  • The type and parameters of the partitioning dimension. For example:
    • Discrete values
    • Time with periodicity DAY

It is recommended that you test the connectivity to the table before checking the “Partitioned” checkbox: once it is checked, the partitioning configuration becomes mandatory, and the columns list is hidden until you have finished configuring partitioning.

Data Science Studio can automatically list the available partitions by enumerating the values of the provided column. In some cases, enumerating the values of a column can be prohibitively costly. This is particularly the case when working on a non-column-oriented database without index on the column. In that case, you may specify a comma-separated explicit list of available partitions. Data Science Studio will use this list whenever it needs to list the partitions of the dataset. Each element of the list can be a partition identifier or a partition range specification. For more information on partition identifiers, see Partition identifiers.
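For example, for a dataset partitioned by day, an explicit list could look like this (the values are illustrative; the second element is a partition range specification):

```
2013-01-01,2013-01-02,2013-01-05/2013-01-31
```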

If your database engine natively supports partitioning, and the external table uses it, DSS suggests using the native partitioning support. Checking this option makes listing and removing partitions much more efficient. It might also make reading the records of a partition more efficient.

In the dataset edition screen, Data Science Studio automatically selects an arbitrary partition to preview.

Managed SQL table datasets

For partitioning, Managed SQL table datasets behave like external SQL table datasets.

Note

In a managed SQL table dataset, you define the schema of the table yourself. Don’t forget that the partitioning column MUST be in the table schema. If it is not, testing the dataset will fail.

Partitioning SQL query datasets

SQL query datasets provide additional flexibility when it comes to partitioning (with a more complex setup).

To summarize:

  • You can use multiple partitioning dimensions.
  • The SQL query must use specific patterns that are replaced by the requested partition values.
  • You must provide a way to list the partitions.
  • You must provide a partition identifier that Data Science Studio will use to perform the preview in the dataset screen.

Let’s take an example. If we have the following database schema:

event {
    geography_id integer,
    type string,
    user_id integer,
    timestamp integer
}
geography {
    id integer,
    continent string,
    country string,
    region string,
    city string
}
user {
    id integer,
    sex varchar(1)
}

We want to create a dataset with the following columns:

  • event_type
  • user_sex
  • timestamp
  • city

And we want that dataset to be partitioned by day and country. We cannot do that directly using SQL table datasets, and need to use SQL query datasets.

Note

Although this is an example, it should not be considered a good practice: when doing a data analysis project, it is very important to denormalize the data as early as possible. Joining three tables to create an analytical dataset is expensive.

We first need to declare our query-based dataset, and to configure the two partitioning dimensions:

  • A discrete values dimension named “country”
  • A time dimension on DAY level named “day”

Then, we can create our query. The query must return the records for a single partition. When Data Science Studio needs to fetch the records of a partition, it will take the SQL query that we have entered, and replace all ${dimensionName} patterns in the query by the value of the dimension for this partition.

So our query will be:

    SELECT event.type AS event_type, user.sex AS user_sex, event.timestamp AS timestamp, geography.city AS city
    FROM event
        INNER JOIN geography ON geography.id = event.geography_id
        INNER JOIN user ON user.id = event.user_id
    WHERE geography.country = '${country}' AND DATE_FORMAT(event.timestamp, 'yyyy-MM-dd') = '${day}'

In the substitution values, a time partition will be given using the Data Science Studio partition identifier syntax (see below).
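The substitution mechanism can be sketched as follows. This is a minimal illustration, not DSS’s actual implementation; the dimension names match the example above, and time dimensions are passed in partition identifier syntax:

```python
import re

def substitute_partition(query, partition_values):
    """Replace ${dimensionName} patterns in a query with the
    partition's dimension values (illustrative sketch only)."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: partition_values[m.group(1)],
                  query)

query = "SELECT * FROM event WHERE country = '${country}' AND day = '${day}'"
print(substitute_partition(query, {"country": "france", "day": "2013-04-06"}))
# -> SELECT * FROM event WHERE country = 'france' AND day = '2013-04-06'
```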

We then need to provide Data Science Studio with a way to list all available partitions. The preferred way is to provide a SQL query able to list partitions. It should return a result set with one column for each partitioning dimension and one row for each partition.

Here, we could achieve this with:

SELECT day, country FROM (
    SELECT DATE_FORMAT(event.timestamp, 'yyyy-MM-dd') AS day, geography.country AS country
    FROM event INNER JOIN geography ON geography.id = event.geography_id
) AS partitions
GROUP BY day, country;

The subquery + GROUP BY is required to deduplicate multiple records for the same day/country pair.
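The shape of the expected result set (one column per dimension, one row per partition) can be checked on a toy dataset. This sketch uses Python’s sqlite3 with SQLite’s strftime in place of DATE_FORMAT; the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy versions of the event and geography tables from the example schema
conn.executescript("""
CREATE TABLE event (geography_id INTEGER, type TEXT, user_id INTEGER, timestamp TEXT);
CREATE TABLE geography (id INTEGER, continent TEXT, country TEXT, region TEXT, city TEXT);
INSERT INTO geography VALUES (1, 'Europe', 'france', 'IDF', 'Paris');
INSERT INTO event VALUES (1, 'click', 1, '2013-04-06 10:00:00'),
                         (1, 'view',  1, '2013-04-06 11:00:00'),
                         (1, 'click', 1, '2013-04-07 09:00:00');
""")
# Same structure as the partition-listing query above
rows = conn.execute("""
    SELECT day, country FROM (
        SELECT strftime('%Y-%m-%d', event.timestamp) AS day,
               geography.country AS country
        FROM event INNER JOIN geography ON geography.id = event.geography_id
    ) GROUP BY day, country
""").fetchall()
print(sorted(rows))  # -> [('2013-04-06', 'france'), ('2013-04-07', 'france')]
```

Note how the two events of 2013-04-06 are collapsed into a single partition row by the GROUP BY.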

In this example, we can see that listing all partitions is a very costly operation as it needs to perform a full scan of the event table. Due to the reformatting that we have to perform to extract the day, we cannot properly use indices.

On a large event database, that cost might be prohibitively high.

In that case, instead of providing a SQL query to list partitions, we can explicitly list the available partitions by entering something like: 2013-01-01/2013-10-01|france,uk,italy,germany (see Partition identifiers for details).
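One plausible reading of such a specification is a day range crossed with a list of countries. The sketch below is an assumption made for illustration; DSS’s actual parser supports more forms (see Partition identifiers):

```python
from datetime import date, timedelta

def expand_partition_spec(spec):
    """Expand a two-dimension spec such as '2013-01-01/2013-01-03|france,uk'
    into individual day|country partition identifiers.
    Illustrative assumption only, not DSS's actual parser."""
    day_part, country_part = spec.split("|")
    if "/" in day_part:
        start_s, end_s = day_part.split("/")
        start, end = date.fromisoformat(start_s), date.fromisoformat(end_s)
        days = [(start + timedelta(days=i)).isoformat()
                for i in range((end - start).days + 1)]
    else:
        days = [day_part]
    # Cartesian product of the two dimensions
    return [f"{d}|{c}" for d in days for c in country_part.split(",")]

print(expand_partition_spec("2013-01-01/2013-01-02|france,uk"))
# -> ['2013-01-01|france', '2013-01-01|uk', '2013-01-02|france', '2013-01-02|uk']
```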

Note that explicitly listing partitions is generally not desirable for “live” datasets, for which new partitions are created each day.

To finish, we need to explicitly give a partition identifier that Data Science Studio will use for the preview. For example: 2013-04-06|france.

Writing in partitioned SQL datasets

There are two main ways to write in a partitioned SQL table:

  • By executing your own SQL code, in which case you are responsible for handling the partitioning column yourself.
  • Through a recipe whose writes are “controlled” by Data Science Studio (see Recipes for partitioned datasets for more details).

When the write is controlled by DSS, the following is automatically done for you:

  • Creating the table if it’s not yet created
  • Checking that the schema of the table is still valid
  • Dropping pre-existing records corresponding to the partition being written (to ensure idempotence)
  • Setting the output partition in all records being inserted

You do not need to make sure that the partitioning column appears in the records being inserted: DSS always fills it with the value of the partition being written. However, the partitioning column must appear in the output schema. (This is automatically done for you if you create the managed dataset by copying the partitioning from another dataset.)
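The controlled-write behavior can be sketched like this. It is a simplified illustration using sqlite3; the table and column names are hypothetical, and DSS’s actual implementation differs:

```python
import sqlite3

def write_partition(conn, table, partition_col, partition_value, records):
    """Sketch of a DSS-style controlled write: create the table if needed,
    drop the partition's pre-existing records (idempotence), then insert
    the new records, forcing the partitioning column's value."""
    cur = conn.cursor()
    # Create the table if it does not exist yet
    cur.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                f"(event_type TEXT, {partition_col} TEXT)")
    # Drop pre-existing records of the partition being written
    cur.execute(f"DELETE FROM {table} WHERE {partition_col} = ?",
                (partition_value,))
    # Insert, always setting the partitioning column ourselves
    cur.executemany(
        f"INSERT INTO {table} (event_type, {partition_col}) VALUES (?, ?)",
        [(r, partition_value) for r in records])
    conn.commit()

conn = sqlite3.connect(":memory:")
write_partition(conn, "events", "day", "2013-04-06", ["click", "view"])
# Rebuilding the same partition does not duplicate records
write_partition(conn, "events", "day", "2013-04-06", ["click"])
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # -> 1
```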

SQL datasets and time partitioning

When using time-based partitioning on a SQL table, the partitioning column must NOT be of Date type. Instead, it must be of “string” type, and contain values compatible with the partition identifier syntax of Data Science Studio, that is:

  • 2013-02-28-14 for hour partitioning
  • 2013-02-28 for day partitioning
  • 2013-02 for month partitioning
  • 2013 for year partitioning
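Producing such string values from a timestamp can be sketched as follows (illustrative Python; the strftime patterns are chosen to match the identifier syntax above):

```python
from datetime import datetime

# strftime patterns matching the partition identifier syntax per periodicity
PARTITION_FORMATS = {
    "hour": "%Y-%m-%d-%H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
    "year": "%Y",
}

def partition_id(ts, periodicity):
    """Format a timestamp as a string partition identifier."""
    return ts.strftime(PARTITION_FORMATS[periodicity])

ts = datetime(2013, 2, 28, 14, 30)
print(partition_id(ts, "hour"))   # -> 2013-02-28-14
print(partition_id(ts, "month"))  # -> 2013-02
```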

The same applies to SQL query datasets: the substitution values are given using the partition identifier syntax.

Native partitioning and number of partitions

Most database engines do not support an arbitrary number of partitions when using their native partitioning support. For example, Vertica only supports 1024 partitions per table. This might make it impossible to keep, for example, an hour-level partitioning in the database.