The Javascript API
The Dataiku Javascript API allows you to write custom Web apps that can read from Dataiku datasets.
Fetching dataset data
dataiku.fetch(datasetName, [options, ]success, failure)
Fetches the contents of a dataset and passes them to the success callback as a DataFrame object.
- Arguments
datasetName (string) –
Name of the dataset. Can be in either of two formats
projectKey.datasetName
datasetName. In this case, the default project will be searched
options (dict) –
Options for the call. Valid keys are:
apiKey: forced API key for this dataset. By default, dataiku.apiKey is used
partitions: array of partition ids to fetch. By default, all partitions are fetched.
columns: array of column names to fetch. Only these columns are returned.
sampling: object representing sampling to apply. By default, the whole dataset is fetched (no sampling). See below for more details.
filter: formula for filtering which rows are returned.
limit: maximum number of rows to retrieve. By default, this limit is set to 20000 (for safety reasons). See below for more details on sampling.
success (function(dataframe)) – Gets called in case of success with a DataFrame object
failure (function(error)) – Gets called in case of error
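As an illustration of the call shape described above, here is a minimal sketch that fetches a few rows and reports the row count and column names. The function name `summarizeDataset` and the `onDone` callback are hypothetical, and `dataiku` is passed in as a parameter (an assumption made so the sketch can be run against a stand-in object; in a Web app it would be the object provided by the API).

```javascript
// Sketch only: `summarizeDataset` and `onDone` are illustrative names,
// not part of the Dataiku API.
function summarizeDataset(dataiku, datasetName, onDone) {
  dataiku.fetch(
    datasetName,
    { limit: 100 }, // retrieve at most 100 rows
    function (dataFrame) {
      // Success: the callback receives a DataFrame object
      onDone(null, {
        rowCount: dataFrame.getNbRows(),
        columns: dataFrame.getColumnNames()
      });
    },
    function (error) {
      // Failure: forward the error to the caller
      onDone(error, null);
    }
  );
}
```

The dataset name can be given either as `projectKey.datasetName` or as a bare `datasetName`, as described in the arguments list.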
The DataFrame object
class DataFrame()
Object representing a set of rows from a dataset.
DataFrame objects are created by dataiku.fetch.
You can interact with the rows in a DataFrame in either of two ways:
As “record” objects, which map each column name to its value
As “row” arrays, which contain one entry per column
Using row arrays requires a bit more code, including a call to getColumnIdx, but generally provides better performance.
DataFrame.getNbRows()
- Returns
the number of rows in the dataframe
DataFrame.getRow(rowIdx)
- Returns
an array representing the row with a given row idx
DataFrame.getColumnNames()
- Returns
an array of column names
DataFrame.getRows()
- Returns
an array of dataframe rows. Each element of the array is what getRow would return
DataFrame.getRecord(rowIdx)
- Returns
a record object for the row with a given row idx. The keys of the object are the names of the columns
DataFrame.getColumnValues(name)
- Arguments
name (string) – Name of the column
- Returns
an array containing all values of the column <name>
DataFrame.getColumnIdx(name)
Returns the columnIdx of the column with the given name. This idx can be used to look up values in the array returned by getRow.
Returns -1 if the column name is not found.
- Arguments
name (string) – Name of the column
- Returns
the columnIdx of the column or -1
DataFrame.mapRows(f)
Applies a function to each row
- Arguments
f (function(row)) – Function to apply to each row array
- Returns
the array [ f(row[0]), f(row[1]), … , f(row[N-1]) ]
DataFrame.mapRecords(f)
Applies a function to each record object
- Arguments
f (function(record)) – Function to apply to each record object
- Returns
the array [ f(record[0]), f(record[1]), … , f(record[N-1]) ]
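To make the difference between the two access styles concrete, the following sketch totals a numeric column once with record objects and once with row arrays. The helper names `totalWithRecords` and `totalWithRows` are hypothetical, and the `Number(...)` conversion is an assumption for cases where values arrive as strings.

```javascript
// Record-based: each record maps column names to values.
function totalWithRecords(dataFrame, columnName) {
  var values = dataFrame.mapRecords(function (record) {
    return Number(record[columnName]); // assumption: values may be strings
  });
  return values.reduce(function (sum, v) { return sum + v; }, 0);
}

// Row-based: resolve the column index once, then index into each row array.
function totalWithRows(dataFrame, columnName) {
  var idx = dataFrame.getColumnIdx(columnName);
  if (idx === -1) {
    // getColumnIdx returns -1 when the column name is not found
    throw new Error("Unknown column: " + columnName);
  }
  var values = dataFrame.mapRows(function (row) {
    return Number(row[idx]);
  });
  return values.reduce(function (sum, v) { return sum + v; }, 0);
}
```

The row-based version needs the extra index lookup and the explicit check for -1, which is the additional code mentioned in the class description.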
dataiku.setAPIKey(apiKey)
Sets the API key to use. This should generally be the first thing called.
dataiku.setDefaultProjectKey(projectKey)
Sets the “search path” for projects. This is used to resolve dataset names given as “datasetName” instead of “projectKey.datasetName”.
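A possible initialization step, sketched under the assumption that both setters are called once before any fetch. The wrapper `configureDataiku` is a hypothetical helper, and `dataiku` is passed in explicitly so the sketch can run against a stand-in object.

```javascript
// Sketch only: `configureDataiku` is an illustrative wrapper, not part of the API.
function configureDataiku(dataiku, apiKey, projectKey) {
  // Set the API key first, as recommended above
  dataiku.setAPIKey(apiKey);
  // Bare dataset names ("datasetName") will now be resolved in this project
  dataiku.setDefaultProjectKey(projectKey);
}
```

After this call, a dataset could be referred to by its short name in dataiku.fetch instead of the full projectKey.datasetName form.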
Sampling
Returning a whole dataset as a JS object is generally not possible for memory reasons. The API allows you to sample the rows of the dataset using option keys.
The sampling key contains the sampling method to use.
For more details on the sampling methods, see Sampling.
Note
The default sampling is *head(20000)*: by default, only the first 20K rows are returned
sampling = ‘head’
Returns the first rows of the dataset
/* Returns the first 15000 rows */
{
    sampling : 'head',
    limit : 15000
}
sampling = ‘random’
Returns either a number of rows, randomly picked, or a ratio of the dataset
/* Returns 10% of the dataset */
{
    sampling : 'random',
    ratio : 0.1
}
/* Returns 15000 rows, randomly sampled */
{
    sampling : 'random',
    limit : 15000
}
sampling = ‘full’
No sampling: returns all rows
sampling = ‘random-column’
Returns a number of rows based on the values of a column
/* Returns 15000 rows, randomly sampled among the values of column 'user_id' */
{
    sampling : 'random-column',
    sampling_column : 'user_id',
    limit : 15000
}
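The option objects shown in this section can also be built programmatically. The sketch below is a hypothetical helper, `samplingOptions`, that assembles the keys used in the examples above (`sampling`, `limit`, `ratio`, `sampling_column`). Rejecting a `ratio` combined with a `limit` for random sampling is an assumption based on the "either ... or" wording, not documented behavior.

```javascript
// Sketch only: `samplingOptions` and its `settings` argument are illustrative.
function samplingOptions(method, settings) {
  settings = settings || {};
  var options = { sampling: method };

  if (method === 'random-column') {
    if (!settings.column) {
      throw new Error("'random-column' sampling needs a column name");
    }
    options.sampling_column = settings.column;
  } else if (method === 'random') {
    // Assumption: ratio and limit are treated as alternatives
    if (settings.ratio !== undefined && settings.limit !== undefined) {
      throw new Error("Give either a ratio or a limit, not both");
    }
    if (settings.ratio !== undefined) {
      options.ratio = settings.ratio;
    }
  } else if (method !== 'head' && method !== 'full') {
    throw new Error("Unknown sampling method: " + method);
  }

  // 'full' means no sampling, so a row limit is attached only for other methods
  if (method !== 'full' && settings.limit !== undefined) {
    options.limit = settings.limit;
  }
  return options;
}
```

For example, `samplingOptions('random-column', { column: 'user_id', limit: 15000 })` produces the same object as the last example above, which can then be passed as the options argument of dataiku.fetch.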
Partitions selection
In the partitions key, you can pass in a JS array of partition identifiers
/* Only returns data from two partitions */
{
    partitions : ["2014-01-02", "2014-02-04"]
}
Columns selection
In the columns key, you can pass in a JS array of column names. Only these columns are returned
/* Only returns two columns from the dataset */
{
    columns : ["type", "price"]
}
Rows filtering
In the filter key, you can pass a custom formula to filter the returned rows
/* Only returns rows matching condition */
{
    filter : "type == 'event' && price > 2000"
}
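Since all of these keys belong to the same options object, they can presumably be combined in a single call. The sketch below assumes that combination works as expected; the dataset name `events` and the function `fetchExpensiveEvents` are hypothetical, and `dataiku` is passed in so the sketch can run against a stand-in object.

```javascript
// Sketch only: dataset name and function name are illustrative.
function fetchExpensiveEvents(dataiku, onSuccess, onFailure) {
  var options = {
    partitions: ["2014-01-02", "2014-02-04"],  // restrict to two partitions
    columns: ["type", "price"],                // return only these columns
    filter: "type == 'event' && price > 2000", // keep only matching rows
    sampling: 'head',                          // take the first rows...
    limit: 1000                                // ...up to 1000 of them
  };
  dataiku.fetch("events", options, onSuccess, onFailure);
  return options;
}
```

Narrowing partitions, columns, and rows in this way reduces the amount of data returned to the browser, which matters given the memory constraints described in the Sampling section.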