Reads a dataset from Dataiku's Data Science Studio

dkuReadDataset(name, partitions = NULL, samplingMethod = c("full", "fixed",
  "head", "ratio"), columns = NULL, nbRows = NULL, ratio = NULL,
  convertEmptyStrings = TRUE, colClasses = NA,
  inferColClassesFromData = TRUE, na.strings = "NA")

Arguments

name

name of dataset

partitions

character vector of partitions to load

samplingMethod

the sampling method to use: one of "full", "fixed", "head", or "ratio"

columns

a character vector of columns to read from dataset

nbRows

An integer. The number of rows to sample (used with fixed and head sampling).

ratio

A numeric. The probability used for sampling each row. 0 < ratio < 1.

convertEmptyStrings

Whether to convert empty strings to NAs

colClasses

Manually-specified column classes. Default is to infer from dataset schema.

inferColClassesFromData

If colClasses is not specified, infer column classes from data instead of dataset schema.

na.strings

Optional list of strings to convert to NAs. Default is "NA".
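
The colClasses and na.strings arguments share their names with arguments of base R's read.csv. The sketch below shows how those base R arguments behave on a small inline CSV; it does not call dkuReadDataset, and the assumption that dkuReadDataset treats these arguments the same way is not confirmed by this page. The sample data and the "missing" marker are made up for illustration.

```r
# Illustration with base R only; no Dataiku connection is needed.
# The CSV content below is invented for this example.
csv_text <- "id,label,score\n1,a,3.5\n2,,NA\n3,missing,4.0"

df <- read.csv(
  text = csv_text,
  # One class per column, in column order
  colClasses = c("integer", "character", "numeric"),
  # Strings that should be read as NA; "" also turns empty strings into NA
  na.strings = c("NA", "", "missing"),
  stringsAsFactors = FALSE
)

str(df)
```

Here the empty label in row 2 and the "missing" label in row 3 both become NA, as does the "NA" score in row 2.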

Value

A data.frame with the requested data

Details

Users can specify which partitions and columns to load, as well as a sampling scheme if the dataset is too large to fit into memory. Three sampling schemes are available: fixed sampling, where nbRows rows are chosen at random from the dataset; head sampling, where the first nbRows rows are read; and ratio sampling, where each row is included with probability ratio.
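
To make the difference between the sampling schemes concrete, here is a minimal sketch that mimics them on a local data.frame using base R only. The helper sample_rows is hypothetical and is not part of the Dataiku R API. Sampling without replacement and independent per-row inclusion are assumptions made for this illustration, not documented behavior of dkuReadDataset.

```r
# Hypothetical helper: mimics the sampling schemes on an in-memory data.frame.
sample_rows <- function(df, method = c("full", "fixed", "head", "ratio"),
                        nbRows = NULL, ratio = NULL) {
  method <- match.arg(method)
  n <- nrow(df)
  switch(method,
    # full: return every row
    full  = df,
    # fixed: nbRows rows chosen at random (assumed: without replacement)
    fixed = df[sort(sample.int(n, min(nbRows, n))), , drop = FALSE],
    # head: the first nbRows rows, in their original order
    head  = utils::head(df, nbRows),
    # ratio: keep each row with probability `ratio` (assumed: independently)
    ratio = df[runif(n) < ratio, , drop = FALSE]
  )
}

set.seed(42)  # make the random draws reproducible
fixed_sample <- sample_rows(iris, "fixed", nbRows = 100)
head_sample  <- sample_rows(iris, "head",  nbRows = 100)
ratio_sample <- sample_rows(iris, "ratio", ratio = 0.3)
```

Fixed and head sampling always return exactly nbRows rows (when the data has at least that many), while the size of a ratio sample varies from run to run around ratio times the number of rows.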

Examples

# NOT RUN {
d = dkuReadDataset("iris")

# read in two columns
d = dkuReadDataset("iris", columns=c("Sepal.Length", "Sepal.Width"))

# explicitly set colClasses
d = dkuReadDataset("iris", colClasses=c("numeric", "numeric", "numeric", "numeric", "character"))

# fixed sampling -- read 100 random rows from the iris dataset
d = dkuReadDataset("iris", samplingMethod="fixed", nbRows=100)

# head sampling -- read the first 100 rows from the iris dataset
d = dkuReadDataset("iris", samplingMethod="head", nbRows=100)

# ratio sampling -- read 30% of the rows (chosen randomly) from the iris dataset
d = dkuReadDataset("iris", samplingMethod="ratio", ratio=0.3)
# }