Reads a dataset from Dataiku's Data Science Studio
dkuReadDataset(name, partitions = NULL, samplingMethod = c("full", "fixed", "head", "ratio"), columns = NULL, nbRows = NULL, ratio = NULL, convertEmptyStrings = TRUE, colClasses = NA, inferColClassesFromData = TRUE, na.strings = "NA")
Argument | Description |
---|---|
name | name of dataset |
partitions | character vector of partitions to load |
samplingMethod | the sampling method to use, if necessary: one of "full", "fixed", "head", or "ratio" |
columns | a character vector of columns to read from dataset |
nbRows | An integer. The number of rows used for sampling |
ratio | A numeric. The probability used for sampling each row. 0 < ratio < 1. |
convertEmptyStrings | Whether to convert empty strings to NAs |
colClasses | Manually-specified column classes. Default is to infer from dataset schema. |
inferColClassesFromData | If colClasses is not specified, infer column classes from data instead of dataset schema. |
na.strings | Optional list of strings to convert to NAs. Default is "NA". |
A data.frame with the requested data
Users can specify which partitions and columns to load, as well as a sampling scheme if the dataset is too large to fit into memory. Possible sampling schemes are fixed sampling, where a set number of rows (nbRows) is randomly chosen from the dataset; head sampling, where the first nbRows rows are taken from the dataset; and ratio sampling, where each row is included at random with the probability given by ratio.
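To make the difference between the three sampling schemes concrete, here is a minimal local sketch in base R. It does not call dkuReadDataset or touch a DSS instance; the helpers sample_fixed, sample_head, and sample_ratio are hypothetical names invented for this illustration, and they only mimic the behaviour described above on a data.frame that is already in memory. Whether the server-side fixed sampling draws without replacement is an assumption.

```r
# Hypothetical helpers, NOT part of the dataiku package. They imitate the
# documented sampling schemes on an in-memory data.frame.

# fixed sampling: nbRows rows chosen at random (assumed without replacement)
sample_fixed <- function(df, nbRows) {
  idx <- sample.int(nrow(df), size = min(nbRows, nrow(df)))
  df[idx, , drop = FALSE]
}

# head sampling: the first nbRows rows, in their original order
sample_head <- function(df, nbRows) {
  head(df, nbRows)
}

# ratio sampling: each row is kept with probability `ratio`, 0 < ratio < 1
sample_ratio <- function(df, ratio) {
  stopifnot(ratio > 0, ratio < 1)
  keep <- runif(nrow(df)) < ratio
  df[keep, , drop = FALSE]
}

set.seed(42)  # only to make the random draws repeatable
fixed_rows <- sample_fixed(iris, 100)
head_rows  <- sample_head(iris, 100)
ratio_rows <- sample_ratio(iris, 0.3)

nrow(fixed_rows)  # exactly 100 rows, in random order
nrow(head_rows)   # exactly 100 rows, the first 100 of iris
nrow(ratio_rows)  # varies from run to run; about 30% of 150 on average
```

Note that fixed and head sampling return an exact row count, whereas ratio sampling returns a row count that fluctuates around ratio times the dataset size.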
# NOT RUN {
d = dkuReadDataset("iris")

# read in two columns
d = dkuReadDataset("iris", columns=c("Sepal.Length", "Sepal.Width"))

# explicitly set colClasses
d = dkuReadDataset("iris", colClasses=c("numeric", "numeric", "numeric", "numeric", "character"))

# fixed sampling -- read 100 random rows from the iris dataset
d = dkuReadDataset("iris", samplingMethod="fixed", nbRows=100)

# head sampling -- read the first 100 rows from the iris dataset
d = dkuReadDataset("iris", samplingMethod="head", nbRows=100)

# ratio sampling -- read 30% of the rows (chosen randomly) from the iris dataset
d = dkuReadDataset("iris", samplingMethod="ratio", ratio=0.3)
# }