Formula language

DSS includes a language to write formulas, much like a spreadsheet.

Formulas can be used:

  • In data preparation, to create new columns, filter rows or flag rows
  • More generally, to filter rows in many places of DSS:
    • In the Filtering recipe, to filter rows
    • In Machine Learning, to define the extracts to use for train and test set
    • In the Python and Javascript APIs, to obtain partial extracts from the datasets
    • In the Public API, to obtain partial extracts from the datasets
    • In the grouping, window, join and stack recipes, to perform pre and post filtering

Basic usage

Note

We also have a Howto that lists common use cases with examples

Formulas define an expression, that applies row per row.

Assuming that you have a dataset with columns N1 (numeric), N2 (numeric) and S (string) , here are a few example formulas:

  • 2 + 2
  • N1 + N2
  • min(N1, N2) # Returns the smallest of N1 and N2
  • replace(S, 'old', 'new') # Returns the value of S with ‘old’ replaced by ‘new’
  • if (N1 > N2, 'big', 'small') # Returns big is N1 > N2, small otherwise

Reading column values

In almost all formulas, you’ll need to read the values of the columns for the current row.

When the column has a “simple” name (i.e: starting by a letter, contains only letters, numbers and underscores), you just need to use the name of the column in the fomula: N1 + 4

For other cases, you can use:

  • strval("column with spaces") : returns the value of “column with spaces” as a string
  • numval("column with spaces") : returns the value of “column with spaces” as a number

Variables typing

Variables in the formula language can have one of the following types: string, integer, decimal, array, object, boolean, date

Regardles of the schema of the dataset, or the meanings of the columns (in the case of working on dynamically-generated columns), the following rules are applied:

  • If column values are “standard” decimal, they are automatically parsed to decimal
  • If column values are “standard” integer, they are automatically parsed to integer
  • Else, they are kept as string

Columns containing dates, arrays, objects or booleans are not automatically converted to the matching Formula type. However:

  • all functions that require an array will automatically attempt to convert a string input to array (using the regular DSS JSON syntax)
  • all functions that require a date will automatically attempt to convert a string input to date (using the ISO-8601 format)

In other words, if the column “begin_date” is a date (and thus contains properly-formatted ISO-8601):

  • type(begin_date) returns “string”
  • inc(begin_date, 2, "days") works as expected because the inc function performed automatic conversion to date
  • To manually obtain an array or object, use the parseJson function
  • To manually obtain a date, use the asDate function

Operators

The formula language supports the classical arithmetic operators:

  • Regular math operators: +, -, *, /
  • Comparison operators (evaluate to booleans): >, >=, <, <=, ==, !=
  • Arithmetic operators: // (integer division) and % (modulo)
  • Boolean operators: &&, ||
  • The + operator also performs string concatenation.
  • The comparison operators (and only them) can operate on dates. For dates arithmetic, see the diff and inc functions.

Array and object operations

Formula support accessing array elements and object keys using the traditional Python/Javascript syntax:

  • array[0]
  • object["key"]
  • object.key (only valid if ‘key’ is a valid identifier, i.e. matches [A-Za-z0-9_]*)

Note: this requires that you actually have an array or object. You might need to use the parseJson function (see above paragraph about typing)

Object notations

For all functions, you can use them the “regular” way: replace(str, 'a', 'b') or in the “object” way: str.replace('a', 'b')

In object notation, the first argument to the function is replaced by its ‘context’.

For example, the two syntaxes are equivalent:

length(trim(replace(foo, 'a', 'b')))

foo.replace('a', 'b').trim().length()

Array functions

arrayContains(array a, item) boolean

Returns whether the array a contains item

arrayDedup(array a) array

Returns array a with duplicates removed

arrayIndexOf(array a, item) int

the index (starting from 0) of item in the array a. Return -1 if item is not in a

arrayLen(array a) int

Returns the length of the array a

arrayReverse(array a) array

Reverses array a

arraySort(array a) array

Sorts array a

get(o, number or string from, optional number to) Depends on actual arguments

If o has fields, returns the field named ‘from’ of o. If o is an array, returns o[from, to]. if o is a string, returns o.substring(from, to)

join(array a, string sep) string

Returns the string obtained by joining the elements of array a with the separator sep

For example, the expression join(date_elements, '-') on a column date_elements produces:

date_elements output
[2007, 7, 15] 2007-7-15
[2016, 1, 8] 2016-1-8

objectKeys(object o) array

Returns the keys of an object as an array

objectValues(object o) array

Returns the keys of an object as an array

slice(o, number from, optional number to) Depends on actual arguments

If o is an array, returns o[from, to]. If o is a string, returns o.substring(from, to)

substring(o, number from, optional number to) Depends on actual arguments

If o is an array, returns o[from, to]. If o is a string, returns o.substring(from, to)

Boolean functions

and(boolean a, boolean b) boolean

Returns a AND b

not(boolean b) boolean

Returns the opposite of b

or(boolean a, boolean b) boolean

Returns a OR b

Date functions

asDate(o, format1, format2, ... (all optional)) date

Returns o converted to a date. If you don’t give a format, ISO-8601 is used. You can give an ordered list of possible formats using this syntax: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

datePart(date d, string part) number

Returns part of a date. The available parts are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’), ‘weekday’ (the name of the day of the week, capitalized), ‘time’ (number of milliseconds since epoch)

Returned parts are always in local timezone and english

diff(o1, o2, optional string unit) number

Returns the difference between two dates in given time units. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

inc(date d, number value, string unit) date

Returns a date incremented or decremented by the given amount in the given unit of time. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

now() date

Returns the current time

trunc(date d, string unit) date

Truncates a date to a given unit. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

Math functions

abs(number d) number

Returns the absolute value of a number

acos(number d) number

Returns the arc cosine of an angle, in the range 0 through PI

asin(number d) number

Returns the arc sine of an angle in the range of -PI/2 through PI/2

atan(number d) number

Returns the arc tangent of an angle in the range of -PI/2 through PI/2

atan2(number x, number y) number theta

Converts rectangular coordinates (x, y) to polar (r, theta)

ceil(number d) number

Returns the ceiling of a number

combin(number n, number k) number

Returns the number of combinations for n elements as divided into k

cos(number d) number

Returns the trigonometric cosine of an angle

cosh(number d) number

Returns the hyperbolic cosine of a value

dec2hex(long) string

Returns an hexadecimal representation of the input number

degrees(number d) number

Converts an angle from radians to degrees.

even(number d) number

Rounds the number up to the nearest even integer

exp(number n) number

Returns the exponential of a number

fact(number i) number

Returns the factorial of a number

factn(number i) number

Returns the factorial of a number

floor(number d) number

Returns the floor of a number

gcd(number d, number e) number

Returns the greatest common denominator of the two numbers

hash(string) long

Returns a 64 bits numerical hash of the input (not crypto-secure)

hex2dec(string) long

Returns a decimal representation of an hexadecimal string

lcm(number d, number e) number

Returns the greatest common denominator of the two numbers

ln(number n) number

Returns the natural log of a number

log(number n) number

Returns the base 10 log of a number

max(number a, number b) number

Returns the greater of two numbers

min(number a, number b) number

Returns the smaller of two numbers

mod(number a, number b) number

Returns a modulus b

multinomial(number d1, number d2 ...) number

Calculates the multinomial of a series of numbers

odd(number d) number

Rounds the number up to the nearest odd integer

pow(number a, number b) number

Returns a to the power of b

quotient(number numerator, number denominator) number

Returns the integer portion of a division

radians(number d) number

Converts an angle in degrees to radians

rand(optional int min, optional int max) double or int

Without arguments, returns a random float between 0 and 1. With min and max arguments, returns a random integer between min (inclusive) and max (exclusive)

round(number n) number

Returns the rounding of number to the nearest integer

sin(number d) number

Returns the trigonometric sine of an angle

sinh(number d) number

Returns the hyperbolic sine of an angle

sum(array a) number

Sums the numbers of an array. Skips non-number elements from the array.

tan(number d) number

Returns the trigonometric tangent of an angle

tanh(number d) number

Returns the hyperbolic tangent of a value

toNumber(o) number

Returns o converted to a number

Object functions

hasField(o, string name) boolean

Returns whether o has field name

htmlAttr(Element e, String s) string

Selects a value from an attribute on an Html Element

htmlText(Element e) string

Selects the text from within an element (including all child elements)

innerHtml(Element e) string

The innerHtml of an HTML element

jsonize(value) JSON literal value

Quotes a value as a JSON literal value

objectDel(key, [key...]) object

Removes one or several keys from an object and returns it. The keys must not be null

objectNew(k1, v1, k2, v2, ...) object

Creates a new object, optionally pre-filled with key/values. Must get an even number of arguments, as successive key-value pairs. Giving 0 arguments is possible and will return an empty object (you can use objectPut to add to it)

objectPut(object o, key, value) object

Adds a key/value pair to an object and returns it. key must not be null

ownText(Element e) string

Gets the text owned by this HTML element only; does not get the combined text of all children.

parseHtml(string s) HTML object

Parses a string as HTML

parseJson(string s) object or array

Parses a JSON string as an object or array

select(Element e, String s) HTML Elements

Selects an element from an HTML elementn using selector syntax

type(object o) string

Returns the type of o

String functions

chomp(string str, string tail) string

Removes tail from the end of str if it’s there, otherwise leave it alone.

contains(string s, string frag) boolean

Returns whether s contains frag

endsWith(string s, string sub) boolean

Returns whether s ends with sub

escape(string s, string mode) string

Escapes a string depending on the given escaping mode. Supported modes: ‘html’,’xml’,’csv’,’url’,’javascript’

format(string format, object... args) string

Formats a string using printf-like formatting. For example '%4d-%02d'.format(2004,2) gives '2004-02'.

This uses the Java Formatter syntax

fromBase64(string s, optional string charset) string

Returns the string whose Base64 representation is given. By default, the string is read using the UTF-8 charset.

indexOf(string s, string sub) number

Returns the index of the first ocurrence of sub in s. Returns -1 if there is no such occurrence.

lastIndexOf(string s, string sub) number

Returns the index of the last ocurrence of sub in s. Returns -1 if there is no such occurrence.

length(array or string o) number

Returns the length of o

match(string, string or regexp) array of strings

Returns an array of the groups matching the given regular expression

For example, match('hello world', 'he(.*).*(rl)d') returns ['ll', 'rl']

md5(string s) string

Returns the MD5 hash of a string

partition(string s, string or regex frag, optional boolean omitFragment) array

Returns an array of strings [a,frag,b] where a is the part before the first occurrence of frag in s and b is the part after the occurrence. If omitFragment is true, frag is not returned in the array.

replace(string s, string or regex f, string replacement) string

Replaces all occurrences of a substring or regex in a string

replaceChars(string s, string f, string r) string

Returns the string obtained by replacing all chars in f with the char in s at that same position

For example, replaceChars('abcba', 'bc', 'yz') returns ayzya The function can be used to delete characters, by replacing them by nothing, like in: replaceChars('abcba', 'bc', 'y') returns ayya

rpartition(string s, string or regex frag, optional boolean omitFragment) array

Returns an array of strings [a,frag,b] where a is the part before the last occurrence of frag in s and b is the part after the occurrence. If omitFragment is true, frag is not returned in the array.

sha1(string s) string

Returns the SHA-1 hash of a string

split(string s, string or regex sep, optional boolean preserveAllTokens) array

Returns the array of strings obtained by splitting s with separator sep. If preserveAllTokens is true, then empty segments are preserved.

splitByCharType(string s) array

Returns an array of strings obtained by splitting s grouping consecutive chars by their unicode type

splitByLengths(string s, number length1, ...) array

Returns the array of strings obtained by splitting a string into substrings with the given lengths

For example, splitByLengths('abcdefgh', 2, 3, 1) returns ['ab','cde','f'] The function generates one element in the array for each length passed as parameter. If the input string is not long enough, the last elements are empty.

startsWith(string s, string sub) boolean

Returns whether s starts with sub

strip(string s) string

Returns copy of the string, with leading and trailing whitespace omitted.

toBase64(string s, optional string charset) string

Returns the Base64 representation of a string. By default, the string is written using the UTF-8 charset.

toLowercase(string s) string

Converts a string to lowercase

toString(o, string format (optional)) string

Returns o converted to a string

toTitlecase(string s) string

Converts a string to titlecase

toUppercase(string s) string

Converts a string to uppercase

trim(string s) string

Returns copy of the string, with leading and trailing whitespace omitted.

unescape(string s, string mode) string

Unescapes all escaped parts of the string depending on the given escaping mode. Available modes: ‘html’,’xml’,’csv’,’url’,’javascript’

unicode(string s) string

Returns the input string as an array of the unicode codepoints (numbers)

unicodeType(string s) string

Returns an array of strings describing each character of the input string in their full unicode notation

Value access functions

numval(object o) number

Returns the numerical value of a given cell or expression. If the value of the cell or expression is not numerical, an empty value is returned.

For example, with columns type (with values “high” and “low”), cost_high and cost_low, the expression numval('cost_' + type) produces:

cost_low cost_high type output
200 500 low 200.0
100 400 high 400.0
xx 400 low  

strval(object o, optional string defaultValue) string

Returns the numerical value of a given cell or expression, or the default value if the value of the cell or expression is empty.

For example, with columns type (with values “high” and “low”), code_high and code_low, the expression strval('code_' + type, 'N/A') produces:

code_low code_high type output
200 500 low 200
100 4x0 high 4x0
  400 low N/A

Control structures

Control structures allow you to perform advanced operations.

Beware: control structures cannot use object notation !

filter(array a, variable v, expression e) (returns: array)

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.

Note

filter(myarray, v, v < 2) is equivalent to this Python syntax: [v for v in myarray if v < 2] and returns the array with only the elements below 2

forEeach(array a, variable v, expression e) (returns: array)

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.

Note

forEach(myarray, v, v + 2) is equivalent to this Python syntax: [v+2 for v in myarray]

forEachIndex(array a, variable i, variable v, expression e) (returns: array)

Evaluates expression a to an array. Then for each array element, binds its value to variable name v and its index to variable name i, evaluates expression e, and pushes the result onto the result array.

Note

forEachIndex(myarray, i, v, v + i) is equivalent to this Javascript syntax: myarray.map(function(v, i) { return v+i ;})

forRange(from, to, step, variable v, expression e) (returns: array)

Iterates, tarting at from, incrementing by step each time while less than to. At each iteration, binds the variable v to the iteration value, evaluates expression e, and pushes the result onto the result array.

Note

forRange(0, 100, 3, v, v * 2) is equivalent to this Python syntax: [v * 2 for v in xrange(0, 100, 3)] and returns [0, 6, 12, 18 .... 198]

if(boolean, expression_true, expression_false) (returns: anything)

Evaluates to expression_true if the condition is true, to expression_false otherwise

objectFilter(expression a, variable k, variable v, expression test) object

Evaluates expression a to an object. Then for each element (k, v) of this object, binds its key to variable name k, its value to variable name v, evaluates expression test which should return a boolean. If the boolean is true, pushes (k, v) onto the result object.

with(expression o, variable v, expression e) (returns: any)

Evaluates expression o and binds its value to variable name v. Then evaluates expression e and returns that result.

The with control allows you to “split” a very big expression into more manageable chunks. It also makes reusing the result of a complex computation easier to read and faster to process.

Tests

Beware: tests cannot use object notation

isBlank(expression o) boolean

Returns whether o is null or an empty string

isError(expression o) boolean

Returns whether o is an error

isNonBlank(expression o) boolean

Returns whether o is not null and not an empty string

isNotNull(expression o) boolean

Returns whether o is not null

isNull(expression o) boolean

Returns whether o is null

isNumeric(expression o) boolean

Returns whether o can represent a number