Formula language¶

DSS includes a language to write formulas, much like a spreadsheet.

Formulas can be used:

• In data preparation, to create new columns, filter rows or flag rows
• More generally, to filter rows in many places of DSS:
• In the Filtering recipe, to filter rows
• In Machine Learning, to define the extracts to use for train and test set
• In the Python and Javascript APIs, to obtain partial extracts from the datasets
• In the Public API, to obtain partial extracts from the datasets
• In the grouping, window, join and stack recipes, to perform pre and post filtering

Basic usage¶

Note

We also have a Howto that lists common use cases with examples

Formulas define an expression, that applies row per row.

Assuming that you have a dataset with columns N1 (numeric), N2 (numeric) and S (string) , here are a few example formulas:

• `2 + 2`
• `N1 + N2`
• `min(N1, N2)` # Returns the smallest of N1 and N2
• `replace(S, 'old', 'new')` # Returns the value of S with ‘old’ replaced by ‘new’
• `if (N1 > N2, 'big', 'small')` # Returns big is N1 > N2, small otherwise

In almost all formulas, you’ll need to read the values of the columns for the current row.

When the column has a “simple” name (i.e: starting by a letter, contains only letters, numbers and underscores), you just need to use the name of the column in the fomula: `N1 + 4`

For other cases, you can use:

• `strval("column with spaces")` : returns the value of “column with spaces” as a string
• `numval("column with spaces")` : returns the value of “column with spaces” as a number

Warning

`mycolumn` evaluates to the value of the column named mycolumn

Thus, `strval(mycolumn)` would not work because it would try to read as a string the value of the column whose name is the value of the mycolumn column.

Instead, use `strval("mycolumn")`.

You may also need to use `strval` to avoid DSS automatically typing a numerical variable. See below for more details.

Variables typing and autotyping¶

Variables in the formula language can have one of the following types: string, integer, decimal, array, object, boolean, date

Regardles of the schema of the dataset, or the meanings of the columns (in the case of working on dynamically-generated columns), the following rules are applied:

• If column values are “standard” decimal, they are automatically parsed to decimal
• If column values are “standard” integer, they are automatically parsed to integer
• Else, they are kept as string

Columns containing dates, arrays, objects or booleans are not automatically converted to the matching Formula type. However:

• all functions that require an array will automatically attempt to convert a string input to array (using the regular DSS JSON syntax)
• all functions that require a date will automatically attempt to convert a string input to date (using the ISO-8601 format)

In other words, if the column “begin_date” is a date (and thus contains properly-formatted ISO-8601):

• `type(begin_date)` returns “string”
• `inc(begin_date, 2, "days")` works as expected because the inc function performed automatic conversion to date

Avoiding auto-typing¶

There are some cases where auto-typing is not a good thing. For example, a value like `012345` would be automatically converted to the number `12345`.

• To avoid autotyping to numerical, use the `strval("mycolumn")` function (see above)
• To manually obtain an array or object, use the `parseJson` function
• To manually obtain a date, use the `asDate` function

Boolean values¶

The formula language uses “True” and “False”, with quotes, as boolean values for true and false.

Operators¶

The formula language supports the classical arithmetic operators:

• Regular math operators: `+`, `-`, `*`, `/`
• Comparison operators (evaluate to booleans): `>`, `>=`, `<`, `<=`, `==`, `!=`
• Arithmetic operators: `//` (integer division) and `%` (modulo)
• Boolean operators: `&&`, `||`
• The `+` operator also performs string concatenation.
• The comparison operators (and only them) can operate on dates. For dates arithmetic, see the `diff` and `inc` functions.

Array and object operations¶

Formula support accessing array elements and object keys using the traditional Python/Javascript syntax:

• `array`
• `object["key"]`
• `object.key` (only valid if ‘key’ is a valid identifier, i.e. matches `[A-Za-z0-9_]*`)

Note: this requires that you actually have an array or object. You might need to use the `parseJson` function (see above paragraph about typing)

Object notations¶

For all functions, you can use them the “regular” way: `replace(str, 'a', 'b')` or in the “object” way: `str.replace('a', 'b')`

In object notation, the first argument to the function is replaced by its ‘context’.

For example, the two syntaxes are equivalent:

```length(trim(replace(foo, 'a', 'b')))

foo.replace('a', 'b').trim().length()
```

Array functions¶

arrayContains(array a, item) boolean¶

Returns whether the array `a` contains `item`

arrayDedup(array a) array¶

Returns array a with duplicates removed

arrayIndexOf(array a, item) int¶

the index (starting from 0) of `item` in the array `a`. Return -1 if `item` is not in `a`

arrayLen(array a) int¶

Returns the length of the array `a`

Reverses array a

Sorts array a

get(o, number or string from, optional number to) Depends on actual arguments¶

If o has fields, returns the field named ‘from’ of o. If o is an array, returns o[from, to]. if o is a string, returns o.substring(from, to)

join(array a, string sep) string¶

Returns the string obtained by joining the elements of array `a` with the separator `sep`

For example, the expression `join(date_elements, '-')` on a column `date_elements` produces:

date_elements output
[2007, 7, 15] 2007-7-15
[2016, 1, 8] 2016-1-8

objectKeys(object o) array¶

Returns the keys of an object as an array

objectValues(object o) array¶

Returns the keys of an object as an array

slice(o, number from, optional number to) Depends on actual arguments¶

If `o` is an array, returns `o[from, to]`. If `o` is a string, returns `o.substring(from, to)`

substring(o, number from, optional number to) Depends on actual arguments¶

If `o` is an array, returns `o[from, to]`. If `o` is a string, returns `o.substring(from, to)`

Boolean functions¶

and(boolean a, boolean b) boolean¶

Returns `a AND b`

not(boolean b) boolean¶

Returns the opposite of `b`

or(boolean a, boolean b) boolean¶

Returns `a OR b`

Date functions¶

asDate(o, format1, format2, … (all optional)) date¶

Returns o converted to a date. If you don’t give a format, ISO-8601 is used. You can give an ordered list of possible formats using this syntax: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

datePart(date d, string part) number¶

Returns part of a date. The available parts are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’), ‘weekday’ (the name of the day of the week, capitalized), ‘time’ (number of milliseconds since epoch)

Returned parts are always in local timezone and english

diff(o1, o2, optional string unit) number¶

Returns the difference between two dates in given time units. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

inc(date d, number value, string unit) date¶

Returns a date incremented or decremented by the given amount in the given unit of time. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

now() date¶

Returns the current time

trunc(date d, string unit) date¶

Truncates a date to a given unit. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

Math functions¶

abs(number d) number¶

Returns the absolute value of a number

acos(number d) number¶

Returns the arc cosine of an angle, in the range 0 through PI

asin(number d) number¶

Returns the arc sine of an angle in the range of -PI/2 through PI/2

atan(number d) number¶

Returns the arc tangent of an angle in the range of -PI/2 through PI/2

atan2(number x, number y) number theta¶

Converts rectangular coordinates (x, y) to polar (r, theta)

ceil(number d) number¶

Returns the ceiling of a number

combin(number n, number k) number¶

Returns the number of combinations for `n` elements as divided into `k`

cos(number d) number¶

Returns the trigonometric cosine of an angle

cosh(number d) number¶

Returns the hyperbolic cosine of a value

dec2hex(long) string¶

Returns an hexadecimal representation of the input number

degrees(number d) number¶

Converts an angle from radians to degrees.

even(number d) number¶

Rounds the number up to the nearest even integer

exp(number n) number¶

Returns the exponential of a number

fact(number i) number¶

Returns the factorial of a number

factn(number i) number¶

Returns the factorial of a number

floor(number d) number¶

Returns the floor of a number

gcd(number d, number e) number¶

Returns the greatest common denominator of the two numbers

hash(string) long¶

Returns a 64 bits numerical hash of the input (not crypto-secure)

hex2dec(string) long¶

Returns a decimal representation of an hexadecimal string

lcm(number d, number e) number¶

Returns the greatest common denominator of the two numbers

ln(number n) number¶

Returns the natural log of a number

log(number n) number¶

Returns the base 10 log of a number

max(number a, number b) number¶

Returns the greater of two numbers

min(number a, number b) number¶

Returns the smaller of two numbers

mod(number a, number b) number¶

Returns a modulus b

multinomial(number d1, number d2 …) number¶

Calculates the multinomial of a series of numbers

odd(number d) number¶

Rounds the number up to the nearest odd integer

pow(number a, number b) number¶

Returns `a` to the power of `b`

quotient(number numerator, number denominator) number¶

Returns the integer portion of a division

Converts an angle in degrees to radians

rand(optional int min, optional int max) double or int¶

Without arguments, returns a random float between 0 and 1. With `min` and `max` arguments, returns a random integer between `min` (inclusive) and `max` (exclusive)

round(number n) number¶

Returns the rounding of number to the nearest integer

sin(number d) number¶

Returns the trigonometric sine of an angle

sinh(number d) number¶

Returns the hyperbolic sine of an angle

sum(array a) number¶

Sums the numbers of an array. Skips non-number elements from the array.

tan(number d) number¶

Returns the trigonometric tangent of an angle

tanh(number d) number¶

Returns the hyperbolic tangent of a value

toNumber(o) number¶

Returns o converted to a number

Object functions¶

hasField(o, string name) boolean¶

Returns whether o has field name

htmlAttr(Element e, String s) string¶

Selects a value from an attribute on an Html Element

htmlText(Element e) string¶

Selects the text from within an element (including all child elements)

innerHtml(Element e) string¶

The innerHtml of an HTML element

jsonize(value) JSON literal value¶

Quotes a value as a JSON literal value

objectDel(key, [key…]) object¶

Removes one or several keys from an object and returns it. The keys must not be null

objectNew(k1, v1, k2, v2, …) object¶

Creates a new object, optionally pre-filled with key/values. Must get an even number of arguments, as successive key-value pairs. Giving 0 arguments is possible and will return an empty object (you can use objectPut to add to it)

objectPut(object o, key, value) object¶

Adds a key/value pair to an object and returns it. key must not be null

ownText(Element e) string¶

Gets the text owned by this HTML element only; does not get the combined text of all children.

parseHtml(string s) HTML object¶

Parses a string as HTML

parseJson(string s) object or array¶

Parses a JSON string as an object or array

select(Element e, String s) HTML Elements¶

Selects an element from an HTML elementn using selector syntax

type(object o) string¶

Returns the type of o

String functions¶

coalesce(value1, value2, …) anything¶

Returns the first non-empty value.

chomp(string str, string tail) string¶

Removes `tail` from the end of `str` if it’s there, otherwise leave it alone.

contains(string s, string frag) boolean¶

Returns whether `s` contains `frag`

endsWith(string s, string sub) boolean¶

Returns whether `s` ends with `sub`

escape(string s, string mode) string¶

Escapes a string depending on the given escaping mode. Supported modes: ‘html’,’xml’,’csv’,’url’,’javascript’

format(string format, object… args) string¶

Formats a string using printf-like formatting. For example `'%4d-%02d'.format(2004,2)` gives `'2004-02'`.

This uses the Java Formatter syntax

fromBase64(string s, optional string charset) string¶

Returns the string whose Base64 representation is given. By default, the string is read using the UTF-8 charset.

indexOf(string s, string sub) number¶

Returns the index of the first ocurrence of `sub` in `s`. Returns -1 if there is no such occurrence.

lastIndexOf(string s, string sub) number¶

Returns the index of the last ocurrence of `sub` in `s`. Returns -1 if there is no such occurrence.

length(array or string o) number¶

Returns the length of o

match(string, string or regexp) array of strings¶

Returns an array of the groups matching the given regular expression

For example, `match('hello world', 'he(.*).*(rl)d')` returns `['ll', 'rl']`

md5(string s) string¶

Returns the MD5 hash of a string

partition(string s, string or regex frag, optional boolean omitFragment) array¶

Returns an array of strings `[a,frag,b]` where `a` is the part before the first occurrence of `frag` in `s` and `b` is the part after the occurrence. If `omitFragment` is true, `frag` is not returned in the array.

replace(string s, string or regex f, string replacement) string¶

Replaces all occurrences of a substring or regex in a string

replaceChars(string s, string f, string r) string¶

Returns the string obtained by replacing all chars in `f` with the char in `s` at that same position

For example, `replaceChars('abcba', 'bc', 'yz')` returns `ayzya` The function can be used to delete characters, by replacing them by nothing, like in: `replaceChars('abcba', 'bc', 'y')` returns `ayya`

rpartition(string s, string or regex frag, optional boolean omitFragment) array¶

Returns an array of strings `[a,frag,b]` where `a` is the part before the last occurrence of `frag` in `s` and `b` is the part after the occurrence. If `omitFragment` is true, `frag` is not returned in the array.

sha1(string s) string¶

Returns the SHA-1 hash of a string

split(string s, string or regex sep, optional boolean preserveAllTokens) array¶

Returns the array of strings obtained by splitting `s` with separator `sep`. If `preserveAllTokens` is true, then empty segments are preserved.

splitByCharType(string s) array¶

Returns an array of strings obtained by splitting s grouping consecutive chars by their unicode type

splitByLengths(string s, number length1, …) array¶

Returns the array of strings obtained by splitting a string into substrings with the given lengths

For example, `splitByLengths('abcdefgh', 2, 3, 1)` returns `['ab','cde','f']` The function generates one element in the array for each length passed as parameter. If the input string is not long enough, the last elements are empty.

startsWith(string s, string sub) boolean¶

Returns whether `s` starts with `sub`

strip(string s) string¶

Returns copy of the string, with leading and trailing whitespace omitted.

toBase64(string s, optional string charset) string¶

Returns the Base64 representation of a string. By default, the string is written using the UTF-8 charset.

toLowercase(string s) string¶

Converts a string to lowercase

toString(o, string format (optional)) string¶

Returns o converted to a string

toTitlecase(string s) string¶

Converts a string to titlecase

toUppercase(string s) string¶

Converts a string to uppercase

trim(string s) string¶

Returns copy of the string, with leading and trailing whitespace omitted.

unescape(string s, string mode) string¶

Unescapes all escaped parts of the string depending on the given escaping mode. Available modes: ‘html’,’xml’,’csv’,’url’,’javascript’

unicode(string s) string¶

Returns the input string as an array of the unicode codepoints (numbers)

unicodeType(string s) string¶

Returns an array of strings describing each character of the input string in their full unicode notation

Value access functions¶

numval(object o) number¶

Returns the numerical value of a given cell or expression. If the value of the cell or expression is not numerical, an empty value is returned.

For example, with columns `type` (with values “high” and “low”), `cost_high` and `cost_low`, the expression `numval('cost_' + type)` produces:

cost_low cost_high type output
200 500 low 200.0
100 400 high 400.0
xx 400 low

Warning

`mycolumn` evaluates to the value of the column named mycolumn

Thus, `numval(mycolumn)` would not work because it would try to read as a numerical the value of the column whose name is the value of the mycolumn column.

Instead, use `numval("mycolumn")`.

strval(object o, optional string defaultValue) string¶

Returns the numerical value of a given cell or expression, or the default value if the value of the cell or expression is empty.

For example, with columns `type` (with values “high” and “low”), `code_high` and `code_low`, the expression `strval('code_' + type, 'N/A')` produces:

code_low code_high type output
200 500 low 200
100 4x0 high 4x0
400 low N/A

Warning

`mycolumn` evaluates to the value of the column named mycolumn

Thus, `strval(mycolumn)` would not work because it would try to read as a string the value of the column whose name is the value of the mycolumn column.

Instead, use `strval("mycolumn")`.

Control structures¶

Control structures allow you to perform advanced operations.

Beware: control structures cannot use object notation !

filter(array a, variable v, expression e) (returns: array)¶

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.

Note

`filter(myarray, v, v < 2)` is equivalent to this Python syntax: `[v for v in myarray if v < 2]` and returns the array with only the elements below 2

forEeach(array a, variable v, expression e) (returns: array)¶

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.

Note

`forEach(myarray, v, v + 2)` is equivalent to this Python syntax: `[v+2 for v in myarray]`

forEachIndex(array a, variable i, variable v, expression e) (returns: array)¶

Evaluates expression a to an array. Then for each array element, binds its value to variable name v and its index to variable name i, evaluates expression e, and pushes the result onto the result array.

Note

`forEachIndex(myarray, i, v, v + i)` is equivalent to this Javascript syntax: `myarray.map(function(v, i) { return v+i ;})`

forRange(from, to, step, variable v, expression e) (returns: array)¶

Iterates, tarting at from, incrementing by step each time while less than to. At each iteration, binds the variable v to the iteration value, evaluates expression e, and pushes the result onto the result array.

Note

`forRange(0, 100, 3, v, v * 2)` is equivalent to this Python syntax: `[v * 2 for v in xrange(0, 100, 3)]` and returns [0, 6, 12, 18 …. 198]

if(boolean, expression_true, expression_false) (returns: anything)¶

Evaluates to expression_true if the condition is true, to expression_false otherwise

objectFilter(expression a, variable k, variable v, expression test) object¶

Evaluates expression a to an object. Then for each element (k, v) of this object, binds its key to variable name k, its value to variable name v, evaluates expression test which should return a boolean. If the boolean is true, pushes (k, v) onto the result object.

with(expression o, variable v, expression e) (returns: any)¶

Evaluates expression o and binds its value to variable name v. Then evaluates expression e and returns that result.

The `with` control allows you to “split” a very big expression into more manageable chunks. It also makes reusing the result of a complex computation easier to read and faster to process.

Tests¶

Beware: tests cannot use object notation

isBlank(expression o) boolean¶

Returns whether o is null or an empty string

isError(expression o) boolean¶

Returns whether o is an error

isNonBlank(expression o) boolean¶

Returns whether o is not null and not an empty string

isNotNull(expression o) boolean¶

Returns whether o is not null

isNull(expression o) boolean¶

Returns whether o is null

isNumeric(expression o) boolean¶

Returns whether o can represent a number