# Formula language¶

DSS includes a language to write formulas, much like a spreadsheet.

Formulas can be used:

• In data preparation, to create new columns, filter rows or flag rows
• More generally, to filter rows in many places of DSS:
• In the Filtering recipe, to filter rows
• In Machine Learning, to define the extracts to use for train and test set
• In the Python and Javascript APIs, to obtain partial extracts from the datasets
• In the Public API, to obtain partial extracts from the datasets
• In the grouping, window, join and stack recipes, to perform pre and post filtering

## Basic usage¶

Note

We also have a Howto <https://www.dataiku.com/learn/guide/howto/> that lists common use cases with examples

Formulas define an expression, that applies row per row.

Assuming that you have a dataset with columns N1 (numeric), N2 (numeric) and S (string) , here are a few example formulas:

• `2 + 2`
• `N1 + N2`
• `min(N1, N2)` # Returns the smallest of N1 and N2
• `replace(S, 'old', 'new')` # Returns the value of S with ‘old’ replaced by ‘new’
• `if (N1 > N2, 'big', 'small')` # Returns big is N1 > N2, small otherwise

In almost all formulas, you’ll need to read the values of the columns for the current row.

When the column has a “simple” name (i.e: starting by a letter, contains only letters, numbers and underscores), you just need to use the name of the column in the fomula: `N1 + 4`

For other cases, you can use:

• `strval("column with spaces")` : returns the value of “column with spaces” as a string
• `num("column with spaces")` : returns the value of “column with spaces” as a number

## Variables typing¶

Variables in the formula language can have one of the following types: string, integer, decimal, array, object, boolean, date

Regardles of the schema of the dataset, or the meanings of the columns (in the case of working on dynamically-generated columns), the following rules are applied:

• If column values are “standard” decimal, they are automatically parsed to decimal
• If column values are “standard” integer, they are automatically parsed to integer
• Else, they are kept as string

Columns containing dates, arrays, objects or booleans are not automatically converted to the matching Formula type. However:

• all functions that require an array will automatically attempt to convert a string input to array (using the regular DSS JSON syntax)
• all functions that require a date will automatically attempt to convert a string input to date (using the ISO-8601 format)

In other words, if the column “begin_date” is a date (and thus contains properly-formatted ISO-8601):

• `type(begin_date)` returns “string”
• `inc(begin_date, 2, "days")` works as expected because the inc function performed automatic conversion to date
• To manually obtain an array or object, use the `parseJson` function
• To manually obtain a date, use the `asDate` function

## Operators¶

The formula language supports the classical arithmetic operators:

• Regular math operators: `+, -, *, /`
• Comparison operators (evaluate to booleans): `>, >=, <, <=, ==, !=`
• Arithmetic operators: // (integer division) and ‘%’ (modulo)
• The ‘+’ operator also performs string concatenation.
• The comparison operators (and only them) can operate on dates - For date operations, see the `diff` and `inc` functions.

## Array and object operations¶

Formula support accessing array elements and object keys using the traditional Python/Javascript syntax:

• `array[0]`
• `object["key"]`
• `object.key` (only valid if ‘key’ is a valid identifier, i.e. matches `[A-Za-z0-9_]*`)

Note: this requires that you actually have an array or object. You might need to use the `parseJson` function (see above paragraph about typing)

## Object notations¶

For all functions, you can use them the “regular” way: `replace(str, 'a', 'b')` or in the “object” way: `str.replace('a', 'b')`

In object notation, the first argument to the function is replaced by its ‘context’.

For example, the two syntaxes are equivalent:

```length(trim(replace(foo, 'a', 'b')))

foo.replace('a', 'b').trim().length()
```

## Array functions¶

### arrayContains(array a, item) boolean¶

Returns whether the array `a` contains `item`

### arrayDedup(array a) array¶

Returns array a with duplicates removed

### arrayIndexOf(array a, item) int¶

the index (starting from 0) of `item` in the array `a`. Return -1 if `item` is not in `a`

### arrayLen(array a) int¶

Returns the length of the array `a`

Reverses array a

Sorts array a

### get(o, number or string from, optional number to) Depends on actual arguments¶

If o has fields, returns the field named ‘from’ of o. If o is an array, returns o[from, to]. if o is a string, returns o.substring(from, to)

### join(array a, string sep) string¶

Returns the string obtained by joining the elements of array `a` with the separator `sep`

For example, the expression `join(date_elements, '-')` on a column `date_elements` produces:

date_elements output
[2007, 7, 15] 2007-7-15
[2016, 1, 8] 2016-1-8

### objectKeys(object o) array¶

Returns the keys of an object as an array

### objectValues(object o) array¶

Returns the keys of an object as an array

### slice(o, number from, optional number to) Depends on actual arguments¶

If `o` is an array, returns `o[from, to]`. If `o` is a string, returns `o.substring(from, to)`

### substring(o, number from, optional number to) Depends on actual arguments¶

If `o` is an array, returns `o[from, to]`. If `o` is a string, returns `o.substring(from, to)`

## Boolean functions¶

### and(boolean a, boolean b) boolean¶

Returns `a AND b`

### not(boolean b) boolean¶

Returns the opposite of `b`

### or(boolean a, boolean b) boolean¶

Returns `a OR b`

## Date functions¶

### asDate(o, format1, format2, ... (all optional)) date¶

Returns o converted to a date. If you don’t give a format, ISO-8601 is used. You can give an ordered list of possible formats using this syntax: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

### datePart(date d, string part) number¶

Returns part of a date. The available parts are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’), ‘weekday’ (the name of the day of the week, capitalized), ‘time’ (number of milliseconds since epoch)

Returned parts are always in local timezone and english

### diff(o1, o2, optional string unit) number¶

Returns the difference between two dates in given time units. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

### inc(date d, number value, string unit) date¶

Returns a date incremented or decremented by the given amount in the given unit of time. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

### now() date¶

Returns the current time

### trunc(date d, string unit) date¶

Truncates a date to a given unit. The available units are: ‘hours’ (or ‘hour’ or ‘h’), ‘minutes’ (or ‘minute’ or ‘min’), ‘seconds’ (or ‘second’ or ‘s’), ‘years’ (or ‘year’), ‘months’ (or ‘month’), ‘days’ (or ‘day’ or ‘d’), ‘weeks’ (or ‘week’ or ‘w’)

## Math functions¶

### abs(number d) number¶

Returns the absolute value of a number

### acos(number d) number¶

Returns the arc cosine of an angle, in the range 0 through PI

### asin(number d) number¶

Returns the arc sine of an angle in the range of -PI/2 through PI/2

### atan(number d) number¶

Returns the arc tangent of an angle in the range of -PI/2 through PI/2

### atan2(number x, number y) number theta¶

Converts rectangular coordinates (x, y) to polar (r, theta)

### ceil(number d) number¶

Returns the ceiling of a number

### combin(number n, number k) number¶

Returns the number of combinations for `n` elements as divided into `k`

### cos(number d) number¶

Returns the trigonometric cosine of an angle

### cosh(number d) number¶

Returns the hyperbolic cosine of a value

### dec2hex(long) string¶

Returns an hexadecimal representation of the input number

### degrees(number d) number¶

Converts an angle from radians to degrees.

### even(number d) number¶

Rounds the number up to the nearest even integer

### exp(number n) number¶

Returns the exponential of a number

### fact(number i) number¶

Returns the factorial of a number

### factn(number i) number¶

Returns the factorial of a number

### floor(number d) number¶

Returns the floor of a number

### gcd(number d, number e) number¶

Returns the greatest common denominator of the two numbers

### hash(string) long¶

Returns a 64 bits numerical hash of the input (not crypto-secure)

### hex2dec(string) long¶

Returns a decimal representation of an hexadecimal string

### lcm(number d, number e) number¶

Returns the greatest common denominator of the two numbers

### ln(number n) number¶

Returns the natural log of a number

### log(number n) number¶

Returns the base 10 log of a number

### max(number a, number b) number¶

Returns the greater of two numbers

### min(number a, number b) number¶

Returns the smaller of two numbers

### mod(number a, number b) number¶

Returns a modulus b

### multinomial(number d1, number d2 ...) number¶

Calculates the multinomial of a series of numbers

### odd(number d) number¶

Rounds the number up to the nearest odd integer

### pow(number a, number b) number¶

Returns `a` to the power of `b`

### quotient(number numerator, number denominator) number¶

Returns the integer portion of a division

Converts an angle in degrees to radians

### rand(optional int min, optional int max) double or int¶

Without arguments, returns a random float between 0 and 1. With `min` and `max` arguments, returns a random integer between `min` (inclusive) and `max` (exclusive)

### round(number n) number¶

Returns the rounding of number to the nearest integer

### sin(number d) number¶

Returns the trigonometric sine of an angle

### sinh(number d) number¶

Returns the hyperbolic sine of an angle

### sum(array a) number¶

Sums the numbers of an array. Skips non-number elements from the array.

### tan(number d) number¶

Returns the trigonometric tangent of an angle

### tanh(number d) number¶

Returns the hyperbolic tangent of a value

### toNumber(o) number¶

Returns o converted to a number

## Object functions¶

### hasField(o, string name) boolean¶

Returns whether o has field name

### htmlAttr(Element e, String s) string¶

Selects a value from an attribute on an Html Element

### htmlText(Element e) string¶

Selects the text from within an element (including all child elements)

### innerHtml(Element e) string¶

The innerHtml of an HTML element

### jsonize(value) JSON literal value¶

Quotes a value as a JSON literal value

### objectDel(key, [key...]) object¶

Removes one or several keys from an object and returns it. The keys must not be null

### objectNew(k1, v1, k2, v2, ...) object¶

Creates a new object, optionally pre-filled with key/values. Must get an even number of arguments, as successive key-value pairs. Giving 0 arguments is possible and will return an empty object (you can use objectPut to add to it)

### objectPut(object o, key, value) object¶

Adds a key/value pair to an object and returns it. key must not be null

### ownText(Element e) string¶

Gets the text owned by this HTML element only; does not get the combined text of all children.

### parseHtml(string s) HTML object¶

Parses a string as HTML

### parseJson(string s) object or array¶

Parses a JSON string as an object or array

### select(Element e, String s) HTML Elements¶

Selects an element from an HTML elementn using selector syntax

### type(object o) string¶

Returns the type of o

## String functions¶

### chomp(string str, string tail) string¶

Removes `tail` from the end of `str` if it’s there, otherwise leave it alone.

### contains(string s, string frag) boolean¶

Returns whether `s` contains `frag`

### endsWith(string s, string sub) boolean¶

Returns whether `s` ends with `sub`

### escape(string s, string mode) string¶

Escapes a string depending on the given escaping mode. Supported modes: ‘html’,’xml’,’csv’,’url’,’javascript’

### fromBase64(string s, optional string charset) string¶

Returns the string whose Base64 representation is given. By default, the string is read using the UTF-8 charset.

### indexOf(string s, string sub) number¶

Returns the index of the first ocurrence of `sub` in `s`. Returns -1 if there is no such occurrence.

### lastIndexOf(string s, string sub) number¶

Returns the index of the last ocurrence of `sub` in `s`. Returns -1 if there is no such occurrence.

### length(array or string o) number¶

Returns the length of o

### match(string, string or regexp) array of strings¶

Returns an array of the groups matching the given regular expression

For example, `match('hello world', 'he(.*).*(rl)d')` returns `['ll', 'rl']`

### md5(string s) string¶

Returns the MD5 hash of a string

### partition(string s, string or regex frag, optional boolean omitFragment) array¶

Returns an array of strings `[a,frag,b]` where `a` is the part before the first occurrence of `frag` in `s` and `b` is the part after the occurrence. If `omitFragment` is true, `frag` is not returned in the array.

### replace(string s, string or regex f, string replacement) string¶

Replaces all occurrences of a substring or regex in a string

### replaceChars(string s, string f, string r) string¶

Returns the string obtained by replacing all chars in `f` with the char in `s` at that same position

For example, `replaceChars('abcba', 'bc', 'yz')` returns `ayzya` The function can be used to delete characters, by replacing them by nothing, like in: `replaceChars('abcba', 'bc', 'y')` returns `ayya`

### rpartition(string s, string or regex frag, optional boolean omitFragment) array¶

Returns an array of strings `[a,frag,b]` where `a` is the part before the last occurrence of `frag` in `s` and `b` is the part after the occurrence. If `omitFragment` is true, `frag` is not returned in the array.

### sha1(string s) string¶

Returns the SHA-1 hash of a string

### split(string s, string or regex sep, optional boolean preserveAllTokens) array¶

Returns the array of strings obtained by splitting `s` with separator `sep`. If `preserveAllTokens` is true, then empty segments are preserved.

### splitByCharType(string s) array¶

Returns an array of strings obtained by splitting s grouping consecutive chars by their unicode type

### splitByLengths(string s, number length1, ...) array¶

Returns the array of strings obtained by splitting a string into substrings with the given lengths

For example, `splitByLengths('abcdefgh', 2, 3, 1)` returns `['ab','cde','f']` The function generates one element in the array for each length passed as parameter. If the input string is not long enough, the last elements are empty.

### startsWith(string s, string sub) boolean¶

Returns whether `s` starts with `sub`

### strip(string s) string¶

Returns copy of the string, with leading and trailing whitespace omitted.

### toBase64(string s, optional string charset) string¶

Returns the Base64 representation of a string. By default, the string is written using the UTF-8 charset.

### toLowercase(string s) string¶

Converts a string to lowercase

### toString(o, string format (optional)) string¶

Returns o converted to a string

### toTitlecase(string s) string¶

Converts a string to titlecase

### toUppercase(string s) string¶

Converts a string to uppercase

### trim(string s) string¶

Returns copy of the string, with leading and trailing whitespace omitted.

### unescape(string s, string mode) string¶

Unescapes all escaped parts of the string depending on the given escaping mode. Available modes: ‘html’,’xml’,’csv’,’url’,’javascript’

### unicode(string s) string¶

Returns the input string as an array of the unicode codepoints (numbers)

### unicodeType(string s) string¶

Returns an array of strings describing each character of the input string in their full unicode notation

## Value access functions¶

### numval(object o) number¶

Returns the numerical value of a given cell or expression. If the value of the cell or expression is not numerical, an empty value is returned.

For example, with columns `type` (with values “high” and “low”), `cost_high` and `cost_low`, the expression `numval('cost_' + type)` produces:

cost_low cost_high type output
200 500 low 200.0
100 400 high 400.0
xx 400 low

### strval(object o, optional string defaultValue) string¶

Returns the numerical value of a given cell or expression, or the default value if the value of the cell or expression is empty.

For example, with columns `type` (with values “high” and “low”), `code_high` and `code_low`, the expression `strval('code_' + type, 'N/A')` produces:

code_low code_high type output
200 500 low 200
100 4x0 high 4x0
400 low N/A

## Control structures¶

Control structures allow you to perform advanced operations.

Beware: control structures cannot use object notation !

### filter(array a, variable v, expression e) (returns: array)¶

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.

Note

`filter(myarray, v, v < 2)` is equivalent to this Python syntax: `[v for v in myarray if v < 2]` and returns the array with only the elements below 2

### forEeach(array a, variable v, expression e) (returns: array)¶

Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.

Note

`forEach(myarray, v, v + 2)` is equivalent to this Python syntax: `[v+2 for v in myarray]`

### forEachIndex(array a, variable i, variable v, expression e) (returns: array)¶

Evaluates expression a to an array. Then for each array element, binds its value to variable name v and its index to variable name i, evaluates expression e, and pushes the result onto the result array.

Note

`forEachIndex(myarray, i, v, v + i)` is equivalent to this Javascript syntax: `myarray.map(function(v, i) { return v+i ;})`

### forRange(from, to, step, variable v, expression e) (returns: array)¶

Iterates, tarting at from, incrementing by step each time while less than to. At each iteration, binds the variable v to the iteration value, evaluates expression e, and pushes the result onto the result array.

Note

`forRange(0, 100, 3, v, v * 2)` is equivalent to this Python syntax: `[v * 2 for v in xrange(0, 100, 3)]` and returns [0, 6, 12, 18 .... 198]

### if(boolean, expression_true, expression_false) (returns: anything)¶

Evaluates to expression_true if the condition is true, to expression_false otherwise

### objectFilter(expression a, variable k, variable v, expression test) object¶

Evaluates expression k to an object. Then for each object element (k, v), binds its key to variable name k, its value to variable name v, evaluates expression test which should return a boolean. If the boolean is true, pushes (k, v) onto the result object.

### with(expression o, variable v, expression e) (returns: any)¶

Evaluates expression o and binds its value to variable name v. Then evaluates expression e and returns that result.

The `with` control allows you to “split” a very big expression into more manageable chunks. It also makes reusing the result of a complex computation easier to read and faster to process.

## Tests¶

Beware: tests cannot use object notation

### isBlank(expression o) boolean¶

Returns whether o is null or an empty string

### isError(expression o) boolean¶

Returns whether o is an error

### isNonBlank(expression o) boolean¶

Returns whether o is not null and not an empty string

### isNotNull(expression o) boolean¶

Returns whether o is not null

### isNull(expression o) boolean¶

Returns whether o is null

### isNumeric(expression o) boolean¶

Returns whether o can represent a number