Most Arquero verbs accept table expressions: functions defined over table column values. For example, the derive
verb creates new columns based on the provided expressions:
table.derive({
  raise: d => op.pow(d.col1, d.col2),
  'col diff': d => d.col1 - d['base col']
})
In the example above, the two arrow function expressions are table expressions. The input argument d
represents a row of the data table, whose properties are column names. Table expressions can include standard JavaScript expressions and invoke functions defined on the op
object, which, depending on the context, may include standard, aggregate, or window functions.
At first glance table expressions look like normal JavaScript functions… but hold on! Under the hood, Arquero takes a set of function definitions, maps them to strings, then parses, rewrites, and compiles them to efficiently manage data internally. From Arquero’s point of view, the following examples are all equivalent:
function(d) { return op.sqrt(d.value); }
d => op.sqrt(d.value)
({ value }) => op.sqrt(value)
d => sqrt(d.value)
d => aq.op.sqrt(d.value)
"d => op.sqrt(d.value)"
"sqrt(d.value)"
Examples 1 through 5 are function definitions, while examples 6 and 7 are string literals. Let’s walk through each:
Although an explicit op object can be used to invoke functions, it is not required. For any function invocation, the function name will be looked up on the op object, even if the function is called directly (as in Example 4) or as the result of a nested property lookup (Example 5). Internally, Arquero’s parser doesn’t care if you call sqrt(), op.sqrt(), or aq.op.sqrt(); any will work. That said, using an explicit op object avoids errors and allows linting and auto-complete to proceed unimpeded.
When a string expression omits the function signature (as in Example 7), the row is implicitly named d; using an identifier other than d will fail. In contrast, with an explicit function definition you are free to rename the argument as you see fit.
A number of JavaScript features are not allowed in table expressions, including internal function definitions, variable updates, and for loops. The only function calls allowed are those provided by the op object. (Why? Read below for more…) Most notably, parsed table expressions do not support closures. As a result, table expressions cannot access variables defined in the enclosing scope.
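The stringify-then-recompile idea, and the loss of closures that follows from it, can be illustrated in plain JavaScript. The sketch below is a toy model and not Arquero's implementation: compileExpression and the one-function op object are hypothetical stand-ins, and the real library parses and rewrites the code rather than handing it to the Function constructor.

```javascript
// Hypothetical stand-ins for Arquero's op object and aq namespace.
const op = { sqrt: Math.sqrt };
const aq = { op };

// Toy compiler: build a fresh function from source text alone,
// roughly as a parse-and-regenerate step would.
function compileExpression(expr) {
  let src = String(expr).trim();
  // A bare expression such as "sqrt(d.value)" gets the implicit row name d.
  const hasSignature = src.startsWith('function') || src.includes('=>');
  if (!hasSignature) src = `d => (${src})`;
  // Make op functions reachable as sqrt(), op.sqrt(), and aq.op.sqrt().
  const names = Object.keys(op);
  const factory = new Function(...names, 'op', 'aq', `return (${src});`);
  return factory(...names.map(name => op[name]), op, aq);
}

// The seven forms from the list above; only their source text is used.
const forms = [
  function(d) { return op.sqrt(d.value); },
  d => op.sqrt(d.value),
  ({ value }) => op.sqrt(value),
  d => sqrt(d.value),
  d => aq.op.sqrt(d.value),
  'd => op.sqrt(d.value)',
  'sqrt(d.value)'
];
const results = forms.map(f => compileExpression(f)({ value: 9 }));

// Closures do not survive: the regenerated function lives in a different
// scope, so the local variable threshold is no longer visible to it.
function closureSurvives() {
  const threshold = 5;
  const recompiled = compileExpression(d => d.value < threshold);
  try {
    recompiled({ value: 1 });
    return true;
  } catch (err) {
    return false; // a ReferenceError for threshold lands here
  }
}
```

In this model every entry of results is 3, and closureSurvives() returns false, which mirrors why parsed table expressions cannot see variables from the enclosing scope.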
To include external variables in a table expression, use the params() method to bind a parameter value to a table context. Parameters can then be accessed by including a second argument to a table expression; all bound parameters are available as properties of that argument (default name $):
table
  .params({ threshold: 5 })
  .filter((d, $) => d.value < $.threshold)
To pass in a standard JavaScript function that will be called directly (rather than parsed and rewritten), use the escape()
expression helper. Escaped functions do support closures and so can refer to variables defined in an enclosing scope. However, escaped functions do not support aggregate or window operations; they also sidestep internal optimizations and result in an error when attempting to serialize Arquero queries (for example, to pass transformations to a worker thread).
const threshold = 5;
table.filter(aq.escape(d => d.value < threshold))
Alternatively, for programmatic generation of table expressions, one can fall back to generating a string (rather than a proper function definition) and use that instead:
// note: threshold must safely coerce to a string!
const threshold = 5;
table.filter(`d => d.value < ${threshold}`)
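Raw interpolation works for numbers, but a string value loses its quotes and turns into an identifier reference, which is the hazard the comment about safe coercion points at. One possible safeguard, sketched here with a hypothetical helper named literal (not part of Arquero), is to serialize values with JSON.stringify so they appear in the generated source as valid JavaScript literals:

```javascript
// Hypothetical helper: render a value as a source-code literal.
// JSON.stringify quotes and escapes strings and leaves numbers bare.
function literal(value) {
  const text = JSON.stringify(value);
  if (text === undefined) {
    // undefined, functions, and symbols have no JSON form.
    throw new TypeError('Value has no literal form: ' + String(value));
  }
  return text;
}

const threshold = 5;
const surname = 'O"Neil';

// Naive interpolation: the quotes are lost, so abc reads as an identifier.
const naive = `d => d.name === ${'abc'}`;

// Serialized interpolation keeps numbers and strings intact.
const byNumber = `d => d.value < ${literal(threshold)}`;
const bySurname = `d => d.name === ${literal(surname)}`;
```

The resulting strings can be passed to a verb in the same way as the template string above. Note that JSON.stringify renders NaN and Infinity as null, so those values would need separate handling.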
Some verbs, including groupby(), orderby(), fold(), pivot(), and join(), accept shorthands such as column name strings. Given a table with columns colA and colB (in that order), the following are equivalent:
table.groupby('colA', 'colB') - Refer to columns by name
table.groupby(['colA', 'colB']) - Use an explicit array of names
table.groupby(0, 1) - Refer to columns by index
table.groupby(aq.range(0, 1)) - Use a column range helper
table.groupby({ colA: d => d.colA, colB: d => d.colB }) - Explicit table expressions
Under the hood, all of these variants are reduced to table expressions.
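One way to picture this is a small normalizer that turns names, indices, and arrays into an object of expression strings. The function below is a simplified, hypothetical model: resolveColumns is an invented name, helpers such as aq.range are ignored, and Arquero's own resolution logic may differ.

```javascript
// Hypothetical model: resolve column shorthands against a list of
// column names, yielding { outputName: expressionString } entries.
function resolveColumns(columnNames, ...selections) {
  const result = {};
  for (const sel of selections.flat()) {
    if (typeof sel === 'number') {
      // Column index: look up the name at that position.
      const name = columnNames[sel];
      if (name === undefined) throw new RangeError('No column at index ' + sel);
      result[name] = `d => d[${JSON.stringify(name)}]`;
    } else if (typeof sel === 'string') {
      // Column name: reference the column directly.
      result[sel] = `d => d[${JSON.stringify(sel)}]`;
    } else {
      // Object of explicit expressions: keep each one as source text.
      for (const [name, expr] of Object.entries(sel)) {
        result[name] = String(expr);
      }
    }
  }
  return result;
}

const columns = ['colA', 'colB'];
const byName = resolveColumns(columns, 'colA', 'colB');
const byArray = resolveColumns(columns, ['colA', 'colB']);
const byIndex = resolveColumns(columns, 0, 1);
```

In this model the three calls produce the same object, with one expression string per output column.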
For aggregate and window functions, use of the op object outside of a table expression allows the use of shorthand references. The following examples are equivalent:
d => op.mean(d.value) - Standard table expression
op.mean('value') - Shorthand table expression generator
The second example produces an object that, when coerced to a string, generates 'd => op.mean(d["value"])' as a result.
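The coercion behavior can be pictured with a plain object whose toString method emits the expression source. This is a hypothetical sketch of the idea (meanShorthand is an invented name), not the object Arquero actually returns:

```javascript
// Hypothetical shorthand generator: returns an object that becomes a
// table expression string when coerced with String() or a template.
function meanShorthand(column) {
  return {
    toString() {
      return `d => op.mean(d[${JSON.stringify(column)}])`;
    }
  };
}

const expr = meanShorthand('value');
const source = String(expr); // 'd => op.mean(d["value"])'
```

Using bracket notation with a quoted name keeps the generated source valid even for column names that are not legal identifiers, such as names containing spaces.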
For join verbs, Arquero also supports two-table table expressions. Two-table expressions have an expanded signature that accepts two rows as input, one from the “left” table and one from the “right” table.
table.join(otherTable, (a, b) => op.equal(a.key, b.key))
The use of aggregate and window functions is not allowed within two-table expressions. Otherwise, two-table expressions have the same capabilities and limitations as normal (single-table) table expressions.
Bound parameters can be accessed by including a third argument:
table
  .params({ threshold: 1.5 })
  .join(otherTable, (a, b, $) => op.abs(a.value - b.value) < $.threshold)
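Semantically, a join predicate of this form can be read as a test applied to pairs of rows, with the bound parameters supplied as the third argument. The nested-loop function below is a deliberately naive model of that reading. crossFilterJoin and the sample rows are hypothetical, Math.abs stands in for op.abs so the sketch runs on its own, and Arquero's actual join strategy may differ.

```javascript
// Naive model of a join driven by a two-table predicate (a, b, $).
function crossFilterJoin(left, right, predicate, params = {}) {
  const out = [];
  for (const a of left) {
    for (const b of right) {
      // Keep the pair only when the predicate holds for both rows.
      if (predicate(a, b, params)) out.push({ ...a, ...b });
    }
  }
  return out;
}

// Hypothetical sample data with distinct column names on each side.
const left = [{ key: 1, value: 1.0 }, { key: 2, value: 4.0 }];
const right = [{ id: 'p', other: 2.0 }, { id: 'q', other: 4.5 }];

const joined = crossFilterJoin(
  left,
  right,
  (a, b, $) => Math.abs(a.value - b.other) < $.threshold,
  { threshold: 1.5 }
);
```

With these rows, only the pairs whose values differ by less than the threshold are kept: key 1 with id p, and key 2 with id q.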
Rather than writing explicit two-table expressions, join verbs can also accept column shorthands in the form of a two-element array: the first element of the array is either a string or string array with columns in the first (left) table, whereas the second element indicates columns in the second (right) table.
Given two tables (one with columns x, y and the other with columns u, v), the following examples are equivalent:
table.join(other, ['x', 'u'], [['x', 'y'], 'v'])
table.join(other, [['x'], ['u']], [['x', 'y'], ['v']])
table.join(other, ['x', 'u'], [aq.all(), aq.not('u')])
All of these are in turn equivalent to using the following two-table expressions:
table.join(other, ['x', 'u'], {
  x: (a, b) => a.x,
  y: (a, b) => a.y,
  v: (a, b) => b.v
})
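The expansion from an output shorthand to explicit two-table expressions can be sketched as a small helper. expandJoinOutput is a hypothetical name, and the sketch handles only strings and string arrays, not selection helpers such as aq.all() or aq.not():

```javascript
// Hypothetical model: expand [leftColumns, rightColumns] into an object
// of two-table expression strings, one per output column.
function expandJoinOutput([leftCols, rightCols]) {
  const output = {};
  // Wrapping and flattening accepts both 'v' and ['v'].
  for (const name of [leftCols].flat()) {
    output[name] = `(a, b) => a[${JSON.stringify(name)}]`;
  }
  for (const name of [rightCols].flat()) {
    output[name] = `(a, b) => b[${JSON.stringify(name)}]`;
  }
  return output;
}

const fromMixed = expandJoinOutput([['x', 'y'], 'v']);
const fromArrays = expandJoinOutput([['x', 'y'], ['v']]);
```

Both calls yield entries for x and y that read from the left row a, and an entry for v that reads from the right row b.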
Why are only op functions supported?
Any function that is callable within an Arquero table expression must be defined on the op object, either as a built-in function or added via the extensibility API. Why is this the case?
As described earlier, Arquero table expressions can look like normal JavaScript functions, but are treated specially: their source code is parsed and new custom functions are generated to process data. This process prevents the use of closures, such as referencing functions or values defined externally to the expression.
So why do we do this? Here are a few reasons:
Performance. After parsing an expression, Arquero performs code generation, often creating more performant code in the process. This level of indirection also allows us to generate optimized expressions for certain inputs, such as Apache Arrow data.
Flexibility. Providing our own parsing also allows us to introduce new kinds of backing data without changing the API. For example, we could add support for different underlying data formats and storage layouts. More importantly, it also allows us to analyze expressions and incorporate aggregate and window functions in otherwise “normal” JavaScript expressions.
Discoverability. Defining all functions on a single object provides a single catalog of all available operations. In most IDEs, you can simply type op. (and perhaps hit the tab key) to see a list of all available functions and benefit from auto-complete!
Of course, one might wish to make different trade-offs. Arquero is designed to support common use cases while also being applicable to more complex production setups. This goal comes with the cost of more rigid management of functions. However, Arquero can be extended with custom variables, functions, and even new table methods or verbs! As starting points, see the params and addFunction methods to introduce external variables or register new op
functions.
All that being said, Arquero provides an escape hatch: use the escape()
expression helper to apply a standard JavaScript function as-is, skipping any internal parsing and code generation. As a result, escaped functions do not support aggregation and window operations, as these depend on Arquero’s internal parsing and code generation.