Writing scalar functions in DuckDB extensions
The analytical database system DuckDB offers an extension mechanism allowing you to add your own functionality to the system. The C++ extension template provides a starting point for extensions and contains two example scalar functions. This blog post goes into more detail on how to write such functions.
What I write in this blog post is what I’ve learned while experimenting with DuckDB extensions (I am not affiliated with DuckDB), so mistakes are possible :) This post is also just an introduction: there is a lot more that won’t be discussed here!
The examples in this post are bundled in a GitHub repository that is based on the extension template, so you can run it and try all functions out yourself. The scope is limited to scalar functions, i.e. functions that take one or more values as input and return another value (such as replace
, exp
, and many more). Out of scope topics include aggregate functions and table functions.
The basics
We’ll start with the example that’s provided in the DuckDB extension template, the quack
scalar function:
inline void QuackScalarFun(DataChunk &args, ExpressionState &state,
Vector &result) {
auto &name_vector = args.data[0];
UnaryExecutor::Execute<string_t, string_t>(
name_vector, result, args.size(), [&](string_t name) {
return StringVector::AddString(result,
"Quack " + name.GetString() + " 🐥");
});
}
// ...
static void LoadInternal(DatabaseInstance &instance) {
auto quack_scalar_function = ScalarFunction(
"quack", {LogicalType::VARCHAR}, LogicalType::VARCHAR, QuackScalarFun);
ExtensionUtil::RegisterFunction(instance, quack_scalar_function);
// ...
}
The quack(VARCHAR) -> VARCHAR
function is implemented by QuackScalarFun
. This function takes three arguments:
- A
DataChunk
, representing a set of inputVector
s, one for each argument of the scalar function. - An
ExpressionState
, containing information about the query’s expression state. (This argument is beyond the scope of this post.) - A
Vector
to store the result values. This is a logical representation of an array with data of a single type. There are different formats of vectors.
UnaryExecutor::Execute<TA, TR>
enables an efficient evaluation of a function over the contents of a vector, regardless of its format. The passed lambda expression takes an argument of type TA
and returns a value of type TR
. Both are duckdb::string_t
in this situation because quack
takes and returns a VARCHAR
. The string_t
type has a GetString()
method to obtain an std::string
. (string_t
is also used for BLOB
s, see later this post.)
The call to StringVector::AddString
is specific to functions returning a string. If you return a primitive type, such as double
, just return the value directly.
In LoadInternal
, we register our quack
function. We create an instance of ScalarFunction
and pass the necessary details: function name, types of input arguments, result type, and implementation. Then we do the actual registration with ExtensionUtil::RegisterFunction
.
There are other executors similar to UnaryExecutor
for functions with different arities. Here’s an example of a ternary function, discriminant(a, b, c)
, using the TernaryExecutor
to return the discriminant of the quadratic equation ax^2 + bx + c = 0
:
inline void DiscriminantScalarFun(DataChunk &args, ExpressionState &state,
Vector &result) {
auto &a_vector = args.data[0];
auto &b_vector = args.data[1];
auto &c_vector = args.data[2];
TernaryExecutor::Execute<double, double, double, double>(
a_vector, b_vector, c_vector, result, args.size(),
[&](double a, double b, double c) {
auto discriminant = b * b - 4 * a * c;
return discriminant;
});
}
// ...
static void LoadInternal(DatabaseInstance &instance) {
// ...
auto discriminant_scalar_function = ScalarFunction(
"discriminant",
{LogicalType::DOUBLE, LogicalType::DOUBLE, LogicalType::DOUBLE},
LogicalType::DOUBLE, DiscriminantScalarFun);
ExtensionUtil::RegisterFunction(instance, discriminant_scalar_function);
// ...
}
Build the extension and give it a try:
D select discriminant(2, -15, 8);
┌─────────────────────────┐
│ discriminant(2, -15, 8) │
│ double │
├─────────────────────────┤
│ 161.0 │
└─────────────────────────┘
Types
Some of the SQL types have a straightforward equivalent in C++ (e.g., DOUBLE
-> double
as used in the previous example, INTEGER
-> int32_t
). Others may need some extra explanation and are discussed here.
BLOB
Just like VARCHAR
s, BLOB
s are represented with duckdb::string_t
. As we have seen in the quack
example, this type provides a GetString
method, but when dealing with BLOB
s, we likely want to work with the raw bytes instead. string_t
also provides the const char* GetData()
and idx_t GetSize()
methods. GetData
returns a pointer to a singed char; to obtain a byte pointer (const uint8_t *
) instead, use const_data_ptr_cast(yourBlob.GetData())
.
BITSTRING
A BITSTRING
is a variable-length string of 1s and 0s. This type is also represented with string_t
, encoded in the manner described in src/common/types/bit.cpp:
First byte in bitstring contains amount of padded bits,
second byte in bitstring is the padded byte,
therefore the rest of the data starts at data + 2 (third byte)
Date types
A DATE
is represented using the date_t
struct, of which the days
field contains the number of days since 1970-01-01.
An INTERVAL
is represented using the interval_t
struct, which has three fields you can make use of: months
, days
, and micros
.
A TIMESTAMP
is represented using the timestamp_t
structs, of which the value
field contains the number of microseconds since 1970-01-01. The same file contains structs for the other timestamp types.
All of these structs can be used as input or result type with Execute
.
Returning NULL
The Execute
method from the previous examples will automatically set the result to NULL if any of the input arguments is NULL. However, if we want to choose to return NULL ourselves, we need to use the ExecuteWithNulls
method instead. Its usage is similar to Execute
, but the lambda takes two extra arguments: a ValidityMask
and an index, that can be used to mark a vector element as “invalid” (NULL).
Let’s demonstrate this by adding a function fibonacci(INTEGER) -> BIGINT
that returns the n’th Fibonacci number, except if n < 0
(undefined) or n > 92
(result does not fit in a 64-bit integer):
inline void FibonacciScalarFun(DataChunk &args, ExpressionState &state,
Vector &result) {
auto &input_vector = args.data[0];
auto Phi = (1 + sqrt(5)) / 2;
auto phi = Phi - 1;
UnaryExecutor::ExecuteWithNulls<int32_t, int64_t>(
input_vector, result, args.size(),
[&](int32_t input, ValidityMask &mask, idx_t idx) {
if (input >= 0 && input < 93) {
return lround((pow(Phi, input) - pow(-phi, input)) / sqrt(5));
} else {
mask.SetInvalid(idx);
return 0l;
}
});
}
// ...
static void LoadInternal(DatabaseInstance &instance) {
// ...
auto fibonacci_scalar_function =
ScalarFunction("fibonacci", {LogicalType::INTEGER}, LogicalType::BIGINT,
FibonacciScalarFun);
ExtensionUtil::RegisterFunction(instance, fibonacci_scalar_function);
// ...
}
D select fibonacci(5), fibonacci(-3), fibonacci(100);
┌──────────────┬───────────────┬────────────────┐
│ fibonacci(5) │ fibonacci(-3) │ fibonacci(100) │
│ int64 │ int64 │ int64 │
├──────────────┼───────────────┼────────────────┤
│ 5 │ NULL │ NULL │
└──────────────┴───────────────┴────────────────┘
Setting the result to NULL only takes a call of mask.SetInvalid(idx)
.
(Note that if you want to handle NULL as input differently than “NULL in, NULL out”, neither Execute
nor ExecuteWithNulls
can help and you’ll need to operate on the vectors directly.)
Nested types and GenericExecutor
The discussed ...Executor
types so far do not provide a means to return nested types, such as STRUCT
or LIST
. That is where the GenericExecutor
comes into play.
Returning a STRUCT or LIST
We’re going to write a function solve_quadratic_equation(DOUBLE, DOUBLE, DOUBLE) -> STRUCT(x1 DOUBLE, x2 DOUBLE)
that returns the solutions to the quadratic equation ax^2 + bx + c = 0
. The structure looks similar to what we’ve seen before, but pay attention to the template parameters specifying the input and result types:
#include "duckdb/common/vector_operations/generic_executor.hpp"
// ...
inline void SolveQuadraticEquationScalarFunc(DataChunk &args,
ExpressionState &state,
Vector &result) {
auto &a_vector = args.data[0];
auto &b_vector = args.data[1];
auto &c_vector = args.data[2];
GenericExecutor::ExecuteTernary<
PrimitiveType<double>, PrimitiveType<double>,
PrimitiveType<double>, StructTypeBinary<double, double>>(
a_vector, b_vector, c_vector, result, args.size(),
[&](PrimitiveType<double> a, PrimitiveType<double> b,
PrimitiveType<double> c) {
auto discriminant = b.val * b.val - 4 * a.val * c.val;
StructTypeBinary<double, double> solution;
solution.a_val = (-b.val + sqrt(discriminant)) / (2 * a.val);
solution.b_val = (-b.val - sqrt(discriminant)) / (2 * a.val);
return solution;
});
}
static void LoadInternal(DatabaseInstance &instance) {
// ...
child_list_t<LogicalType> quadratic_equation_solution_child_types;
quadratic_equation_solution_child_types.push_back(
std::make_pair("x1", LogicalType::DOUBLE));
quadratic_equation_solution_child_types.push_back(
std::make_pair("x2", LogicalType::DOUBLE));
auto solve_quadratic_equation_scalar_function = ScalarFunction(
"solve_quadratic_equation",
{LogicalType::DOUBLE, LogicalType::DOUBLE, LogicalType::DOUBLE},
LogicalType::STRUCT(quadratic_equation_solution_child_types),
SolveQuadraticEquationScalarFunc);
ExtensionUtil::RegisterFunction(instance,
solve_quadratic_equation_scalar_function);
// ...
}
Trying it out:
D select solve_quadratic_equation(1, -7, 12);
┌─────────────────────────────────────┐
│ solve_quadratic_equation(1, -7, 12) │
│ struct(x1 double, x2 double) │
├─────────────────────────────────────┤
│ {'x1': 4.0, 'x2': 3.0} │
└─────────────────────────────────────┘
Similar to TernaryExecutor::Execute
, GenericExecutor::ExecuteTernary
takes four template parameters, three for the input types and one for the output type. They look a bit different though: PrimitiveType<INPUT_TYPE>
represents a primitive type, and StructTypeBinary<A_TYPE, B_TYPE>
represents a STRUCT
type with two fields. PrimitiveType
has a val
field containing the underlying value, and StructTypeBinary
has an a_val
and b_val
field for the two values in the struct (which, to be clear, have nothing to do with our a
and b
input values). Similar types also exist for STRUCT
s with different arities.
The registration of the function in LoadInternal
is mostly the same as for the other functions, only the description of the result type is more extensive. The names and types of the STRUCT
’s fields have to be constructed as a child_list_t<LogicalType>
and passed to LogicalType::STRUCT
.
Returning a LIST
is very similar: instead of StructTypeBinary
, use GenericListType<CHILD_TYPE>
, which has a values
field that is an std::vector
of instances of the child type. As LogicalType
in the ScalarFunction
constructor, use LogicalType::LIST(<child LogicalType>)
.
Custom result type with AssignResult
THe implementation of solve_quadratic_equation
is definitely not perfect. If we apply it on an equation with no solutions, e.g. solve_quadratic_equation(1, 2, 3)
, the result is {'x1': -nan, 'x2': -nan}
. And if we choose 0
for the first argument, the result is {'x1': -nan, 'x2': -inf}
, while the equation actually became a linear equation with x = -c / b
as solution (assuming b
isn’t 0 as well).
Let’s say we want to return NULL in those situations, rather than the struct. GenericExecutor
does not have an equivalent of the ExecuteWithNulls
method, but we can take another approach. Instead of choosing StructTypeBinary
as result type, we can create our own type. This type must have a static method AssignResult
that assigns an instance to the result vector. We can use StructTypeBinary
’s implementation of this method as inspiration for our custom type:
struct QuadraticEquationSolution {
double x1;
double x2;
bool exists;
static void AssignResult(Vector &result, idx_t i,
QuadraticEquationSolution solution) {
auto &entries = StructVector::GetEntries(result);
if (!solution.exists) {
FlatVector::SetNull(result, i, true);
} else {
FlatVector::GetData<double>(*entries[0])[i] = solution.x1;
FlatVector::GetData<double>(*entries[1])[i] = solution.x2;
}
}
};
inline void SolveQuadraticEquation2ScalarFunc(DataChunk &args,
ExpressionState &state,
Vector &result) {
auto &a_vector = args.data[0];
auto &b_vector = args.data[1];
auto &c_vector = args.data[2];
GenericExecutor::ExecuteTernary<
PrimitiveType<double>, PrimitiveType<double>,
PrimitiveType<double>, QuadraticEquationSolution>(
a_vector, b_vector, c_vector, result, args.size(),
[&](PrimitiveType<double> a, PrimitiveType<double> b,
PrimitiveType<double> c) {
auto discriminant = b.val * b.val - 4 * a.val * c.val;
QuadraticEquationSolution solution;
if (discriminant >= 0) {
solution.exists = true;
solution.x1 = (-b.val + sqrt(discriminant)) / (2 * a.val);
solution.x2 = (-b.val - sqrt(discriminant)) / (2 * a.val);
} else {
solution.exists = false;
}
return solution;
});
}
// ...
static void LoadInternal(DatabaseInstance &instance) {
// ...
auto solve_quadratic_equation_scalar_function2 = ScalarFunction(
"solve_quadratic_equation2",
{LogicalType::DOUBLE, LogicalType::DOUBLE, LogicalType::DOUBLE},
LogicalType::STRUCT(quadratic_equation_solution_child_types),
SolveQuadraticEquation2ScalarFunc);
ExtensionUtil::RegisterFunction(instance,
solve_quadratic_equation_scalar_function2);
// ...
}
Taking a STRUCT as input argument
StructTypeBinary
(and the other StructType...
types) cannot only be used as result type, but also as input type. Let’s implement a quadratic_equation_from_solution(STRUCT(x1 DOUBLE, x2 DOUBLE)) -> VARCHAR
function that outputs a quadratic equation for which x1
and x2
are solutions:
inline void QuadraticEquationFromSolutionScalarFunc(DataChunk &args,
ExpressionState &state,
Vector &result) {
auto &solution_vector = args.data[0];
GenericExecutor::ExecuteUnary<StructTypeBinary<double, double>,
PrimitiveType<string_t>>(
solution_vector, result, args.size(),
[&](StructTypeBinary<double, double> solution) {
double x1 = solution.a_val;
double x2 = solution.b_val;
double b = -x1 - x2;
double c = x1 * x2;
return StringVector::AddString(result, "x^2 + " + to_string(b) +
"x + " + to_string(c));
});
}
// ...
static void LoadInternal(DatabaseInstance &instance) {
// ...
auto quadratic_equation_from_solution_scalar_function = ScalarFunction(
"quadratic_equation_from_solution",
{LogicalType::STRUCT(quadratic_equation_solution_child_types)},
LogicalType::VARCHAR, QuadraticEquationFromSolutionScalarFunc);
ExtensionUtil::RegisterFunction(
instance, quadratic_equation_from_solution_scalar_function);
}
D select quadratic_equation_from_solution({ 'x1': 4, 'x2': 3});
┌──────────────────────────────────────────────────────────────────────┐
│ quadratic_equation_from_solution(main.struct_pack(x1 := 4, x2 := 3)) │
│ varchar │
├──────────────────────────────────────────────────────────────────────┤
│ x^2 + -7.000000x + 12.000000 │
└──────────────────────────────────────────────────────────────────────┘
Any type that has a static ConstructType
method (see for example the StructTypeBinary
implementation) can be used to represent an input type (which, as you can see, is more complicated than implementing AssignResult
). Also note that while GenericListType
implements AssignResult
, it currently does not implement ConstructType
.
Where to find more examples
That’s it for now! This post still only scratches the surface of scalar functions in DuckDB – as demonstrated by the extensive constructor of ScalarFunction
. Nevertheless, this should be enough to get you started with writing your own extensions with scalar functions! All examples in this post are bundled in a GitHub repository. You can find more examples in these places:
- The DuckDB source code (specifically extension/core_functions/scalar and src/function/scalar) contains the implementation of DuckDB’s built-in functions.
- The
spatial
extension implements many scalar functions. - A bit beyond the scope of scalar functions, but the
ulid
extension is a great reference extension if you want to see how you can add a custom datatype and associated functions.
If you have any feedback, feel free to open an issue on the GitHub repository or contact me on Bluesky.