Writing scalar functions in DuckDB extensions

The analytical database system DuckDB offers an extension mechanism allowing you to add your own functionality to the system. The C++ extension template provides a starting point for extensions and contains two example scalar functions. This blog post goes into more detail on how to write such functions.

What I write in this blog post is what I’ve learned while experimenting with DuckDB extensions (I am not affiliated with DuckDB), so mistakes are possible :) This post is also just an introduction: there is a lot more that won’t be discussed here!

The examples in this post are bundled in a GitHub repository that is based on the extension template, so you can run it and try all functions out yourself. The scope is limited to scalar functions, i.e. functions that take one or more values as input and return another value (such as replace, exp, and many more). Out of scope topics include aggregate functions and table functions.

The basics
Types
Returning NULL
Nested types and GenericExecutor
Where to find more examples

The basics

We’ll start with the example that’s provided in the DuckDB extension template, the quack scalar function:

inline void QuackScalarFun(DataChunk &args, ExpressionState &state,
                           Vector &result) {
  auto &name_vector = args.data[0];
  UnaryExecutor::Execute<string_t, string_t>(
      name_vector, result, args.size(), [&](string_t name) {
        return StringVector::AddString(result,
                                       "Quack " + name.GetString() + " 🐥");
      });
}

// ...

static void LoadInternal(DatabaseInstance &instance) {
  auto quack_scalar_function = ScalarFunction(
      "quack", {LogicalType::VARCHAR}, LogicalType::VARCHAR, QuackScalarFun);
  ExtensionUtil::RegisterFunction(instance, quack_scalar_function);

  // ...
}

The quack(VARCHAR) -> VARCHAR function is implemented by QuackScalarFun. This function takes three arguments:

A DataChunk, representing a set of input Vectors, one for each argument of the scalar function.
An ExpressionState, containing information about the query’s expression state. (This argument is beyond the scope of this post.)
A Vector to store the result values. This is a logical representation of an array with data of a single type. There are different formats of vectors.

UnaryExecutor::Execute<TA, TR> enables an efficient evaluation of a function over the contents of a vector, regardless of its format. The passed lambda expression takes an argument of type TA and returns a value of type TR. Both are duckdb::string_t in this situation because quack takes and returns a VARCHAR. The string_t type has a GetString() method to obtain an std::string. (string_t is also used for BLOBs, see later this post.)

The call to StringVector::AddString is specific to functions returning a string. If you return a primitive type, such as double, just return the value directly.

In LoadInternal, we register our quack function. We create an instance of ScalarFunction and pass the necessary details: function name, types of input arguments, result type, and implementation. Then we do the actual registration with ExtensionUtil::RegisterFunction.

There are other executors similar to UnaryExecutor for functions with different arities. Here’s an example of a ternary function, discriminant(a, b, c), using the TernaryExecutor to return the discriminant of the quadratic equation ax^2 + bx + c = 0:

inline void DiscriminantScalarFun(DataChunk &args, ExpressionState &state,
                                  Vector &result) {
  auto &a_vector = args.data[0];
  auto &b_vector = args.data[1];
  auto &c_vector = args.data[2];

  TernaryExecutor::Execute<double, double, double, double>(
      a_vector, b_vector, c_vector, result, args.size(),
      [&](double a, double b, double c) {
        auto discriminant = b * b - 4 * a * c;
        return discriminant;
      });
}

// ...

static void LoadInternal(DatabaseInstance &instance) {
  // ...

  auto discriminant_scalar_function = ScalarFunction(
      "discriminant",
      {LogicalType::DOUBLE, LogicalType::DOUBLE, LogicalType::DOUBLE},
      LogicalType::DOUBLE, DiscriminantScalarFun);
  ExtensionUtil::RegisterFunction(instance, discriminant_scalar_function);

  // ...
}

Build the extension and give it a try:

D select discriminant(2, -15, 8);
┌─────────────────────────┐
│ discriminant(2, -15, 8) │
│         double          │
├─────────────────────────┤
│          161.0          │
└─────────────────────────┘

Types

Some of the SQL types have a straightforward equivalent in C++ (e.g., DOUBLE -> double as used in the previous example, INTEGER -> int32_t). Others may need some extra explanation and are discussed here.

BLOB

Just like VARCHARs, BLOBs are represented with duckdb::string_t. As we have seen in the quack example, this type provides a GetString method, but when dealing with BLOBs, we likely want to work with the raw bytes instead. string_t also provides the const char* GetData() and idx_t GetSize() methods. GetData returns a pointer to a singed char; to obtain a byte pointer (const uint8_t *) instead, use const_data_ptr_cast(yourBlob.GetData()).

BITSTRING

A BITSTRING is a variable-length string of 1s and 0s. This type is also represented with string_t, encoded in the manner described in src/common/types/bit.cpp:

First byte in bitstring contains amount of padded bits,
second byte in bitstring is the padded byte,
therefore the rest of the data starts at data + 2 (third byte)

Date types

A DATE is represented using the date_t struct, of which the days field contains the number of days since 1970-01-01.

An INTERVAL is represented using the interval_t struct, which has three fields you can make use of: months, days, and micros.

A TIMESTAMP is represented using the timestamp_t structs, of which the value field contains the number of microseconds since 1970-01-01. The same file contains structs for the other timestamp types.

All of these structs can be used as input or result type with Execute.

Returning NULL

The Execute method from the previous examples will automatically set the result to NULL if any of the input arguments is NULL. However, if we want to choose to return NULL ourselves, we need to use the ExecuteWithNulls method instead. Its usage is similar to Execute, but the lambda takes two extra arguments: a ValidityMask and an index, that can be used to mark a vector element as “invalid” (NULL).

Let’s demonstrate this by adding a function fibonacci(INTEGER) -> BIGINT that returns the n’th Fibonacci number, except if n < 0 (undefined) or n > 92 (result does not fit in a 64-bit integer):

inline void FibonacciScalarFun(DataChunk &args, ExpressionState &state,
                               Vector &result) {
  auto &input_vector = args.data[0];

  auto Phi = (1 + sqrt(5)) / 2;
  auto phi = Phi - 1;

  UnaryExecutor::ExecuteWithNulls<int32_t, int64_t>(
      input_vector, result, args.size(),
      [&](int32_t input, ValidityMask &mask, idx_t idx) {
        if (input >= 0 && input < 93) {
          return lround((pow(Phi, input) - pow(-phi, input)) / sqrt(5));
        } else {
          mask.SetInvalid(idx);
          return 0l;
        }
      });
}

// ...

static void LoadInternal(DatabaseInstance &instance) {
  // ...

  auto fibonacci_scalar_function =
      ScalarFunction("fibonacci", {LogicalType::INTEGER}, LogicalType::BIGINT,
                     FibonacciScalarFun);
  ExtensionUtil::RegisterFunction(instance, fibonacci_scalar_function);

  // ...
}

D select fibonacci(5), fibonacci(-3), fibonacci(100);
┌──────────────┬───────────────┬────────────────┐
│ fibonacci(5) │ fibonacci(-3) │ fibonacci(100) │
│    int64     │     int64     │     int64      │
├──────────────┼───────────────┼────────────────┤
│      5       │     NULL      │      NULL      │
└──────────────┴───────────────┴────────────────┘

Setting the result to NULL only takes a call of mask.SetInvalid(idx).

(Note that if you want to handle NULL as input differently than “NULL in, NULL out”, neither Execute nor ExecuteWithNulls can help and you’ll need to operate on the vectors directly.)

Nested types and GenericExecutor

The discussed ...Executor types so far do not provide a means to return nested types, such as STRUCT or LIST. That is where the GenericExecutor comes into play.

Returning a STRUCT or LIST

We’re going to write a function solve_quadratic_equation(DOUBLE, DOUBLE, DOUBLE) -> STRUCT(x1 DOUBLE, x2 DOUBLE) that returns the solutions to the quadratic equation ax^2 + bx + c = 0. The structure looks similar to what we’ve seen before, but pay attention to the template parameters specifying the input and result types:

#include "duckdb/common/vector_operations/generic_executor.hpp"

// ...

inline void SolveQuadraticEquationScalarFunc(DataChunk &args,
                                             ExpressionState &state,
                                             Vector &result) {
  auto &a_vector = args.data[0];
  auto &b_vector = args.data[1];
  auto &c_vector = args.data[2];

  GenericExecutor::ExecuteTernary<
      PrimitiveType<double>, PrimitiveType<double>,
      PrimitiveType<double>, StructTypeBinary<double, double>>(
      a_vector, b_vector, c_vector, result, args.size(),
      [&](PrimitiveType<double> a, PrimitiveType<double> b,
          PrimitiveType<double> c) {
        auto discriminant = b.val * b.val - 4 * a.val * c.val;
        StructTypeBinary<double, double> solution;
        solution.a_val = (-b.val + sqrt(discriminant)) / (2 * a.val);
        solution.b_val = (-b.val - sqrt(discriminant)) / (2 * a.val);
        return solution;
      });
}

static void LoadInternal(DatabaseInstance &instance) {
  // ...

  child_list_t<LogicalType> quadratic_equation_solution_child_types;
  quadratic_equation_solution_child_types.push_back(
      std::make_pair("x1", LogicalType::DOUBLE));
  quadratic_equation_solution_child_types.push_back(
      std::make_pair("x2", LogicalType::DOUBLE));
  auto solve_quadratic_equation_scalar_function = ScalarFunction(
      "solve_quadratic_equation",
      {LogicalType::DOUBLE, LogicalType::DOUBLE, LogicalType::DOUBLE},
      LogicalType::STRUCT(quadratic_equation_solution_child_types),
      SolveQuadraticEquationScalarFunc);
  ExtensionUtil::RegisterFunction(instance,
                                  solve_quadratic_equation_scalar_function);

  // ...
}

Trying it out:

D select solve_quadratic_equation(1, -7, 12);
┌─────────────────────────────────────┐
│ solve_quadratic_equation(1, -7, 12) │
│    struct(x1 double, x2 double)     │
├─────────────────────────────────────┤
│ {'x1': 4.0, 'x2': 3.0}              │
└─────────────────────────────────────┘

Similar to TernaryExecutor::Execute, GenericExecutor::ExecuteTernary takes four template parameters, three for the input types and one for the output type. They look a bit different though: PrimitiveType<INPUT_TYPE> represents a primitive type, and StructTypeBinary<A_TYPE, B_TYPE> represents a STRUCT type with two fields. PrimitiveType has a val field containing the underlying value, and StructTypeBinary has an a_val and b_val field for the two values in the struct (which, to be clear, have nothing to do with our a and b input values). Similar types also exist for STRUCTs with different arities.

The registration of the function in LoadInternal is mostly the same as for the other functions, only the description of the result type is more extensive. The names and types of the STRUCT’s fields have to be constructed as a child_list_t<LogicalType> and passed to LogicalType::STRUCT.

Returning a LIST is very similar: instead of StructTypeBinary, use GenericListType<CHILD_TYPE>, which has a values field that is an std::vector of instances of the child type. As LogicalType in the ScalarFunction constructor, use LogicalType::LIST(<child LogicalType>).

Custom result type with AssignResult

THe implementation of solve_quadratic_equation is definitely not perfect. If we apply it on an equation with no solutions, e.g. solve_quadratic_equation(1, 2, 3), the result is {'x1': -nan, 'x2': -nan}. And if we choose 0 for the first argument, the result is {'x1': -nan, 'x2': -inf}, while the equation actually became a linear equation with x = -c / b as solution (assuming b isn’t 0 as well).

Let’s say we want to return NULL in those situations, rather than the struct. GenericExecutor does not have an equivalent of the ExecuteWithNulls method, but we can take another approach. Instead of choosing StructTypeBinary as result type, we can create our own type. This type must have a static method AssignResult that assigns an instance to the result vector. We can use StructTypeBinary’s implementation of this method as inspiration for our custom type:

struct QuadraticEquationSolution {
  double x1;
  double x2;
  bool exists;

  static void AssignResult(Vector &result, idx_t i,
                           QuadraticEquationSolution solution) {
    auto &entries = StructVector::GetEntries(result);
    if (!solution.exists) {
      FlatVector::SetNull(result, i, true);
    } else {
      FlatVector::GetData<double>(*entries[0])[i] = solution.x1;
      FlatVector::GetData<double>(*entries[1])[i] = solution.x2;
    }
  }
};

inline void SolveQuadraticEquation2ScalarFunc(DataChunk &args,
                                              ExpressionState &state,
                                              Vector &result) {
  auto &a_vector = args.data[0];
  auto &b_vector = args.data[1];
  auto &c_vector = args.data[2];

  GenericExecutor::ExecuteTernary<
      PrimitiveType<double>, PrimitiveType<double>,
      PrimitiveType<double>, QuadraticEquationSolution>(
      a_vector, b_vector, c_vector, result, args.size(),
      [&](PrimitiveType<double> a, PrimitiveType<double> b,
          PrimitiveType<double> c) {
        auto discriminant = b.val * b.val - 4 * a.val * c.val;
        QuadraticEquationSolution solution;
        if (discriminant >= 0) {
          solution.exists = true;
          solution.x1 = (-b.val + sqrt(discriminant)) / (2 * a.val);
          solution.x2 = (-b.val - sqrt(discriminant)) / (2 * a.val);
        } else {
          solution.exists = false;
        }
        return solution;
      });
}

// ...

static void LoadInternal(DatabaseInstance &instance) {
  // ...

  auto solve_quadratic_equation_scalar_function2 = ScalarFunction(
      "solve_quadratic_equation2",
      {LogicalType::DOUBLE, LogicalType::DOUBLE, LogicalType::DOUBLE},
      LogicalType::STRUCT(quadratic_equation_solution_child_types),
      SolveQuadraticEquation2ScalarFunc);
  ExtensionUtil::RegisterFunction(instance,
                                  solve_quadratic_equation_scalar_function2);

  // ...
}

Taking a STRUCT as input argument

StructTypeBinary (and the other StructType... types) cannot only be used as result type, but also as input type. Let’s implement a quadratic_equation_from_solution(STRUCT(x1 DOUBLE, x2 DOUBLE)) -> VARCHAR function that outputs a quadratic equation for which x1 and x2 are solutions:

inline void QuadraticEquationFromSolutionScalarFunc(DataChunk &args,
                                                    ExpressionState &state,
                                                    Vector &result) {
  auto &solution_vector = args.data[0];

  GenericExecutor::ExecuteUnary<StructTypeBinary<double, double>,
                                PrimitiveType<string_t>>(
      solution_vector, result, args.size(),
      [&](StructTypeBinary<double, double> solution) {
        double x1 = solution.a_val;
        double x2 = solution.b_val;

        double b = -x1 - x2;
        double c = x1 * x2;
        return StringVector::AddString(result, "x^2 + " + to_string(b) +
                                                   "x + " + to_string(c));
      });
}

// ...

static void LoadInternal(DatabaseInstance &instance) {
  // ...

  auto quadratic_equation_from_solution_scalar_function = ScalarFunction(
      "quadratic_equation_from_solution",
      {LogicalType::STRUCT(quadratic_equation_solution_child_types)},
      LogicalType::VARCHAR, QuadraticEquationFromSolutionScalarFunc);
  ExtensionUtil::RegisterFunction(
      instance, quadratic_equation_from_solution_scalar_function);
}

D select quadratic_equation_from_solution({ 'x1': 4, 'x2': 3});
┌──────────────────────────────────────────────────────────────────────┐
│ quadratic_equation_from_solution(main.struct_pack(x1 := 4, x2 := 3)) │
│                               varchar                                │
├──────────────────────────────────────────────────────────────────────┤
│ x^2 + -7.000000x + 12.000000                                         │
└──────────────────────────────────────────────────────────────────────┘

Any type that has a static ConstructType method (see for example the StructTypeBinary implementation) can be used to represent an input type (which, as you can see, is more complicated than implementing AssignResult). Also note that while GenericListType implements AssignResult, it currently does not implement ConstructType.

Where to find more examples

That’s it for now! This post still only scratches the surface of scalar functions in DuckDB – as demonstrated by the extensive constructor of ScalarFunction. Nevertheless, this should be enough to get you started with writing your own extensions with scalar functions! All examples in this post are bundled in a GitHub repository. You can find more examples in these places:

The DuckDB source code (specifically extension/core_functions/scalar and src/function/scalar) contains the implementation of DuckDB’s built-in functions.
The spatial extension implements many scalar functions.
A bit beyond the scope of scalar functions, but the ulid extension is a great reference extension if you want to see how you can add a custom datatype and associated functions.

If you have any feedback, feel free to open an issue on the GitHub repository or contact me on Bluesky.