Data Loading with DuckDB
Looking under the hood, this page provides guidance for using DuckDB in Framework data loaders and deploying it within GitHub Actions.
Using DuckDB in Data Loaders
The NYC Taxi Rides and Gaia Star Catalog examples use data loaders to perform data preparation, generating pre-projected data and writing it to a Parquet file.
The shell script below loads taxi data using the command line interface to DuckDB. The `duckdb` executable must be on your environment path... we'll come back to that later!
```sh
duckdb :memory: << EOF
-- Load spatial extension
INSTALL spatial; LOAD spatial;
-- Project, following the example at https://github.com/duckdb/duckdb_spatial
CREATE TEMP TABLE rides AS SELECT
  pickup_datetime::TIMESTAMP AS datetime,
  ST_Transform(ST_Point(pickup_latitude, pickup_longitude), 'EPSG:4326', 'ESRI:102718') AS pick,
  ST_Transform(ST_Point(dropoff_latitude, dropoff_longitude), 'EPSG:4326', 'ESRI:102718') AS drop
FROM 'https://uwdata.github.io/mosaic-datasets/data/nyc-rides-2010.parquet';
-- Write output parquet file
COPY (SELECT
  HOUR(datetime) + MINUTE(datetime) / 60 AS time,
  ST_X(pick)::INTEGER AS px,  -- extract pickup x-coord
  ST_Y(pick)::INTEGER AS py,  -- extract pickup y-coord
  ST_X(drop)::INTEGER AS dx,  -- extract dropoff x-coord
  ST_Y(drop)::INTEGER AS dy   -- extract dropoff y-coord
FROM rides) TO 'trips.parquet' WITH (FORMAT PARQUET);
EOF

cat trips.parquet >&1  # Write output to stdout
rm trips.parquet       # Clean up
```
We invoke DuckDB with the `:memory:` argument to indicate an in-memory database. We also use the `<< EOF` shell script syntax to provide multi-line input, consisting of the desired SQL queries to run. The last query (`COPY ...`) writes a Parquet file to disk. However, Observable Framework requires that we instead write data to `stdout`. On some platforms we can do this by writing to the file descriptor `/dev/stdout`. However, this file does not exist on all platforms, including in GitHub Actions, where such a query will fail. So we complete the script with two additional commands:
- Write (`cat`) the bytes of the Parquet file to `stdout`.
- Remove (`rm`) the generated file, as we no longer need it.
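To sanity-check the loader before deploying, you can run it by hand and confirm the emitted bytes form a valid Parquet file. The sketch below is illustrative only: the path `docs/data/trips.parquet.sh` is a hypothetical location for the script above (Framework derives the served file name, here `trips.parquet`, from the loader's file name), and the query simply counts the rows DuckDB can read back. Note that the loader streams the source data from a remote URL, so this may take a moment.

```sh
# Hypothetical local check; the file paths are assumptions, not part of this example.
sh docs/data/trips.parquet.sh > /tmp/trips.parquet
duckdb :memory: "SELECT count(*) AS rows FROM '/tmp/trips.parquet'"
```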
Using DuckDB in GitHub Actions
To deploy our Observable Framework site on GitHub, we use a GitHub Actions workflow. As noted earlier, one issue when running in GitHub Actions is the lack of file-based access to `stdout`. But another, even more basic, issue is that we need to have DuckDB installed! This snippet installs DuckDB within a workflow. We download a zip file of the official release, unpack it, copy the `duckdb` executable to `/opt/duckdb`, and then link to `duckdb` in the directory `/usr/bin`, ensuring it is accessible to subsequent scripts:
```yaml
steps:
  - name: Install DuckDB CLI
    run: |
      wget https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip
      unzip duckdb_cli-linux-amd64.zip
      mkdir /opt/duckdb && mv duckdb /opt/duckdb && chmod +x /opt/duckdb/duckdb && sudo ln -s /opt/duckdb/duckdb /usr/bin/duckdb
      rm duckdb_cli-linux-amd64.zip
```
We perform installation before the site build steps, ensuring `duckdb` is ready to go.
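If you want the workflow to fail fast when the download or symlink step goes wrong, a later `run` step can simply ask the CLI to identify itself. This is a minimal sketch of an optional extra check, not part of the workflow above:

```sh
# Optional sanity check (an assumed extra step, not in the original workflow):
# confirm the executable resolves on the PATH and reports the pinned release.
which duckdb        # expect /usr/bin/duckdb
duckdb --version    # expect the v0.10.0 release downloaded above
```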