
Transform Scripts

automating qri dataset version creation & maintenance


We want to build better tools for maintaining a catalog. We make things "better" in two ways:

  1. Scale up the amount of data a single data publisher can effectively manage.
  2. Make it easier to become a data publisher.

A data publisher is anyone who produces a dataset that someone else uses. Data publishers have at least some degree of domain-specific expertise: they know how to construct a dataset on their chosen subject matter from first principles. This expertise often places them in a position of serving many downstream users.

The primary mechanism for scaling up datasets-under-management is automation. In an automated data catalog, data publishers write programs that tell computers how and when to update datasets. We make it easier to become a data publisher by turning the programs data authors write into sharable pieces that others can learn from.

Design Goals

Qri's programming environment is optimized for the following characteristics:

  • Familiar. Syntax should match tools data publishers are already familiar with.
  • Portable. Programs need to be easy to deploy. Transform scripts need to run on many computing hosts: local computers, cloud servers & in the browser. Programs need to be able to read & write state they depend on for execution from all of these environments.
  • Predictable. Primary and side effects should be cheap to compute through static analysis.
  • Safe. Adding a sandbox to programs that are portable and predictable rounds out our definition of safety.

Transform Scripts

A transform script is a sequence of one or more programs. Each program is called a step; a step takes a dataset as input, modifies (transforms) it, and returns a dataset as its result. Steps are chained together, passing the output of the prior step in the sequence as the input of the next. The output of a transform script is the output of the final step in the sequence. The execution model of a transform script resembles a reducer function:

// Steps accept a dataset, return a dataset
type Step = (ds: Dataset) => Dataset
// Scripts are a sequence of steps
type Script = Step[]

// the result of a script is the fold of its steps over the prior version
const result = script.reduce((ds, step) => step(ds), previous)

When a transform script is executed, the script is passed the prior version of a dataset as input, and the output is persisted as the next version in the dataset's history. When no prior version exists, the script is passed an empty identity dataset.
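The execution model above can be sketched in Python. The helper name run_script and the shape of the empty identity dataset are illustrative stand-ins, not Qri's actual API:

```python
# Sketch of the reducer-style execution model, assuming datasets are plain
# dicts. "run_script" and EMPTY_DATASET are hypothetical names.
from functools import reduce

EMPTY_DATASET = {"body": []}  # stand-in for the empty identity dataset

def run_script(steps, prior=None):
    """Fold a sequence of step functions over the prior dataset version."""
    initial = prior if prior is not None else EMPTY_DATASET
    return reduce(lambda ds, step: step(ds), steps, initial)

# two toy steps: set a body, then append a row
set_body = lambda ds: dict(ds, body=[[1]])
append_row = lambda ds: dict(ds, body=ds["body"] + [[2]])

result = run_script([set_body, append_row])  # {"body": [[1], [2]]}
```

Note that when no prior version exists, the fold simply starts from the empty identity dataset, so the same code path handles both first runs and updates.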

A transform script is stored as a component of the dataset it is executed upon, binding the script to the data it creates. A dataset with a transform component can be automated by adding the dataset to a scheduler that re-executes the script.


Each transform step has a syntax key that allows switching syntaxes on a per-step basis. For the time being, however, the only planned syntax is Starlark.

Starlark gives us familiar Python syntax, better control over execution, and more opportunities for static analysis. For more details on Qri Starlark, see here.


The header node needs to record the syntax, runtime version, and runtime commit hash. Dependencies are hoisted into the header node and declared in a map.

// TransformStep is a unit of operation in a transform script
type TransformStep struct {
	Name        string      `json:"name"`           // human-readable name for the step, used for display purposes
	Description string      `json:"description"`    // human-readable description
	Syntax      string      `json:"syntax"`         // execution environment, eg: "starlark", "qri-sql"
	Path        string      `json:"path,omitempty"` // path to this step if persisted separately on disk
	Script      interface{} `json:"script"`         // input text for transform step. often code, or a query.
	// Resources is a map of all datasets referenced in this step, with
	// alphabetical keys generated by datasets in order of appearance within the
	// transform
	Resources map[string]TransformResource `json:"resources,omitempty"`
}

// TransformResource describes an external dataset a step depends on
type TransformResource struct {
	Type    string `json:"type"`
	Version string `json:"version"`
	Path    string `json:"path"`
}
Step Caching

Expensive or hard-to-repeat steps can be cached by executing the step once and replacing subsequent calls with the result of the initial execution.

Through static analysis, qri can infer caching opportunities on behalf of the user and optimistically cache for them. Users can also opt into caching to avoid repeatedly performing expensive operations. If a step has no external drivers of side effects (eg: one or more HTTP calls), the step script has not semantically changed, and the input dataset remains the same, qri can optimistically cache.

Opting into caching is an inversion of the common default in data science platforms like ipython notebooks, which cache by default and require manual re-execution. In practice, cache-by-default is a source of confusion for users, who must remember to re-execute cells after they have changed.
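The caching rule above can be sketched as a content-addressed lookup: a step's result is reused only when neither its script text nor its input dataset has changed. The sha256-over-JSON cache key below is an assumption for illustration, not Qri's actual scheme:

```python
# Minimal step-cache sketch: key = hash(script text + input dataset).
# The key format is hypothetical.
import hashlib
import json

_cache = {}

def cache_key(script_text, input_ds):
    payload = json.dumps({"script": script_text, "input": input_ds},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def run_step_cached(script_text, step_fn, input_ds):
    key = cache_key(script_text, input_ds)
    if key not in _cache:
        _cache[key] = step_fn(input_ds)  # the expensive execution happens once
    return _cache[key]
```

Because any change to the script text or the input dataset produces a new key, edits always force re-execution; this is the opposite failure mode of cache-by-default notebooks.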


Static Analysis

One of the prime motivators for choosing Starlark is opening the door to static analysis.

Program counter

The primary mechanism for bounding the execution of a script step is placing a hard upper limit on its program counter.

We want to avoid depending on wall-clock time to bound execution: dataset transform scripts take arbitrary amounts of time to run, and wall-clock duration is highly dependent on the speed of the host machine.
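The idea can be sketched as an interpreter loop that charges a fixed "tick" per operation and aborts deterministically once the budget is exhausted. The helper run_bounded and its one-tick-per-operation accounting are hypothetical; the point is that the limit behaves identically on fast and slow hosts:

```python
# Sketch: bound execution by a program-counter budget, not wall-clock time.
class BudgetExceeded(Exception):
    pass

def run_bounded(ops, max_ticks):
    """Execute a sequence of zero-argument operations under a hard tick budget."""
    results = []
    for ticks, op in enumerate(ops):
        if ticks >= max_ticks:
            raise BudgetExceeded("exceeded %d ticks" % max_ticks)
        results.append(op())
    return results
```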


Input File Chunking

A starlark transform script can be provided as a plaintext file, chunked into steps by a line of three dashes: ---

# title: identity function
# description: this step does nothing

def transform(ds):
  return ds

---

# syntax: starlark
load("time/time.star", "time")

def transform(ds):
  ds.body = [[time.now()]]
  return ds

The above input file would be interpreted into the following transform component:

  {
    "qri": "tf:0",
    "steps": [
      {
        "name": "identity function",
        "description": "this step does nothing",
        "syntax": "starlark",
        "script": "def transform(ds):\n  return ds"
      },
      {
        "syntax": "starlark",
        "script": "load(\"time/time.star\", \"time\")\n\ndef transform(ds):\n  ds.body = [[time.now()]]\n  return ds"
      }
    ]
  }


Steps are referred to by their zero-based index in technical discussions, but will be 1-indexed in user interfaces.

Some notes:

  1. The default syntax is assumed to be starlark (step 0 omits the syntax comment).
  2. Chunk syntax can differ between steps. Explicitly set it with a "# syntax:" comment, as in step 1.
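The chunking rule above can be sketched as a small parser: split the file on lines of three dashes and read leading "# key: value" comments as step metadata. The parsing details, and the mapping of the "title" comment to the step's name field, are assumptions, not Qri's actual implementation:

```python
# Hypothetical chunker for plaintext transform script files.
def chunk_script(source, default_syntax="starlark"):
    steps = []
    for chunk in source.split("\n---\n"):
        meta = {"syntax": default_syntax}
        lines = chunk.strip().splitlines()
        body_start = 0
        for i, line in enumerate(lines):
            # leading "# key: value" comments become step metadata
            if line.startswith("# ") and ":" in line:
                key, _, value = line[2:].partition(":")
                key = "name" if key.strip() == "title" else key.strip()
                meta[key] = value.strip()
                body_start = i + 1
            else:
                break
        meta["script"] = "\n".join(lines[body_start:]).strip()
        steps.append(meta)
    return steps
```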


Scripts within a private dataset are encrypted at rest.

User Transform library

Each user writing scripts that transform data has a “library” of the transform scripts they have written.

Notes & Errata

Unresolved issues

  • Intersection with Object Capability Model
  • Intersection with wnfs dag layout: Treat each step as a header block, script as a file. This’ll support independent versioning across steps.

Future Research

  • Observable’s import syntax
  • Deno’s import memoization rules
  • Go dependency graphs, go mod & sum syntax