Transform Scripts

automating qri dataset version creation & maintenance


Introduction

We want to build better tools for maintaining a catalog. We make things "better" in two ways:

  1. Scale up the amount of data a single data publisher can effectively manage.
  2. Make it easier to become a data publisher.

A data publisher is anyone who produces a dataset that someone else uses. Data publishers have at least some degree of domain-specific expertise: they know how to construct a dataset on their chosen subject matter from first principles. This expertise often places them in the position of serving many downstream users.

The primary mechanism for scaling up datasets-under-management is automation. In an automated data catalog, data publishers write programs that tell computers how and when to update datasets. We make it easier to become a data publisher by turning the programs that data publishers write into shareable pieces that others can learn from.

Design Goals

Qri's programming environment is optimized for the following characteristics:

  • Familiar. Syntax should match tools data publishers are already familiar with.
  • Portable. Programs need to be easy to deploy. Transform scripts need to run on many computing hosts: local computers, cloud servers & in the browser. Programs need to be able to read & write the state they depend on for execution from all of these environments.
  • Predictable. Primary effects and side effects should be cheap to determine through static analysis.
  • Safe. Adding a sandbox to programs that are portable and predictable rounds out our definition of safety.

Transform Scripts

A transform script is a sequence of one or more programs. Each program is called a step, and takes a dataset as input and modifies (transforms) the input dataset, returning a dataset as the result. Steps are chained together, passing output of the prior step in the sequence as the input of the next. The output of a transform script is the output of the final step in the sequence. The execution model of a transform script resembles a reducer function:

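// Dataset is left abstract here; any dataset representation works
type Dataset = Record<string, unknown>
// Steps accept a dataset, return a dataset
type Step = (ds: Dataset) => Dataset
// Scripts are a sequence of steps
type Script = Step[]

// the result of a script is the prior version folded through each step in order
const run = (script: Script, previous: Dataset): Dataset =>
  script.reduce((ds, step) => step(ds), previous)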

When a transform script is executed, the script is passed the prior version of a dataset as input, and the output is persisted as the next version in a dataset's history. When no prior version exists the script is passed an empty identity dataset.
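The execution model described above, including the fallback to an empty identity dataset, can be sketched in a few lines. This is a minimal illustration in Python (chosen because Starlark is a Python dialect), not Qri's implementation: representing a dataset as a plain dict and the name run_script are assumptions made for the example.

```python
from functools import reduce


def run_script(steps, previous=None):
    """Fold a dataset through each step in order and return the final dataset.

    `steps` is a list of functions that take a dataset and return a dataset.
    `previous` is the prior version; a dict stands in for a dataset here
    (an assumption for illustration only).
    """
    # When no prior version exists, start from an empty identity dataset.
    start = previous if previous is not None else {}
    return reduce(lambda ds, step: step(ds), steps, start)


# Two hypothetical steps: one sets the body, one adds metadata.
def set_body(ds):
    return {**ds, "body": [[1, 2, 3]]}


def set_title(ds):
    return {**ds, "meta": {"title": "example"}}


next_version = run_script([set_body, set_title])
```

The output of run_script is what would be persisted as the next version in the dataset's history; passing that value back in as previous models the next scheduled run.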

A transform script is stored within a component of the dataset it is executed upon, binding the script to the data it created. A dataset with a transform component can be automated by adding the dataset to a scheduler that re-executes the script.

Starlark

Each transform step has a syntax key that allows switching syntaxes on a per-step basis. For the time being, however, the only planned syntax is Starlark.

Starlark gives us familiar Python syntax, better control over execution & more opportunities for static analysis. For more details on Qri Starlark, see here.

Steps

The header node needs to record the syntax, runtime version, and runtime commit hash. Dependencies are hoisted into the header node and declared in a map.

// TransformStep is a unit of operation in a transform script
type TransformStep struct {
	Name        string      `json:"name"`           // human-readable name for the step, used for display purposes
	Description string      `json:"description"`    // human-readable description
	Syntax      string      `json:"syntax"`         // execution environment, eg: "starlark", "qri-sql"
	Path        string      `json:"path,omitempty"` // path to this step if persisted separately on disk
	Script      interface{} `json:"script"`         // input text for transform step. often code, or a query.
	// Resources is a map of all datasets referenced in this step, with
	// alphabetical keys generated by datasets in order of appearance within the
	// transform
	Resources map[string]TransformResource `json:"resources,omitempty"`
}

// TransformResource describes a dataset referenced by a transform step
type TransformResource struct {
	Type    string
	Version string
	Path    string
}
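The comment on Resources says keys are alphabetical and assigned in order of appearance. One plausible reading of that rule is sketched below in Python. The function names, the dataset references, and the choice to continue with "aa", "ab" after "z" are all assumptions for illustration; the source does not specify what happens past 26 resources.

```python
def alpha_key(index):
    """Return the alphabetical key for a zero-based position.

    0 -> "a", 25 -> "z", 26 -> "aa" (spreadsheet-style; an assumed scheme).
    """
    key = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        key = chr(ord("a") + rem) + key
    return key


def resource_keys(refs):
    """Map dataset references to keys in order of first appearance."""
    resources = {}
    seen = set()
    for ref in refs:
        if ref in seen:
            continue  # a dataset referenced twice keeps its first key
        seen.add(ref)
        resources[alpha_key(len(resources))] = ref
    return resources


# Hypothetical dataset references, listed in the order a step mentions them.
keys = resource_keys(["me/population", "me/airports", "me/population"])
```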

Step Caching

Expensive or hard-to-repeat steps can be cached by executing the step once and replacing subsequent calls with the result of the initial execution.

Through static analysis qri can infer caching opportunities on behalf of the user and optimistically cache for them. Users can also opt into caching to avoid repeatedly performing expensive operations. If a step has no external drivers of side effects (e.g. one or more HTTP calls), the step script has not semantically changed, and the input dataset remains the same, qri can optimistically cache.
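The three conditions for optimistic caching can be combined into a single cache key. The sketch below is a rough Python illustration, not Qri's implementation. It uses Python's ast module as a stand-in for a Starlark parser, so that edits touching only comments or whitespace do not count as a semantic change. The function names and the idea of passing the input dataset as a precomputed hash string are assumptions.

```python
import ast
import hashlib


def script_fingerprint(source):
    """Hash the parsed form of a script, ignoring comments and formatting."""
    # ast.dump omits line and column attributes by default, so two scripts
    # that differ only in comments or whitespace produce the same dump.
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def cache_key(source, input_hash, has_external_effects):
    """Return a cache key for a step, or None when the step must not be cached."""
    if has_external_effects:
        # e.g. the step makes HTTP calls, so its output can change between runs
        return None
    payload = script_fingerprint(source) + ":" + input_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = "def transform(ds):\n  return ds\n"
reformatted = "# identity step\ndef transform(ds):\n  return ds  # unchanged\n"
```

Here original and reformatted map to the same key for the same input hash, so a cached result would be reused across a comment-only edit.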

Opting into caching is an inversion of the common default in data science platforms like IPython notebooks, which cache by default and require explicit re-execution. In practice cache-by-default is a source of confusion for users, who must remember to re-execute cells after they have changed.

Dependencies

Static Analysis

One of the prime motivators for choosing Starlark is opening the door to static analysis.
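As a small illustration of what static analysis makes possible, the sketch below lists the modules a script loads without executing it. It is written in Python and parses the script with Python's ast module, which works for simple load(...) statements because they are also valid Python call expressions; a real implementation would use a Starlark parser. The function names and the module name "http.star" used to flag side effects are assumptions.

```python
import ast


def loaded_modules(source):
    """Return the sorted module paths named in top-level-style load(...) calls."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "load"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            modules.add(node.args[0].value)
    return sorted(modules)


def has_external_effects(source, effectful=("http.star",)):
    """Report whether the script loads any module assumed to cause side effects."""
    return any(module in effectful for module in loaded_modules(source))


script = (
    'load("time/time.star", "time")\n'
    "\n"
    "def transform(ds):\n"
    "  ds.body = [[time.now()]]\n"
    "  return ds\n"
)
deps = loaded_modules(script)
```

A list like deps is the kind of information that could be hoisted into a step header as declared dependencies.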

Program counter

The primary mechanism for bounding the execution of a script step is placing a hard upper limit on its program counter.
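The idea of a step budget can be demonstrated with a loose analogue in Python. The sketch below counts executed source lines using sys.settrace and aborts once a limit is passed. Counting lines is only an approximation of an interpreter-level program counter, and the names run_bounded and StepLimitExceeded are hypothetical; a Starlark runtime would enforce the limit inside the interpreter itself.

```python
import sys


class StepLimitExceeded(Exception):
    """Raised when a function executes more lines than its budget allows."""


def run_bounded(fn, max_steps, *args):
    """Call fn(*args), aborting once more than max_steps lines have executed."""
    steps = 0

    def tracer(frame, event, arg):
        nonlocal steps
        if event == "line":
            steps += 1
            if steps > max_steps:
                raise StepLimitExceeded("exceeded %d steps" % max_steps)
        return tracer  # keep tracing inside this frame and nested calls

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        return fn(*args)
    finally:
        sys.settrace(previous)  # always restore the prior trace function


def spin():
    n = 0
    while True:  # never terminates on its own
        n += 1
```

Calling run_bounded(spin, 1000) raises StepLimitExceeded after a fixed amount of work, regardless of how fast or slow the host machine is.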

We want to avoid depending on wall-clock time to bound execution: dataset transform scripts take arbitrary amounts of time to run, and are highly dependent on the

Persistence

Input File Chunking

A Starlark transform script can be provided as a plaintext file, split into steps by lines consisting of three dashes (---):

---
# title: identity function
# description: this step does nothing

def transform(ds):
  return ds
---
# syntax: starlark
# 
load("time/time.star", "time")

def transform(ds):
  ds.body = [[time.now()]]
  return ds
---

The above input file would be interpreted as the following transform component:

{
  "qri": "tf:0",
  "steps": [
    {
      "syntax": "starlark"
    }
  ]
}

Steps are referred to by their zero-based index in technical discussions, but will be 1-indexed in user interfaces.

Some notes:

  1. The default syntax is assumed to be Starlark (shown in step 0).
  2. Chunk syntax can differ between steps. Explicitly set it with

Privacy

Scripts within a private dataset are encrypted at rest.

User Transform library

Users writing scripts that transform data have a

Each user has a “library”

Notes & Errata

Unresolved issues

  • Intersection with Object Capability Model
  • Intersection with wnfs dag layout: Treat each step as a header block, script as a file. This’ll support independent versioning across steps.

Future Research

  • Observable’s import syntax
  • Deno’s import memoization rules
  • Go dependency graphs, go mod & sum syntax