Qri

Components

Dataset Elements


A Dataset is broken into several components. Each component has a different purpose:

| component | purpose |
|-----------|---------|
| body | location of the dataset's data; the subject that all other components describe |
| readme | free-form text describing the dataset, supports markdown |
| commit | versioning information for this dataset at a specific point in time |
| meta | descriptive metadata |
| structure | machine-oriented metadata for interpreting body |
| transform | description of an executed script that resulted in this dataset |

Each component is described in detail below.

Body

Body is the principal content of a dataset. A dataset body is the subject that all other fields describe and qualify.

Supported Data Formats:

  • csv - comma-separated values
  • json - JavaScript Object Notation
  • xlsx - Microsoft Excel Open XML spreadsheet
  • cbor - concise binary object representation

The structure of the stored data is arbitrary, with one important exception: the top level of body must be either an object or an array. Scalar types like int, bool, float, or string are not valid at the top level. It is perfectly valid, however, to wrap a scalar value (for example, a string) in an array to obtain a valid body.
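The top-level rule can be checked before a body is saved. The following is a minimal sketch for a JSON body; `is_valid_body` is a hypothetical helper written for illustration and is not part of qri.

```python
import json


def is_valid_body(raw: str) -> bool:
    """Return True when JSON text parses to an object or an array at the top level.

    Hypothetical helper for illustration; not a qri API.
    """
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # json.loads maps JSON objects to dict and JSON arrays to list.
    return isinstance(value, (dict, list))


print(is_valid_body('{"city": "toronto"}'))  # True: top level is an object
print(is_valid_body('[1, 2, 3]'))            # True: top level is an array
print(is_valid_body('"just a string"'))      # False: top level is a scalar
print(is_valid_body('["just a string"]'))    # True: scalar wrapped in an array
```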

Commit

Commit encapsulates information about changes to a dataset in relation to other entries in a given history. Commit is directly analogous to the concept of a Commit Message in the git version control system. A full commit defines the administrative metadata of a dataset, answering "who made this dataset, when, and why".

commit fields:

| name | type | description |
|------|------|-------------|
| author | User | author of this commit |
| message | string | an optional message that provides detail about changes made |
| qri | string | this commit's qri kind, value should always be cm:0 |
| signature | string | base58-encoded bytes of body checksum |
| timestamp | string | time this dataset was created with timezone offset |
| title | string | title of the commit. should be a short description of |

additional data types:

User

example:

  {
    "id": "user_id_12340584",
    "fullname": "sean carter",
    "email": "hova@jayz.com"
  }
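To show how the commit fields fit together, here is a small sketch that assembles a commit as a plain Python dictionary. `new_commit` is a hypothetical helper, the sample values are made up, and the ISO 8601 timestamp format is an assumption (the field list only requires a timezone offset). The signature field is left out because it is derived from the body checksum.

```python
import json
from datetime import datetime, timezone


def new_commit(title: str, message: str, author: dict) -> dict:
    """Build a commit-shaped dict using the field names listed above.

    Hypothetical helper for illustration; not a qri API.
    """
    return {
        "qri": "cm:0",  # the commit kind, always cm:0 per the field list
        "title": title,
        "message": message,
        "author": author,
        # Assumption: ISO 8601 text with an explicit UTC offset.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


commit = new_commit(
    title="add 2018 rows",
    message="appended rows from the 2018 survey",
    author={"id": "user_id_12340584", "fullname": "sean carter", "email": "hova@jayz.com"},
)
print(json.dumps(commit, indent=2))
```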

Meta

Meta contains human-readable descriptive metadata that qualifies and distinguishes a dataset. Well-defined Meta should aid in making datasets Findable by describing a dataset in generalizable taxonomies that can aggregate across other dataset documents. Because dataset documents are intended to interoperate with many other data storage and cataloging systems, meta fields and conventions are derived from existing metadata formats whenever possible.

All of the meta fields below must be well-formed, valid values. However, the Meta section of a dataset also supports arbitrary metadata. This means you can place additional values not listed here, and qri will store them as-is, without any additional validation.

Another important note: qri software doesn't leverage fields like identifier, downloadPath, homePath, etc. Qri considers all fields here descriptive metadata, as opposed to structural metadata. In practice this means qri user interfaces may leverage the meta component for the purposes of correlation and display. Information stored in the meta section is not intended for use by machines to interpret the dataset. For example, setting the identifier is of no significance to qri; it's included here for interoperability with other specifications like DCAT.

meta fields:

| name | type | description |
|------|------|-------------|
| accessPath | string | URL or location to access this dataset |
| accrualPeriodicity | string | frequency with which dataset changes. Must be an ISO 8601 repeating duration |
| citations | []Citation | array of assets used to build this dataset |
| contributors | []User | description |
| description | string | roughly a paragraph of human-readable text that provides context for the dataset |
| downloadPath | string | URL or other path string to where to download this dataset |
| homePath | string | URL or other path string to a "landing page" resource that explains the dataset |
| identifier | string | identifier for the dataset |
| keywords | []string | array of "tags" to connect this dataset with other datasets that carry similar keywords |
| language | []string | array of languages this dataset is written in, in ISO 639-1 format, ordered by most-to-least dominant |
| license | License | the legal licensing agreement this dataset is released under |
| title | string | title of the dataset |
| theme | []string | "categories" this dataset should be filed under. Keywords should draw out specific keywords, theme entries should speak to the broader family of datasets this dataset is part of |
| version | string | version identifier string |

additional data types:

Citation

example:

  {
    "name" : "sean carter",
    "url" : "https://jayz.com",
    "email" : "hova@jayz.com"
  }

License

example:

  {
    "type" : "CC-BY-2",
    "url" : "https://creativecommons.org/licenses/by/2.0/"
  }
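Because meta accepts both the listed fields and arbitrary extra values, it can help to see which keys fall into which group. The sketch below separates them; `split_meta` and the `collectionMethod` key are hypothetical, and the sample values (including the weekly repeating duration `R/P1W`) are illustrative only.

```python
# Field names copied from the meta field list above.
KNOWN_META_FIELDS = {
    "accessPath", "accrualPeriodicity", "citations", "contributors",
    "description", "downloadPath", "homePath", "identifier", "keywords",
    "language", "license", "title", "theme", "version",
}


def split_meta(meta: dict) -> tuple:
    """Split a meta dict into (listed fields, arbitrary extra fields).

    Hypothetical helper for illustration; not a qri API.
    """
    known = {k: v for k, v in meta.items() if k in KNOWN_META_FIELDS}
    extra = {k: v for k, v in meta.items() if k not in KNOWN_META_FIELDS}
    return known, extra


meta = {
    "title": "city population survey",
    "keywords": ["population", "cities"],
    "language": ["en"],
    "accrualPeriodicity": "R/P1W",  # ISO 8601 repeating duration: weekly
    "license": {"type": "CC-BY-2", "url": "https://creativecommons.org/licenses/by/2.0/"},
    "collectionMethod": "survey",   # hypothetical extra key, stored as-is
}

known, extra = split_meta(meta)
print(sorted(known))
print(extra)
```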

Structure

Structure defines the characteristics of a dataset document necessary for a machine to interpret the dataset body. Structure fields are things like the data format (JSON, CSV, etc.) and the length of the dataset body in bytes, stored in a rigid form intended for machine use. A well-defined structure and accompanying software should allow the end user to spend more time focusing on the data itself.

Two dataset documents that both have a defined structure will have some degree of natural interoperability, depending first on the amount of detail provided in each dataset's structure, and then on the natural comparability of the datasets.

structure fields:

| name | type | description |
|------|------|-------------|
| checksum | string | base58-encoded multihash checksum of the entire data file this structure points to. This is different from IPFS hashes, which are calculated after breaking the file into blocks |
| compression | string | specifies any compression on the source data, if empty assume no compression. warning: not yet implemented in qri |
| encoding | string | specifies character encoding, assumes utf-8 if not specified |
| errCount | int | the number of errors returned by validating data against schema |
| entries | int | number of top-level entries in the dataset. analogous to the number of rows in a table |
| format | string | specifies the format of the raw data type by file extension. Must be one of: json, csv, xlsx, cbor |
| formatConfig | object | removes as much ambiguity as possible about how to interpret the specified format. Properties of this object depend on the format field |
| length | int | length of the data object in bytes |
| schema | jsonSchema | the schema definition for the dataset body, schemas are defined using the IETF json-schema specification. for more info on json-schema see: https://json-schema.org |
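Several structure fields can be derived directly from the raw body bytes. The sketch below computes length, entries, and a base58-encoded multihash checksum for a JSON body. The field list does not name the hash function, so sha2-256 (multihash code 0x12) is an assumption here, and `describe_body` and `base58_encode` are hypothetical helpers rather than qri functions.

```python
import hashlib
import json

# Bitcoin-style base58 alphabet (no 0, O, I, or l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58 text."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is represented by a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def describe_body(raw: bytes) -> dict:
    """Derive a few structure fields from a raw JSON body (illustrative only)."""
    digest = hashlib.sha256(raw).digest()
    # Multihash layout: <hash function code><digest length><digest>.
    # Assumption: sha2-256, whose multihash code is 0x12.
    multihash = bytes([0x12, len(digest)]) + digest
    body = json.loads(raw)
    return {
        "format": "json",
        "length": len(raw),    # size of the body in bytes
        "entries": len(body),  # number of top-level entries
        "checksum": base58_encode(multihash),
    }


raw = b'[["toronto", 2930000], ["new york", 8623000]]'
print(describe_body(raw))
```

The checksum here is taken over the whole file in one pass, which matches the distinction the field list draws against IPFS hashes computed over blocks.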

Transform

Transform is a record of executing a transformation on data. Transforms can theoretically be anything from an SQL query to a Jupyter notebook to the state of an ETL pipeline, so long as the input is zero or more datasets and the output is a single dataset. Ideally, transforms should contain all the machine-necessary bits to deterministically execute the algorithm referenced in scriptPath.

transform fields:

| name | type | description |
|------|------|-------------|
| scriptPath | string | the path to the script that produced this transformation |
| syntax | string | language this transform is written in. Only "skylark" is currently supported |
| syntaxVersion | string | an identifier for the application and version number that produced the result |
| config | object | any configuration that would affect the resulting hash. transformations may use values present in config to perform their operations |
| resources | object | map of all datasets transform depends on with both name and commit paths |
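The note that config covers anything that would affect the resulting hash can be illustrated with a small sketch. It fingerprints a transform record by hashing a canonical JSON serialization. This is only an illustration of the idea: how qri actually computes hashes is not described here, and `transform_fingerprint`, the script path, and the syntaxVersion value are all hypothetical. The resources map is left empty, which is allowed since a transform may take zero input datasets.

```python
import hashlib
import json


def transform_fingerprint(transform: dict) -> str:
    """Hash a canonical JSON form of a transform record (illustrative only)."""
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(transform, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "scriptPath": "transform.star",    # hypothetical path
    "syntax": "skylark",
    "syntaxVersion": "example-0.0.1",  # hypothetical identifier
    "config": {"year": 2018},
    "resources": {},                   # zero input datasets
}
changed = dict(base, config={"year": 2019})

# Identical records produce identical fingerprints.
print(transform_fingerprint(base) == transform_fingerprint(dict(base)))  # True
# Changing a config value changes the fingerprint.
print(transform_fingerprint(base) == transform_fingerprint(changed))     # False
```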