Schemas

Flexible data structures

Schemas define the structure of data at any point within a Pipeline graph.

The editor also allows Schemas to be edited and deleted. A Schema cannot be deleted until all Tasks, Pipelines, and Datastores that use it have been deleted; if a delete operation is requested while dependencies remain, Kaspian presents the user with the relevant dependency conflict chain.

Clicking the Edit Schema icon opens the Schema editor modal, where Schemas can be generated in three ways:

  1. From a JSON file on GitHub
  2. From an uploaded JSON file
  3. Manually via the form

Schema validation for each Task can be enabled by toggling the Validate Output Schemas button in the Task editor. This enforces that the output of a Task matches the specified Schema and raises an error if casting is unable to properly resolve the types.
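
In Spark terms, this check amounts to casting each output column to its declared type and comparing the result against the Schema. Below is a minimal PySpark sketch of that behavior, not Kaspian's actual implementation; the DataFrame and field names are illustrative:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.appName("schema-validation").getOrCreate()

# Hypothetical output Schema for a Task; field names are illustrative.
expected = T.StructType([
    T.StructField("field1", T.StringType(), nullable=True),
    T.StructField("field2", T.IntegerType(), nullable=False),
])

def validate_output(df, expected):
    # Cast each column to its declared type; Spark raises an AnalysisException
    # if a column is missing or the cast is impossible for the source type.
    casted = df.select(*[F.col(f.name).cast(f.dataType) for f in expected.fields])
    # Compare names and datatypes of the result against the expected Schema.
    actual = [(f.name, f.dataType) for f in casted.schema.fields]
    wanted = [(f.name, f.dataType) for f in expected.fields]
    if actual != wanted:
        raise ValueError(f"output schema {actual} does not match {wanted}")
    return casted

# field2 arrives as a string here; the cast resolves it to an integer.
df = spark.createDataFrame([("a", "1")], ["field1", "field2"])
validated = validate_output(df, expected)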

Task Schemas

All Tasks have either an input Schema or an output Schema; many require both. The input Schema defines the structure of data entering a Task while the output Schema defines the structure of data exiting a Task.

A Schema consists of an ordered array of fields. Each field has four elements: name, description, datatype, and nullable. name is the name of the column being referenced and must be unique within a given Schema. description provides a space for users to add any useful documentation about the field. datatype must be selected from the following list of supported options:

Datatype     Description
BINARY       Binary values
BOOLEAN      True or false values
BYTE         Byte values
DATE         Dates without time values
DOUBLE       Double precision values
FLOAT        Floating point values
INTEGER      32-bit signed integer values
LONG         64-bit signed integer values
SHORT        16-bit signed integer values
STRING       Text or varchar values
TIMESTAMP    Dates with time(zone) values

In general, Kaspian is compatible with any datatype supported by Apache Spark and maps types from Datastores to this list the same way a Spark engine would.

The nullable flag is a boolean option that specifies whether the value for that field is allowed to be null. This option can serve as a valuable data integrity check for required fields.
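
Taken together, a Schema's ordered fields map naturally onto a Spark StructType. The sketch below shows one plausible mapping, assuming the type correspondences listed above; SPARK_TYPES and build_struct_type are illustrative names, not Kaspian APIs:

from pyspark.sql import types as T

# Illustrative mapping from Kaspian datatypes to Spark types, assuming the
# correspondences in the table above; a sketch, not a Kaspian API.
SPARK_TYPES = {
    "BINARY": T.BinaryType(),
    "BOOLEAN": T.BooleanType(),
    "BYTE": T.ByteType(),
    "DATE": T.DateType(),
    "DOUBLE": T.DoubleType(),
    "FLOAT": T.FloatType(),
    "INTEGER": T.IntegerType(),
    "LONG": T.LongType(),
    "SHORT": T.ShortType(),
    "STRING": T.StringType(),
    "TIMESTAMP": T.TimestampType(),
}

def build_struct_type(fields):
    # fields is the Schema's ordered array: dicts with name, description,
    # datatype, and nullable, as described above.
    return T.StructType([
        T.StructField(
            f["name"],
            SPARK_TYPES[f["datatype"]],
            nullable=f["nullable"],
            metadata={"description": f["description"]},
        )
        for f in fields
    ])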

Uploading Schema JSON files

Larger schemas can be provided as JSON files. These files must be in the following format:

[
  {
    "name": "field1",
    "description": "description of field1",
    "dataType": "StringType",
    "nullable": true
  },
  {
    "name": "field2",
    "description": "description of field2",
    "dataType": "IntegerType",
    "nullable": false
  }
]

The dataType naming follows Spark's data type class names (for example, StringType and IntegerType).
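
Because each dataType value matches a class name in pyspark.sql.types, a file in this format can be resolved directly into a Spark schema. A minimal sketch, with load_schema_json as an illustrative helper rather than a Kaspian API:

import json
from pyspark.sql import types as T

def load_schema_json(path):
    # Each entry's dataType ("StringType", "IntegerType", ...) matches a class
    # in pyspark.sql.types, so it can be resolved by name with getattr.
    with open(path) as fh:
        fields = json.load(fh)
    return T.StructType([
        T.StructField(f["name"], getattr(T, f["dataType"])(), f["nullable"])
        for f in fields
    ])

schema = load_schema_json("schema.json")  # hypothetical path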

Datastore Schemas

Tables registered in flat file/data lake environments such as AWS S3 can be added as Datastores. This abstraction allows these resources to behave identically to SQL Datastores such as Snowflake and Postgres within the Kaspian compute layer. Kaspian requires that these Datastores have a Schema attached so that data integrity can be programmatically enforced. It is recommended that Datastore Schemas have global scope so that they can be reused by other Pipelines.
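
As an illustration of how an attached Schema makes flat files behave like a typed table, the following PySpark sketch reads from a hypothetical S3 location with an explicit schema, much as Spark would treat a SQL Datastore:

from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.appName("datastore-schema").getOrCreate()

# Hypothetical Schema attached to the Datastore; field names are illustrative.
schema = T.StructType([
    T.StructField("field1", T.StringType(), nullable=True),
    T.StructField("field2", T.IntegerType(), nullable=False),
])

# Hypothetical S3 location. Supplying the Schema explicitly lets Spark enforce
# column names, types, and nullability rather than inferring them, so the flat
# files behave like a typed table.
df = (
    spark.read
    .schema(schema)
    .option("mode", "FAILFAST")  # fail on records that violate the schema
    .json("s3://example-bucket/path/to/table/")
)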