Datasets
Dataset files define datasets to use in the repository. Each dataset
file in your repository must correspond to either a physical table/view
in your database, or the results of a SELECT
statement.
Note: Dataset files must define all columns in the physical tables they reference, and can therefore be quite large. Because of this, AtScale recommends sharing these files across repositories.
Dataset files support the following properties.
unique_name
- Type: string
- Required: Y
The unique name of the dataset. This must be unique across all repositories and subrepositories.
object_type
- Type: const
- Required: Y
The type of object defined by the file. For datasets, the value of this
property must be dataset
.
label
- Type: string
- Required: Y
The name of the dataset, as it appears in AtScale. This value does not need to be unique.
connection_id
- Type: string
- Required: Y
The unique_name
of the connection object that defines the database and
schema in which the dataset is stored.
sql
- Type: string
- Required: Required if
table
is not defined
A SQL query used to pull data from a specific connection defined within the repository, similar to a database view. This determines whether the dataset file defines a query dataset.
table
- Type: string
- Required: Required if
sql
is not defined
The name of the table in the database that the dataset is based on.
columns
- Type: array
- Required: Y
Defines the columns available in the dataset.
Note: You should define all columns available in the dataset. This is especially important for dataset files that are shared across multiple repositories.
The columns
property within a dataset file supports the following
properties.
name
- Type: string
- Required: Y
The name of the column.
data_type
- Type: string
- Required: Required unless this column is a
map
The data type of the values within the column.
Supported values:
string
int
long
bigint
tinyint
float
double
decimal
decimal(x,y)
number
number(x,y)
numeric(x,y)
boolean
date
datetime
timestamp
sql
- Type: string
- Required: N
Defines the column as a calculated column.
Calculated columns enable you to add simple data transformations to the dataset. These can be used as the basis of model attributes, just like any other dataset column. For more information, see Adding Calculated Columns To Datasets For Simple Data Transformations.
The value of this property should be a valid SQL statement that can be
run as part of the SELECT
list of a query.
The SQL statement is passed directly to the underlying database when the
query runs, so it must be in a syntax that is supported by your chosen
engine. If you want to run the query on other types of databases, use
the dialects
property to define additional dialects for it to run in.
dialects
- Type: array
- Required: N
Defines alternate dialects for the sql
statement so that it can run on
other types of databases. You can define as many alternate dialects as
needed.
Supported properties:
-
dialect
: String, required. An alternate SQL dialect. Supported values:Snowflake
Postgresql
DatabricksSQL
BigQuery
Iris
-
sql
: String, required. The alternate SQL statement.
map
- Type: object
- Required: N
Defines a map to use to create a calculated column.
Supported properties:
field_terminator
: String, required. The delimiter used to separate the key:value pairs. This must be in quotes (").key_terminator
: String, required. The delimiter used to separate the individual keys from their values. This must be in quotes (").key_type
: String, required. The data type of the map's keys.value_type
: String, required. The data type of the map's values.
The mapped columns are defined as separate columns within the dataset
file. Each of these must have the parent_column
property.
For more information on maps in AtScale, see Extract Values From A Map.
parent_column
- Type: string
- Required: Required for mapped columns
For mapped columns only. Specifies the map
column used to create this
column.
description
- Type: string
- Required: N
A description of the dataset.
incremental
- Type: object
- Required: N
Enables incremental builds for the dataset. When the engine performs an incremental rebuild of an aggregate table, it appends new rows and updates existing rows that fall within a specified period of time (called the grace period). For more information, see About Incremental Rebuilds.
The incremental
property supports the following properties.
column
- Type: string
- Required: Y
The name of the dataset column to use as an incremental indicator. This column must have values that increase monotonically, such as a numeric UNIX timestamp showing seconds since epoch, or a Timestamp/DateTime. The values in this column enable the AtScale engine both to append rows to an aggregate table and update rows during an incremental rebuild.
The values of this column should be of one of the following data types: Long, Integer, Decimal (38,0) (Snowflake only), Timestamp, DateTime.
If you do not have a column that meets this criteria, you may need to create a calculated column to transform string or datetime values into the right data type.
grace_period
- Type: string
- Required: Y
When the AtScale engine starts an incremental build, the grace_period
determines how far back in time the engine looks for updates to rows in
the dataset; for example, one week or 15 days.
The value of this property should be an integer, followed by the time
unit. The time unit can be any of the following: s
(second), m
(minute), h
(hour), d
(day), w
(week).
For example, setting the value to '100s'
sets the grace period to 100
seconds. Setting it to '1w'
sets the grace period to one week.
immutable
- Type: boolean
- Required: N
Determines whether the dataset changes often or not. The AtScale engine uses this information when running incremental builds of aggregates that use joins on dimensions that do not change often.