Warehouse-Level Machine Learning

AI-Link can push data science and machine learning workloads down to the warehouse level, allowing users to perform inference and explore data without moving it or incurring heavy computational costs on their own machines. Instead of extracting data from the warehouse and analyzing it locally, a user working in Python can simply call a function that performs all of the necessary computation in the data warehouse.

Linear Regression

Linear regression predicts the value of some target feature y via a linear combination of n other predictor features x_1, ..., x_n, i.e.:

y = c_1x_1 + ... + c_nx_n

AI-Link determines the set of constants c_1, ..., c_n for the features x_1, ..., x_n so that users can predict the value of y using just the predictor features.

Below is an example of how one would call this functionality:

from atscale.eda.linear_regression import linear_regression

coefs = linear_regression(
    dbconn=db,                                 # SQLConnection to the data warehouse
    data_model=data_model,                     # the user's data model
    predictors=["total_number_of_customers"],  # x_1, ..., x_n
    prediction_target="total_sales",           # y, the feature being estimated
    granularity_levels=["month"]               # aggregate at the month level
)

In the above example, db is a SQLConnection object (e.g., a Snowflake object) corresponding to the user's data warehouse, and data_model is the user's data model. The predictors parameter lists the features in the user's data model that serve as inputs to the regression (i.e., x_1, ..., x_n in the model definition above), while prediction_target is the feature being estimated (i.e., y). The granularity_levels parameter specifies the granularity of both the training data passed to the model and the output – in the example above, it specifies that total customers are aggregated at the month level in order to forecast total monthly sales. Lastly, the coefs output contains the coefficients c_1, ..., c_n referenced above.
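
With coefs in hand, a prediction is just the linear combination from the model definition above. Below is a minimal sketch of that step, assuming coefs is an ordered collection of numeric coefficients aligned with the predictors list (the exact return type may differ):

def predict(coefs, predictor_values):
    # Apply the fitted model: y = c_1*x_1 + ... + c_n*x_n.
    return sum(c * x for c, x in zip(coefs, predictor_values))

# For example, estimate total_sales for a month with 1,250 total customers
# (the customer count here is made up for illustration).
estimated_sales = predict(coefs, [1250])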

Principal Component Analysis

Principal component analysis (PCA) is an exploratory data analysis technique for quantifying the interrelation of features in a dataset. For a dataset with n features, PCA returns a collection of principal components – i.e., vectors v_1, ..., v_n – along with a collection of corresponding scalar weights w_1, ..., w_n. At a high level, each vector describes a pattern of interrelation among the n features, and each corresponding weight describes how prevalent that pattern is in the data.
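
For intuition, PCA can be computed classically as an eigendecomposition of the dataset's covariance matrix: the eigenvectors are the principal components, and the eigenvalues are their weights. Below is a small, self-contained NumPy sketch of that computation – purely illustrative, since AI-Link performs the equivalent work inside the warehouse:

import numpy as np

# Toy dataset: five observations (rows) of three features (columns).
X = np.array([
    [1.0, 2.1, 0.9],
    [2.0, 3.9, 2.1],
    [3.0, 6.1, 2.9],
    [4.0, 8.0, 4.1],
    [5.0, 9.9, 5.0]
])

# Center the features, then eigendecompose the covariance matrix.
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Sort by descending weight: the first component captures the dominant
# pattern of interrelation among the features.
order = np.argsort(eigenvalues)[::-1]
weights = eigenvalues[order]
components = eigenvectors[:, order].T  # row i is the i-th principal component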

Below is an example of how one would call this functionality:

from atscale.eda.pca import pca

pcs, weights = pca(
    dbconn=db,                     # SQLConnection to the data warehouse
    data_model=data_model,         # the user's data model
    pc_num=2,                      # number of principal components to return
    numeric_features=[
        "Stock_A_price",
        "Stock_B_price",
        "Stock_C_price"
    ],
    granularity_levels=["date"]    # daily prices
)

In the above example, db is a SQLConnection object (e.g., a Snowflake object) corresponding to the user's data warehouse, and data_model is the user's data model. The pc_num parameter indicates how many principal components (and corresponding weights) to return. The numeric_features parameter indicates the features making up the dataset – in this case, time series data for three different stocks. The granularity_levels parameter specifies the granularity of the dataset; in this case, daily stock prices are provided.

With this output, a user could investigate whether price fluctuations in Stocks A, B, and C are related – and if so, by how much.
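
As a concrete (hypothetical) illustration of that workflow, one might inspect the first component's loadings and its weight. The sketch below assumes pcs and weights come back as pandas DataFrames with one column per component – the column name "PC1" and the return format are assumptions, not the documented API:

# Hypothetical inspection of the PCA output; "PC1" and the DataFrame
# return format are assumptions made for illustration.
first_pc = pcs["PC1"]          # loadings of the three stock prices on component 1
first_weight = weights["PC1"]  # how prevalent that pattern is in the data

# If the first weight dominates and the three loadings share a sign, the
# three stock prices tend to move up and down together.
print(first_pc, first_weight)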

Summary Statistics

AI-Link also supports warehouse-level computation of familiar summary statistics like variance, standard deviation, covariance, and correlation for features in a user's data model. For instance, a user can find the standard deviation of the total_sales feature measured daily as follows:

from atscale.eda.stats import std

stdev = std(
    dbconn=db,                     # SQLConnection to the data warehouse
    data_model=data_model,         # the user's data model
    feature="total_sales",         # feature whose standard deviation is computed
    granularity_levels=["date"]    # measured at the daily level
)

In the above example, db is a SQLConnection object (e.g., a Snowflake object) corresponding to the user's data warehouse, and data_model is the user's data model. The stdev output is the standard deviation of total_sales measured at the date level.
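
The other statistics follow the same pattern, with covariance and correlation comparing two features at a time. As a sketch – the exact function and parameter names in atscale.eda.stats are assumptions made by analogy with std above – such calls might look like:

from atscale.eda.stats import covariance, correlation

# Assumed signatures, by analogy with std above; feature_1/feature_2 are
# hypothetical parameter names.
cov = covariance(
    dbconn=db,
    data_model=data_model,
    feature_1="total_sales",
    feature_2="total_number_of_customers",
    granularity_levels=["date"]
)

corr = correlation(
    dbconn=db,
    data_model=data_model,
    feature_1="total_sales",
    feature_2="total_number_of_customers",
    granularity_levels=["date"]
)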