Gable supports explicitly registering data assets with our Python CLI using the gable data-asset register command. The following data asset sources are currently supported:

Relational Databases
  • Postgres
  • MySQL
  • MS SQL Server
Serialization Formats
  • Protobuf
  • Avro
  • JSON Schema
Static Code Analysis
  • Python
  • PySpark
  • Typescript
Object Storage
  • AWS S3

Registering Relational Database Tables

The same strict security requirements that make it difficult to automatically scrape data assets from production relational databases often also apply to explicitly registering tables as data assets. To enable registration without connecting to a production database instance, Gable relies on the concept of a “proxy” database: a database instance that mirrors your production database’s schema but is accessible locally or in your CI/CD environment. Instead of connecting to a production database, Gable can be configured to connect to its proxy, registering any tables it finds as if it were connected to production.

The proxy database concept removes the need to grant access to your production data and eliminates any possibility of impacting the performance of your production database. A proxy database can be a local Docker container, a Docker container spun up in your CI/CD workflow, or a database instance in a test/staging environment. If you already start a database Docker container in your CI/CD workflows for integration testing, Gable can be configured to use that same container at the end of the test run.

The tables in the proxy database must have the same schema as your production database for all tables to be correctly registered, so if you’re registering tables as data assets in your CI/CD workflows, it’s important to only do so from your main branch. When using a proxy database, you specify the production host/port/schema as well as those of the proxy. The production information is required to compute the unique data asset resource name for each discovered table, so the tables can be registered as if they came from the production database.
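As a sketch, a registration run against a local Docker proxy might look like the following. The hostnames, ports, database names, and credentials are placeholders, and the proxy flag names shown here are assumptions (this section documents the concept but not the exact options); consult the CLI’s help output for the exact flags.

```shell
# Hypothetical example: register tables from a production Postgres database
# using a local Docker container as the proxy. The production host/port/schema
# are used only to compute each table's resource name; the CLI actually
# connects to the proxy. Proxy flag names are illustrative placeholders.
gable data-asset register \
  --source-type postgres \
  --host prod-db.example.com --port 5432 \
  --db orders --schema public \
  --proxy-host localhost --proxy-port 5432 \
  --proxy-user postgres --proxy-password postgres
```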

Registering Protobuf, Avro or JSON Schema Files

Registering a service’s Protobuf, Avro, or JSON schema files is straightforward: the only requirement is having the service’s git repository checked out locally. The CLI supports registering multiple files, either as a space-separated list (file1.proto file2.proto) or as a glob pattern. The register command must be run from within the repository’s directory, as it uses the repo’s git information to construct the unique resource name for the data assets it discovers.
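A minimal sketch of such a run, using a glob pattern, is shown below. The directory layout is a placeholder, and the --files flag name is an assumption not confirmed by this section; run the CLI’s help output to confirm the exact option.

```shell
# Hypothetical example: register all Protobuf files under ./protos as data
# assets. Run from inside the service's git repository so the repo's git
# information can be used to build each asset's resource name.
gable data-asset register \
  --source-type protobuf \
  --files ./protos/*.proto
```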

Static Code Analysis

Gable can automatically identify and register data-generating code across your codebase for a comprehensive view of your first-party data assets. Using static code analysis, Gable recognizes code that generates data payloads, capturing the structure and types as a reusable data asset definition. This feature supports Python, PySpark, and Typescript, and we’ll soon be expanding support to other languages. The examples below show how to use the Gable CLI to bundle, transmit, and analyze your code for native type detection and asset registration. Bundling and transmitting your code is necessary for Gable’s static analysis, but your code is never persisted on Gable’s servers. In future releases, we will add the ability to run the static code analysis entirely within your CI/CD pipeline.

Python

When registering a Python project, it’s important to specify the project’s entry point: the root directory of your project. This allows Gable to correctly identify and bundle the project for analysis. Additionally, specifying the emitter function and the event name key helps Gable understand how your project emits data, ensuring accurate tracking and management.

Register Python Options

  • --source-type - Set to python for Python projects
  • --project-root - The project’s entry point (its root directory), used for proper bundling
  • --emitter-file-path - The path to the file containing the emitter function
  • --emitter-function - The name of the emitter function
  • --emitter-payload-parameter - The name of the parameter representing the event payload within the emitter function
  • --event-name-key - The event property that distinguishes event types
It’s crucial to ensure that the information provided reflects the actual state of your project in your production environment. When using these options as part of your CI/CD workflow, make sure to register your Python projects from the main branch. Registering from feature or development branches may lead to inconsistencies and inaccuracies in the data asset registry, as these branches may contain code that is not yet, or may never be, deployed to production.
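Putting the options above together, a run might look like the following sketch. The project path, file path, function name, parameter name, and event key are placeholders standing in for your project’s actual emitter.

```shell
# Hypothetical example using the documented Python options. All values below
# are placeholders: your emitter function, its payload parameter, and the
# event-name property will differ.
gable data-asset register \
  --source-type python \
  --project-root ./my-service \
  --emitter-file-path src/events.py \
  --emitter-function emit_event \
  --emitter-payload-parameter payload \
  --event-name-key event_type
```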

PySpark

Gable’s support for PySpark projects enables you to analyze and register data assets from PySpark jobs. This addition caters to a wider range of data engineering and analysis workflows, enhancing data asset management within your organization.

Register PySpark Options

  • --source-type - Set to pyspark for PySpark projects
  • --project-root - The directory containing the PySpark job to be analyzed
  • --spark-job-entrypoint - The command to execute the Spark job, including any arguments
  • --connection-string - Connection string to the Hive metastore
  • --csv-schema-file - Path to a CSV file containing the schema of upstream tables, formatted with columns table_name, column_name, and column_type
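A sketch combining these options is shown below. The schema file follows the documented table_name, column_name, column_type format; the job path, entrypoint arguments, table names, and column types are all placeholders.

```shell
# Hypothetical upstream schema file in the documented CSV format.
cat > upstream_schema.csv <<'EOF'
table_name,column_name,column_type
orders,order_id,bigint
orders,created_at,timestamp
EOF

# Hypothetical registration command; the entrypoint string is a placeholder
# for however your Spark job is actually launched.
gable data-asset register \
  --source-type pyspark \
  --project-root ./spark-jobs \
  --spark-job-entrypoint "main.py --output-table prod.orders" \
  --csv-schema-file upstream_schema.csv
```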

Typescript

Gable’s integration of the Typescript Data Assets feature allows for the registration and management of data assets for Typescript projects. This enhancement is specifically designed to support data asset tracking in software development workflows, augmenting the control and consistency of data usage across your applications. Gable natively supports common event publishing libraries like Segment and Amplitude, making registration as simple as specifying --library segment. If we don’t yet support the library you use, you can instead specify a User Defined Function (UDF): a shared helper function within your code that is used to publish events.
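For the natively supported library case, a run can be as simple as the sketch below (the project path is a placeholder; segment is the library name given in this section).

```shell
# Hypothetical example: a Typescript project that publishes events with a
# natively supported library (Segment, per the docs above).
gable data-asset register \
  --source-type typescript \
  --project-root ./my-app \
  --library segment
```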

Register Typescript Options

Required
  • --source-type - Set to typescript to register events in Typescript
  • --project-root - The directory containing the Typescript project to be analyzed
Supported Libraries
  • --library - The natively supported library used to publish data, usually events
User Defined Function
  • --emitter-file-path src/lib/events.ts - The path to the file containing the UDF
  • --emitter-function trackEvent - The name of the UDF
  • --emitter-payload-parameter eventProperties - The name of the function parameter representing the event payload
  • --emitter-name-parameter eventName - [Optional] The name of the function parameter representing the event name. Use either this option or --event-name-key.
  • --event-name-key __event_name - [Optional] The name of the event property representing the event name. Use either this option or --emitter-name-parameter.
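For the UDF case, a run might look like the sketch below, reusing the illustrative values from the option list above. The project path, file path, function name, and parameter names are placeholders for your project’s actual helper.

```shell
# Hypothetical example using the documented UDF options. trackEvent,
# eventProperties, and eventName are the illustrative names from the option
# list above, not requirements.
gable data-asset register \
  --source-type typescript \
  --project-root ./my-app \
  --emitter-file-path src/lib/events.ts \
  --emitter-function trackEvent \
  --emitter-payload-parameter eventProperties \
  --emitter-name-parameter eventName
```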

S3

Gable now extends its asset registration capabilities to include data assets stored in AWS S3. This functionality is crucial for organizations managing data across distributed storage systems and aims to streamline the integration of S3 data into your data governance framework.

What are S3 Data Assets in Gable?

Managing large volumes of data in S3 and understanding changes to that data can be challenging. Gable helps you better understand your structured data by running its targeted inference algorithm against your S3 bucket. Gable evaluates the structured data in your bucket by inferring the naming conventions you’re using. It then samples data from matching files to build an understanding of the types within those files and registers them as Gable data assets, allowing you to detect and respond to unexpected changes in your S3 bucket.

Supported File Types

Supported Files
  • CSV
  • JSON

CSV Types → Gable Types
  • Number → int
  • String → string
  • Date → string

JSON Types → Gable Types
  • int → int
  • string → string

Registering S3 Data Assets

To register S3 data assets, you can use the gable data-asset register command with specific parameters tailored to S3 sources.

Command Usage

To register data assets from S3, specify the source type as s3, along with the bucket name. Here is the command structure:
gable data-asset register --source-type s3 --bucket <bucket_name>

Register S3 Options

Required
  • --source-type - Set to s3 to register data assets in an S3 bucket
  • --bucket - Name of bucket in S3
Optional
  • --include-prefix - Specifies what prefixes to include in your S3 bucket. If not specified, all files in the bucket will be analyzed. Multiple include prefixes can be specified like --include-prefix 'path/prefix1' --include-prefix 'path/prefix2'
  • --exclude-prefix - Specifies what prefixes to exclude in your S3 bucket. If include prefixes are also specified, then an exclude prefix must be a subset of at least one include prefix. Multiple exclude prefixes can be specified like --exclude-prefix 'path/prefix1' --exclude-prefix 'path/prefix2'
  • --lookback-days - Determines the number of days to include before the latest date in the found S3 paths. Defaults to 2. For example, if the latest path is 2024-01-02 and --lookback-days is 3, the paths analyzed will contain 2024-01-02, 2024-01-01, and 2023-12-31.
  • --history - This flag allows you to perform an analysis spanning between two specific dates.
  • --skip-profiling - Turns off data profiling of found schemas. Saves processing time.
  • --row-sample-count - Specifies the number of rows of data per file to sample for schema detection and data profiling. Defaults to 1000. Higher values increase accuracy, but also increase processing time and AWS read costs.
  • --recent-file-count - Specifies the number of most recent files whose schema will be used for inference per data asset. Default is 3. For example, if the latest file is 2024/01/10 and --recent-file-count is 2, then only files 2024/01/10 and 2024/01/09 will be used for schema inference, even if --lookback-days is greater than 2. Increase this value to improve schema accuracy over more schema history, at the cost of increased runtime. Must be at least 1.
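Combining several of the options above, a run might look like the sketch below. The bucket name and prefixes are placeholders; note that the exclude prefix is a subset of the include prefix, as the documentation requires.

```shell
# Hypothetical example: analyze only files under events/, excluding
# events/raw/, sampling 500 rows per file over a 7-day lookback window.
gable data-asset register \
  --source-type s3 \
  --bucket my-company-data \
  --include-prefix 'events/' \
  --exclude-prefix 'events/raw/' \
  --lookback-days 7 \
  --row-sample-count 500
```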