Gable supports explicitly registering data assets with our Python CLI using the gable data-asset register command. The following data asset sources are currently supported:

Relational Databases
  • Postgres
  • MySQL
  • MS SQL Server
Serialization Formats
  • Protobuf
  • Avro
  • JSON Schema
Static Code Analysis
  • Python
  • PySpark
  • Typescript
Object Storage
  • AWS S3

Registering Relational Database Tables

The same strict security requirements that make it difficult to automatically scrape data assets from production relational databases often also apply to explicitly registering tables as data assets. To enable registration without connecting to a production database instance, Gable relies on the concept of a “proxy” database: a database instance that mirrors your production database’s schema but is accessible locally or in your CI/CD environment. Instead of connecting to a production database, Gable can be configured to connect to its proxy, registering any tables it finds as if it were connected to production.

The proxy database concept removes the need to grant access to your production data and eliminates any possibility of impacting the performance of your production database. A proxy database can be a local Docker container, a Docker container spun up in your CI/CD workflow, or a database instance in a test/staging environment. If you already start a database Docker container in your CI/CD workflows for integration testing, Gable can be configured to use that same container at the end of the test run.

The tables in the proxy database must have the same schema as your production database for all tables to be correctly registered, so if you’re registering tables as data assets in your CI/CD workflows, it’s important to only do so from your main branch. When using a proxy database, you specify the production host/port/schema as well as those of the proxy. The production information is required to compute the unique data asset resource name for each discovered table, so the tables can be registered as if they came from the production database.
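As a sketch, a registration run against a local Docker proxy might look like the following. The hostnames, ports, database names, and credentials are placeholders, and the proxy flag names shown here are assumptions (this section documents the concept but not the exact options); consult the CLI’s help output for the exact flags.

```shell
# Hypothetical example: register tables from a production Postgres database
# using a local Docker container as the proxy. The production host/port/schema
# are used only to compute each table's resource name; the CLI actually
# connects to the proxy. Proxy flag names are illustrative placeholders.
gable data-asset register \
  --source-type postgres \
  --host prod-db.example.com --port 5432 \
  --db orders --schema public \
  --proxy-host localhost --proxy-port 5432 \
  --proxy-user postgres --proxy-password postgres
```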

Registering Protobuf, Avro or JSON Schema Files

Registering a service’s Protobuf, Avro, or JSON schema files is straightforward: the only requirement is having the service’s git repository checked out locally. The CLI supports registering multiple files, either as a space-separated list (file1.proto file2.proto) or as a glob pattern. The register command must be run from within the repository’s directory, as it uses the repo’s git information to construct the unique resource name for the data assets it discovers.
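A minimal sketch of such a run, using a glob pattern, is shown below. The directory layout is a placeholder, and the --files flag name is an assumption not confirmed by this section; run the CLI’s help output to confirm the exact option.

```shell
# Hypothetical example: register all Protobuf files under ./protos as data
# assets. Run from inside the service's git repository so the repo's git
# information can be used to build each asset's resource name.
gable data-asset register \
  --source-type protobuf \
  --files ./protos/*.proto
```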

Static Code Analysis

Gable can automatically identify and register data-generating code across your codebase for a comprehensive view of your first-party data assets. Using static code analysis, Gable recognizes code that generates data payloads, capturing the structure and types as a reusable data asset definition. This feature supports Python, PySpark, and Typescript, and we’ll soon be expanding support to other languages. The examples below show how to use the Gable CLI to bundle, transmit, and analyze your code for native type detection and asset registration. Bundling and transmitting your code is necessary for Gable’s static analysis, but your code is never persisted on Gable’s servers. In future releases, we will add the ability to run the static code analysis entirely within your CI/CD pipeline.

Python

When registering a Python project, it’s important to specify the project’s entry point: the root directory of your project. This allows Gable to correctly identify and bundle the project for analysis. Additionally, specifying the emitter function and the event name key helps Gable understand how your project emits data, ensuring accurate tracking and management.

Register Python Options

  • --source-type - Set to python for Python projects
  • --project-root - The project’s entry point (its root directory), used for proper bundling
  • --emitter-file-path - The path to the file containing the emitter function
  • --emitter-function - The name of the emitter function
  • --emitter-payload-parameter - The name of the parameter representing the event payload within the emitter function
  • --event-name-key - The event property that distinguishes event types
It’s crucial to ensure that the information provided reflects the actual state of your project in your production environment. When using these options as part of your CI/CD workflow, make sure to register your Python projects from the main branch. Registering from feature or development branches may lead to inconsistencies and inaccuracies in the data asset registry, as these branches may contain code that is not yet, or may never be, deployed to production.
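Putting the options above together, a run might look like the following sketch. The project path, file path, function name, parameter name, and event key are placeholders standing in for your project’s actual emitter.

```shell
# Hypothetical example using the documented Python options. All values below
# are placeholders: your emitter function, its payload parameter, and the
# event-name property will differ.
gable data-asset register \
  --source-type python \
  --project-root ./my-service \
  --emitter-file-path src/events.py \
  --emitter-function emit_event \
  --emitter-payload-parameter payload \
  --event-name-key event_type
```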

PySpark

Gable’s support for PySpark projects enables you to analyze and register data assets from PySpark jobs. This addition caters to a wider range of data engineering and analysis workflows, enhancing data asset management within your organization.

Register PySpark Options

  • --source-type - Set to pyspark for PySpark projects
  • --project-root - The directory containing the PySpark job to be analyzed
  • --spark-job-entrypoint - The command to execute the Spark job, including any arguments
  • --connection-string - Connection string to the Hive metastore
  • --csv-schema-file - Path to a CSV file containing the schema of upstream tables, formatted with columns table_name, column_name, and column_type
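A sketch combining these options is shown below. The schema file follows the documented table_name, column_name, column_type format; the job path, entrypoint arguments, table names, and column types are all placeholders.

```shell
# Hypothetical upstream schema file in the documented CSV format.
cat > upstream_schema.csv <<'EOF'
table_name,column_name,column_type
orders,order_id,bigint
orders,created_at,timestamp
EOF

# Hypothetical registration command; the entrypoint string is a placeholder
# for however your Spark job is actually launched.
gable data-asset register \
  --source-type pyspark \
  --project-root ./spark-jobs \
  --spark-job-entrypoint "main.py --output-table prod.orders" \
  --csv-schema-file upstream_schema.csv
```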

Typescript

Gable’s integration of the Typescript Data Assets feature allows for the registration and management of data assets for Typescript projects. This enhancement is specifically designed to support data asset tracking in software development workflows, augmenting the control and consistency of data usage across your applications. Gable natively supports common event publishing libraries like Segment and Amplitude, making registration as simple as specifying --library segment. If we don’t yet support the library you use, you can instead specify a User Defined Function (UDF): a shared helper function within your code that is used to publish events.
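For the natively supported library case, a run can be as simple as the sketch below (the project path is a placeholder; segment is the library name given in this section).

```shell
# Hypothetical example: a Typescript project that publishes events with a
# natively supported library (Segment, per the docs above).
gable data-asset register \
  --source-type typescript \
  --project-root ./my-app \
  --library segment
```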

Register Typescript Options

Required
  • --source-type - Set to typescript to register events in Typescript
  • --project-root - The directory containing the Typescript project to be analyzed
Supported Libraries
  • --library - The natively supported library used to publish data, usually events
User Defined Function
  • --emitter-file-path src/lib/events.ts - The path to the file containing the UDF
  • --emitter-function trackEvent - The name of the UDF
  • --emitter-payload-parameter eventProperties - The name of the function parameter representing the event payload
  • --emitter-name-parameter eventName - [Optional] The name of the function parameter representing the event name. Use either this option or --event-name-key.
  • --event-name-key __event_name - [Optional] The name of the event property representing the event name. Use either this option or --emitter-name-parameter.
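For the UDF case, a run might look like the sketch below, reusing the illustrative values from the option list above. The project path, file path, function name, and parameter names are placeholders for your project’s actual helper.

```shell
# Hypothetical example using the documented UDF options. trackEvent,
# eventProperties, and eventName are the illustrative names from the option
# list above, not requirements.
gable data-asset register \
  --source-type typescript \
  --project-root ./my-app \
  --emitter-file-path src/lib/events.ts \
  --emitter-function trackEvent \
  --emitter-payload-parameter eventProperties \
  --emitter-name-parameter eventName
```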

S3

Gable now extends its asset registration capabilities to include data assets stored in AWS S3. This functionality is crucial for organizations managing data across distributed storage systems and aims to streamline the integration of S3 data into your data governance framework.

What are S3 Data Assets in Gable?

Managing large volumes of data in S3 and understanding changes to that data can be challenging. Gable helps you better understand your structured data by running its targeted inference algorithm against your S3 bucket. Gable evaluates the structured data in your bucket by inferring the naming conventions you’re using. It then samples data from matching files to build an understanding of the types within those files and registers them as Gable data assets, allowing you to detect and respond to unexpected changes in your S3 bucket.

Supported File Types

Supported Files
  • CSV
  • JSON

CSV Types → Gable Types
  • Number → int
  • String → string
  • Date → string

JSON Types → Gable Types
  • int → int
  • string → string

Registering S3 Data Assets

To register S3 data assets, you can use the gable data-asset register command with specific parameters tailored to S3 sources.

Command Usage

To register data assets from S3, specify the source type as s3, along with the bucket name. Here is the command structure:
gable data-asset register --source-type s3 --bucket <bucket_name>

Register S3 Options

Required
  • --source-type - Set to s3 to register data assets in an S3 bucket
  • --bucket - Name of bucket in S3
Optional
  • --include-prefix - Specifies what prefixes to include in your S3 bucket. If not specified, all files in the bucket will be analyzed. Multiple include prefixes can be specified like --include-prefix 'path/prefix1' --include-prefix 'path/prefix2'
  • --exclude-prefix - Specifies what prefixes to exclude in your S3 bucket. If include prefixes are also specified, then an exclude prefix must be a subset of at least one include prefix. Multiple exclude prefixes can be specified like --exclude-prefix 'path/prefix1' --exclude-prefix 'path/prefix2'
  • --lookback-days - Determines the number of days to include before the latest date in the found S3 paths. Defaults to 2. For example, if the latest path is 2024-01-02 and --lookback-days is 3, the paths analyzed will contain 2024-01-02, 2024-01-01, and 2023-12-31.
  • --history - This flag allows you to perform an analysis spanning between two specific dates.
  • --skip-profiling - Turns off data profiling of found schemas. Saves processing time.
  • --row-sample-count - Specifies the number of rows of data per file to sample for schema detection and data profiling. Defaults to 1000. Higher values increase accuracy, but also increase processing time and AWS read costs.
  • --recent-file-count - Specifies the number of most recent files whose schema will be used for inference per data asset. Default is 3. For example, if the latest file is 2024/01/10 and --recent-file-count is 2, then only files 2024/01/10 and 2024/01/09 will be used for schema inference, even if --lookback-days is greater than 2. Increase this value to improve schema accuracy over more schema history, at the cost of increased runtime. Must be at least 1.
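Combining several of the options above, a run might look like the sketch below. The bucket name and prefixes are placeholders; note that the exclude prefix is a subset of the include prefix, as the documentation requires.

```shell
# Hypothetical example: analyze only files under events/, excluding
# events/raw/, sampling 500 rows per file over a 7-day lookback window.
gable data-asset register \
  --source-type s3 \
  --bucket my-company-data \
  --include-prefix 'events/' \
  --exclude-prefix 'events/raw/' \
  --lookback-days 7 \
  --row-sample-count 500
```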