`gable data-asset register` command. The following data assets are currently supported:
- Relational Databases
  - Postgres
  - MySQL
  - MS SQL Server
- Protobuf
- Avro
- JSON Schema
Example
This example registers the `service-one.aaaaaaaaaaaa.eu-west-1.rds.amazonaws.com` database instance using a local Docker Postgres instance as the proxy database. The local Docker database first has the service's migrations applied so its schema mirrors that of production.

Files can be passed as a space-delimited list (`file1.proto file2.proto`), or as a glob pattern. The `register` command must be run within the repository's directory, as it uses the repo's git information to construct the unique resource name for the data assets it discovers.
Example
`serviceone`.

Example: Registering Python Emitter
- `--source-type` - Specify the source; `python` in this case
- `--project-root` - Specify the project's entry point for proper bundling
- `--emitter-file-path` - Identify the location of the emitter function
- `--emitter-function` - Identify the emitter function
- `--emitter-payload-parameter` - Identify the payload parameter within the emitter function
- `--event-name-key` - Define the property of the event that distinguishes event types

Example: Registering Python Emitter (Detailed)
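For concreteness, here is a minimal sketch of what such an emitter might look like. The module contents are hypothetical and for illustration only; the function name `send_analytics_payload`, its `payload` parameter, and the `_type` property match the walkthrough that follows.

```python
# Hypothetical emitter module; the function name, parameter name, and
# "_type" property mirror the flags described above.
import json


def send_analytics_payload(payload: dict) -> str:
    """Publish an analytics event. The event's "_type" property
    distinguishes one event type from another."""
    if "_type" not in payload:
        raise ValueError("analytics payloads must carry a '_type' property")
    # A real emitter would send this to a message bus or analytics API;
    # returning the serialized event keeps the sketch self-contained.
    return json.dumps(payload)


event = send_analytics_payload({"_type": "order_shipped", "order_id": 123})
```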
In this example, the emitter function is called `send_analytics_payload`, and the parameter we pass into it is called `payload`. Event types are distinguished by the payload's `_type` property. Now that we've narrowed down the emitter function, its payload parameter name, and the distinguishing payload property, we're ready to analyze this code.

Example: Registering PySpark Projects
- `--source-type` - Set to `pyspark` for PySpark projects
- `--project-root` - The directory containing the PySpark job to be analyzed
- `--spark-job-entrypoint` - The command to execute the Spark job, including any arguments
- `--connection-string` - Connection string to the Hive metastore
- `--csv-schema-file` - Path to a CSV file containing the schema of upstream tables, formatted with columns `table_name`, `column_name`, and `column_type`

Example Schema CSV Content:
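The exact contents depend on your upstream tables. A hypothetical schema file for two tables (table and column names invented for illustration) might look like the `SCHEMA_CSV` string below, read here with Python's `csv` module to show the expected three-column shape:

```python
# Hypothetical --csv-schema-file contents; table and column names are
# invented. The required columns are table_name, column_name, column_type.
import csv
import io

SCHEMA_CSV = """\
table_name,column_name,column_type
orders,order_id,int
orders,customer_email,string
shipments,shipment_date,date
"""

# Group columns by table, as a consumer of this file might.
columns_by_table: dict[str, list[tuple[str, str]]] = {}
for row in csv.DictReader(io.StringIO(SCHEMA_CSV)):
    columns_by_table.setdefault(row["table_name"], []).append(
        (row["column_name"], row["column_type"])
    )
```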
`--library segment`. If we don't yet support the library you use, you can instead specify a User Defined Function (UDF): a shared helper function within your code that is used to publish events.
Example: Registering Typescript Projects (Supported Library)
Example: Registering Typescript Projects (UDF: Event Name Parameter)
In this example, a function parameter (`eventName`) is used to set the event name when publishing.

Example Event Publishing UDF

Example: Registering Typescript Projects (UDF: Event Name Key)
- `--source-type` - Set to `typescript` to register events in Typescript
- `--project-root` - The directory containing the Typescript project to be analyzed
- `--library` - The natively supported library used to publish data, usually events
- `--emitter-file-path src/lib/events.ts` - The path to the file containing the UDF
- `--emitter-function trackEvent` - The name of the UDF
- `--emitter-payload-parameter eventProperties` - The name of the function parameter representing the event payload
- `--emitter-name-parameter eventName` - [Optional] The name of the function parameter representing the event name. Use either this option or `--event-name-key __event_name`. See the examples above.
- `--event-name-key __event_name` - [Optional] The name of the event property representing the event name. Use either this option or `--emitter-name-parameter eventName`. See the examples above.

Example S3 Data Asset:
`/{fileType}/{YYYY}/{MM}/{DD}/*.csv` naming convention. Our client will recognize data matching this convention and sample slices of it to recognize types, which can then be registered as Gable assets. One such file that we might sample in this hypothetical example is `s3://my-bucket/csv/2024/04/06/shipments.csv`.
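As a sketch of how such a convention can be matched (illustrative only, not Gable's actual sampling logic), a key can be split into its file type and date partition:

```python
# Illustrative parser for keys following {fileType}/{YYYY}/{MM}/{DD}/*.csv;
# not Gable's implementation.
import re

KEY_PATTERN = re.compile(
    r"^(?P<file_type>[^/]+)/(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/[^/]+\.csv$"
)


def parse_key(key: str):
    """Return (file_type, ISO date) for a matching key, else None."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    return match["file_type"], f"{match['y']}-{match['m']}-{match['d']}"
```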
Contents:

| Supported Files |
| --- |
| CSV |
| JSON |

| CSV Types | Gable Types |
| --- | --- |
| Number | int |
| String | string |
| Date | string |

| JSON Types | Gable Types |
| --- | --- |
| int | int |
| string | string |
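A rough sketch of the CSV-side mapping in the tables above (assumed detection logic, not Gable's actual sampler): integer-like values map to `int`, while dates and everything else map to `string`.

```python
# Assumed type-detection logic for sampled CSV values; Gable's real
# sampler may differ. Mirrors the CSV Types -> Gable Types table above.
from datetime import datetime


def gable_type_for_csv_value(value: str) -> str:
    try:
        int(value)
        return "int"  # CSV Number -> Gable int
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "string"  # CSV Date -> Gable string
    except ValueError:
        return "string"  # CSV String -> Gable string
```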
- `--source-type` - Set to `s3` to register data assets in an S3 bucket
- `--bucket` - Name of the S3 bucket
- `--include-prefix` - Specifies which prefixes in your S3 bucket to include. If not specified, all files in the bucket will be analyzed. Multiple include prefixes can be specified, like `--include-prefix 'path/prefix1' --include-prefix 'path/prefix2'`
- `--exclude-prefix` - Specifies which prefixes in your S3 bucket to exclude. If include prefixes are also specified, then an exclude prefix must be a subset of at least one include prefix. Multiple exclude prefixes can be specified, like `--exclude-prefix 'path/prefix1' --exclude-prefix 'path/prefix2'`
- `--lookback-days` - Determines the number of days to include before the latest date in the found S3 paths. Defaults to 2. For example, if the latest path is 2024-01-02 and lookback days is 3, then the paths analyzed will contain 2024-01-02, 2024-01-01, and 2023-12-31.
- `--history` - Allows you to perform an analysis spanning two specific dates.
- `--skip-profiling` - Turns off data profiling of found schemas, saving processing time.
- `--row-sample-count` - Specifies the number of rows of data per file to sample for schema detection and data profiling. Defaults to 1000. As this number increases, accuracy increases, but processing time and AWS (read) costs also increase.
- `--recent-file-count` - Specifies the number of most recent files whose schema will be used for inference per data asset. Defaults to 3. For example, if the latest file is 2024/01/10 and `--recent-file-count` is 2, then only the 2024/01/10 and 2024/01/09 files will be used for schema inference, even if `--lookback-days` is greater than 2. Increase this value to improve schema accuracy over more schema history, at the cost of increased runtime. Must be at least 1.

Example S3 Data Asset: Usage with Apache Airflow
`gable` in an isolated virtual environment. Below is a simple example DAG with a single `PythonVirtualenvOperator` task that registers the assets in an S3 bucket called `my-cool-bucket`:
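A minimal sketch of such a DAG, assuming Airflow 2.4+; the DAG id, schedule, and flag values are illustrative, and any Gable credentials must be available in the task's environment:

```python
# Sketch of an Airflow DAG that registers S3 data assets with the gable
# CLI from inside an isolated virtualenv. Names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def register_s3_assets():
    # Runs inside the task's virtualenv, where the gable package is
    # installed (see `requirements` below); invoke the CLI from the
    # venv's bin directory.
    import os
    import subprocess
    import sys

    gable_cli = os.path.join(os.path.dirname(sys.executable), "gable")
    subprocess.run(
        [
            gable_cli, "data-asset", "register",
            "--source-type", "s3",
            "--bucket", "my-cool-bucket",
        ],
        check=True,
    )


with DAG(
    dag_id="register_gable_s3_assets",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="register_s3_assets",
        python_callable=register_s3_assets,
        requirements=["gable"],
    )
```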
Example: Usage with Prefect
Below is a simple example flow that registers the assets in an S3 bucket called `my-cool-bucket`:

Example: Usage with Dagster
Below is a simple example that registers the assets in an S3 bucket called `my-cool-bucket`: