Your guide to loading quality test data

Developing dashboards and analytics models with live test data is extremely useful. However, it can be a pain to get test data that is free of personal information and have it loaded periodically into a data landing zone. In this guide we share an approach that uses the same Open ETL Tools found in the Matatika platform, but we show you how to run these tools without any dependency on our software whatsoever. See the conclusion at the end of this post for a few reasons why we think this same test data loading approach is even better when hosted inside our platform.

 

Help! What do we mean by test data?

Test data should be realistic in volume, shape, and freshness. When developing a new model or dashboard, it is often the unexpected characteristics of real data that make a project unstable, or cause stakeholders to lose trust in your analytics because something “doesn’t look right”. Keep their trust by developing with realistic, but safe-to-use and sanitised, test data.

 

Example test data project

In this example test data project, we need to load data from Salesforce to a Postgres data warehouse. This Postgres database is likely a business intelligence optimised version of Postgres such as AlloyDB, which Google developed to automatically tune your instance for data warehouse performance, scale, and availability. Bear in mind that this same test data load technique could be used for Snowflake, BigQuery, Oracle, DuckDB, or dozens of other target databases that Matatika supports.
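For example, swapping the loader is typically just a couple of commands. The plugin name below is the standard MeltanoHub target-snowflake plugin, shown purely as an illustration; pick the target and variant that suit your warehouse:

meltano add loader target-snowflake

meltano run --no-state-update tap-salesforce target-snowflake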

 

We’ve created this Salesforce to Postgres example project to make it easy to select your streams and obfuscate only your sensitive columns. We know that it’s always difficult to create a one-size-fits-all project, so here we show a few different examples depending on your use case.

Discover streams

When loading test data, we often see that clients only need to load a few tables or objects rather than everything.  In our platform, we call these tables or objects “streams”.

 

To discover streams for selection, run:

 

meltano select tap-salesforce --list --all

 

The output of this command can be quite large, so it may be useful to redirect to a file and grep (or use a text editor to search) for properties you are interested in:

 

meltano select tap-salesforce --list --all > select.out
grep <pattern> select.out

 

Include streams

Using meltano select, we can include the specific streams we want to extract from Salesforce.

 

meltano select tap-salesforce LoginHistory

meltano select tap-salesforce LoginGeo

 

This will update your meltano.yml to include the LoginHistory and LoginGeo streams:

 

meltano.yml

 

plugins:
  extractors:
  - name: tap-salesforce
    variant: matatika
    pip_url: git+https://github.com/Matatika/[email protected]
    select:
    - 'LoginHistory.*'  # include stream (all properties)
    - 'LoginGeo.*'

 

This will extract the LoginHistory and LoginGeo streams only. When running with a database loader like target-postgres, this means that only LoginHistory and LoginGeo tables will be created.

 

Run the sync:

 

meltano install

meltano run --no-state-update tap-salesforce target-postgres

Exclude properties of streams

We can also use meltano select to exclude individual properties of a stream that contain sensitive information. The following command updates your meltano.yml automatically:

 

meltano select tap-salesforce --exclude LoginHistory SourceIp

 

This will update your meltano.yml to exclude the SourceIp property of the LoginHistory stream:

 

meltano.yml

 

plugins:
  extractors:
  - name: tap-salesforce
    variant: matatika
    pip_url: git+https://github.com/Matatika/[email protected]
    select:
    - 'LoginHistory.*'  # include stream (all properties)
    - '!LoginHistory.SourceIp'  # exclude specific property of stream

 

This will extract all properties of the LoginHistory stream, excluding SourceIp. In the case of running with a database loader like target-postgres, this means that the SourceIp column will not be created for the LoginHistory table.
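You can verify the resulting selection at any time before running a sync:

meltano select tap-salesforce --list

This lists the streams and properties currently selected by the rules in your meltano.yml.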

 

Run the sync:

 

meltano install

meltano run --no-state-update tap-salesforce target-postgres

 

See the Meltano documentation for more information on selecting data for extraction.

Remove sensitive properties

An alternative to meltano select for removing sensitive data is using the meltano-map-transformer mapper plugin. This plugin can be used to apply inline transformations to extracted data before it is loaded, such as to remove or obfuscate sensitive data in this case. Note that using meltano-map-transformer will add an extra dependency to your project.
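If the mapper is not already part of your project, you can add it from MeltanoHub before configuring any mappings:

meltano add mapper meltano-map-transformer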

 

To remove a property of a stream using meltano-map-transformer, supply the following stream_maps configuration for a named mapping:

 

meltano.yml

 

plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginHistory:
            SourceIp: __NULL__  # remove property of stream

 

This is functionally equivalent to the !LoginHistory.SourceIp exclude select above, although note that with the mapper the property is still extracted from Salesforce and only dropped before loading.

 

Run the sync:

 

meltano install

meltano run --no-state-update tap-salesforce salesforce-obfuscate-sensitive-data target-postgres

Set NULL values for sensitive properties

To set NULL values for a property of a stream using meltano-map-transformer, supply the following stream_maps configuration for a named mapping:

 

meltano.yml

 

plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginGeo:
            Latitude: None  # set `NULL` for property of stream
            Longitude: None

 

Run the sync:

 

meltano install

meltano run --no-state-update tap-salesforce salesforce-obfuscate-sensitive-data target-postgres

Set mock values for sensitive properties

To set static mock values for a property of a stream using meltano-map-transformer, supply the following stream_maps configuration for a named mapping:

meltano.yml

 

plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginGeo:
            City: '"Berkhamsted"'  # set mock value for property of stream
            PostalCode: '"HP4 1EH"'
            Subdivision: '"Hertfordshire"'

 

See the Meltano SDK stream maps documentation for how to construct more complex mapping expressions for dynamic mock values.
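As a rough illustration only (this mapping is our own example, not part of the project above), an expression can reference the incoming record’s own properties. The sketch below replaces a raw IP with a hash, so rows stay distinguishable without exposing the real value:

meltano.yml

plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginHistory:
            SourceIp: md5(SourceIp)  # replace the raw value with a hash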

 

Run the sync:

 

meltano install

meltano run --no-state-update tap-salesforce salesforce-obfuscate-sensitive-data target-postgres

 

Hosting the project

In an ideal world you’d already have access to Matatika’s amazing technology, but sometimes your IT security just hasn’t caught up yet, or you need to move a bit faster for some other reason. Fear not – it’s very easy to host this same project inside your Azure, AWS or Google Cloud and then move to a Matatika-hosted project later. See the conclusion at the end of this post for our opinion on why Matatika is the best option!

Build the project image and push to a container registry

docker login "$REGISTRY_SERVER_NAME" \
  -u "$REGISTRY_USERNAME" \
  -p "$REGISTRY_PASSWORD"
docker build -t "$REGISTRY_SERVER_NAME/salesforce-to-postgres" .
docker push "$REGISTRY_SERVER_NAME/salesforce-to-postgres"
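The Dockerfile itself is not shown in this post. A minimal sketch, assuming the public meltano/meltano base image, might look something like this:

Dockerfile

FROM meltano/meltano:latest

# copy the Meltano project (meltano.yml and related config) into the image
WORKDIR /project
COPY . .

# install the project plugins at build time so the job starts quickly
RUN meltano install

# run the Salesforce to Postgres sync by default
ENTRYPOINT ["meltano"]
CMD ["run", "--no-state-update", "tap-salesforce", "target-postgres"]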

 

Create an Azure Container App job

We can create an Azure Container App job to run the sync in Azure:

Setup

ENVIRONMENT=<environment name>
RESOURCE_GROUP=<resource group name>
LOCATION='UK South'

az login
az upgrade
az extension add --name containerapp --upgrade
az provider register --namespace Microsoft.App
az provider register --namespace Microsoft.OperationalInsights

az containerapp env create \
  --name "$ENVIRONMENT" \
  --resource-group "$RESOURCE_GROUP" \
  --location "$LOCATION"

 

Create

ENVIRONMENT=<environment name>
RESOURCE_GROUP=<resource group name>
JOB_NAME=salesforce-to-postgres  # job names must be lowercase letters, numbers and hyphens
ACR_SERVER_NAME=<Azure container registry server name>
ACR_USERNAME=<Azure container registry username>
ACR_PASSWORD=<Azure container registry password>
IMAGE="$ACR_SERVER_NAME/salesforce-to-postgres:latest"

#TRIGGER_TYPE=Manual
TRIGGER_TYPE=Schedule
SCHEDULE=<cron expression>

SECRETS="\
  salesforce.password=<password> \
  salesforce.securityToken=<security token> \
  postgres.password=<password>"

ENV_VARS="\
  TAP_SALESFORCE_USERNAME=<username> \
  TAP_SALESFORCE_PASSWORD=secretref:salesforce.password \
  TAP_SALESFORCE_SECURITY_TOKEN=secretref:salesforce.securityToken \
  TARGET_POSTGRES_HOST=<host> \
  TARGET_POSTGRES_PORT=<port> \
  TARGET_POSTGRES_USER=<user> \
  TARGET_POSTGRES_PASSWORD=secretref:postgres.password \
  TARGET_POSTGRES_DATABASE=<database name>"

az containerapp job delete \
  --yes \
  --name "$JOB_NAME" \
  --resource-group "$RESOURCE_GROUP"

az containerapp job create \
  --name "$JOB_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --environment "$ENVIRONMENT" \
  --registry-server "$ACR_SERVER_NAME" \
  --registry-username "$ACR_USERNAME" \
  --registry-password "$ACR_PASSWORD" \
  --image "$IMAGE" \
  --secrets $SECRETS \
  --env-vars $ENV_VARS \
  --trigger-type $TRIGGER_TYPE \
  --cron-expression "$SCHEDULE" \
  --cpu 1 \
  --memory 2Gi \
  --replica-timeout 1800 \
  --replica-retry-limit 0 \
  --replica-completion-count 1 \
  --parallelism 1

 

You can override the selected streams in the project meltano.yml by providing a JSON array of select patterns for TAP_SALESFORCE__SELECT in ENV_VARS. You can do the same for any other Meltano setting.
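For example, the equivalent override when running the project locally (using the select patterns from earlier in this post) would be:

export TAP_SALESFORCE__SELECT='["LoginHistory.*", "!LoginHistory.SourceIp"]'

meltano run --no-state-update tap-salesforce target-postgres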

Deploy changes with Azure DevOps release pipelines

You can create a release pipeline to automatically build the project image and push it to a container registry. Simply set the docker build commands above as the inline script for a Bash task (sketched below) and configure the continuous deployment trigger on push to the main branch.

Release pipelines can also be used to deploy Azure Container App jobs in various configurations, using the container app commands above as the inline script for an Azure CLI task.
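As a rough sketch of the build stage, an equivalent YAML pipeline might look like the following; the pipeline variable names here are placeholders of our own, not values from the project:

azure-pipelines.yml

trigger:
  branches:
    include:
    - main

pool:
  vmImage: ubuntu-latest

steps:
- task: Bash@3
  displayName: Build and push project image
  inputs:
    targetType: inline
    script: |
      docker login "$REGISTRY_SERVER_NAME" -u "$REGISTRY_USERNAME" -p "$REGISTRY_PASSWORD"
      docker build -t "$REGISTRY_SERVER_NAME/salesforce-to-postgres" .
      docker push "$REGISTRY_SERVER_NAME/salesforce-to-postgres"
  env:
    REGISTRY_SERVER_NAME: $(registryServerName)
    REGISTRY_USERNAME: $(registryUsername)
    REGISTRY_PASSWORD: $(registryPassword)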

 

Conclusion

Hosting a project that continuously updates your test data for development is a great idea. This way your dashboards come to life and stay alive with real data, but without the risk of any data breach or data escaping during development. Hosting this project yourself is potentially a way forward that keeps IT happy until you have permission to host with the full power of the Matatika Platform, but in our experience you are missing out on a few things.

 

It takes a lot longer to set up

To set up the test project, you need to build a container image, push it to a registry, create cloud roles to deploy and run the job, and set up logging infrastructure if you want any kind of debugging. In Matatika, deployment from an existing Git repository takes moments.

 

It’s more difficult to configure

Within Matatika we make pipeline and project settings easy to configure and manage across different environments. We also take care of storing all sensitive settings in fields that are encrypted at rest.

 

It’s more difficult for others to launch the job

Getting access to the cloud environment becomes yet another request to IT. Fine-grained access control can be a pain to set up, and it is almost impossible to restrict access to just launching the job.

 

It’s more difficult to debug

Using Azure Log Analytics, we found two things particularly painful:

  1. The logs are difficult to search and format – if you’ve used it you know what I mean. Horrible.
  2. The logs arrive after a minimum of 5 minutes and are not guaranteed to ever arrive. Debugging takes forever in this environment.

In Matatika, the stream of logs is available directly in our Lab and the whole project can be run locally with ease.

 

Enhance your test load capabilities with our performance monitoring, automated schedules, and data quality reporting

As part of our platform we gather statistics on when jobs run and how much data flows through them, and we make data observability a breeze with Elementary-based data quality reporting, available by simply installing the plugin into our platform.


Interested in learning more?

Get in touch