It is very useful to develop dashboards and analytics models with live test data. However, it can be a pain to get test data that is free from personal information and have it loaded periodically into a data landing zone. In this guide we share an approach that uses the same Open ETL Tools from the Matatika platform, but we show you how to run these tools without any dependencies on our software whatsoever. See our summary for a few reasons why we think this same test data loading approach is even better when hosted inside our platform.
Help! What do we mean by test data?
Test data should be realistic in volume, shape, and freshness. Often, when developing a new model or dashboard, it is unexpected volume, shape, or freshness in the data that makes a project unstable or causes stakeholders to lose trust in your analytics because something “doesn’t look right”. Keep their trust by developing with realistic, but safe-to-use and sanitised, test data.
In this example test data project, we need to load data from Salesforce to a Postgres data warehouse. This Postgres database is likely a business-intelligence-optimised version of Postgres such as AlloyDB, which Google developed to automatically tune your instance for data warehouse performance, scale, and availability. Bear in mind that this same test data load technique could be used for Snowflake, BigQuery, Oracle, DuckDB or dozens of other target databases that Matatika supports.
We’ve created this Salesforce to Postgres example project to make it easy to select your streams and obfuscate only your sensitive columns. We know that it’s always difficult to create a one-size-fits-all project, so here we show a few different examples depending on your use case.
When loading test data, we often see that clients only need to load a few tables or objects rather than everything. In our platform, we call these tables or objects “streams”.
To discover streams for selection, run:
meltano select tap-salesforce --list --all
The output of this command can be quite large, so it may be useful to redirect to a file and grep (or use a text editor to search) for properties you are interested in:
meltano select tap-salesforce --list --all > select.out
grep <pattern> select.out
Using meltano select, we can include the specific streams we want to extract from Salesforce.
meltano select tap-salesforce LoginHistory
meltano select tap-salesforce LoginGeo
This will update your meltano.yml to include the LoginHistory and LoginGeo streams:
meltano.yml
plugins:
  extractors:
  - name: tap-salesforce
    variant: matatika
    pip_url: git+https://github.com/Matatika/[email protected]
    select:
    - 'LoginHistory.*' # include stream (all properties)
    - 'LoginGeo.*'
This will extract the LoginHistory and LoginGeo streams only. When running with a database loader like target-postgres, this means that only LoginHistory and LoginGeo tables will be created.
Run the sync:
meltano install
meltano run --no-state-update tap-salesforce target-postgres
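These commands assume target-postgres has already been added and configured as a loader in the project. If you are starting from scratch, a minimal loader configuration might look something like the sketch below, assuming the MeltanoLabs variant; the connection values are placeholders, and the password is best supplied through the TARGET_POSTGRES_PASSWORD environment variable rather than committed to meltano.yml.
meltano.yml
plugins:
  loaders:
  - name: target-postgres
    variant: meltanolabs
    pip_url: meltanolabs-target-postgres
    config:
      host: <host>
      port: 5432
      user: <user>
      database: <database name>
      default_target_schema: salesforce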
We can also use meltano select to exclude individual properties of a stream that contain sensitive information. The following command updates your meltano.yml automatically:
meltano select tap-salesforce --exclude LoginHistory SourceIp
This will update your meltano.yml to exclude the SourceIp property of the LoginHistory stream:
meltano.yml
plugins:
  extractors:
  - name: tap-salesforce
    variant: matatika
    pip_url: git+https://github.com/Matatika/[email protected]
    select:
    - 'LoginHistory.*' # include stream (all properties)
    - '!LoginHistory.SourceIp' # exclude specific property of stream
This will extract all properties of the LoginHistory stream, excluding SourceIp. In the case of running with a database loader like target-postgres, this means that the SourceIp column will not be created for the LoginHistory table.
Run the sync:
meltano install
meltano run --no-state-update tap-salesforce target-postgres
See the Meltano documentation for more information on selecting data for extraction.
An alternative to meltano select for removing sensitive data is using the meltano-map-transformer mapper plugin. This plugin can be used to apply inline transformations to extracted data before it is loaded, such as to remove or obfuscate sensitive data in this case. Note that using meltano-map-transformer will add an extra dependency to your project.
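If you prefer to add the plugin with the CLI rather than declaring it in meltano.yml by hand, Meltano's standard plugin command will do it:
meltano add mapper meltano-map-transformer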
To remove a property of a stream using meltano-map-transformer, supply the following stream_maps configuration for a named mapping:
meltano.yml
plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginHistory:
            SourceIp: __NULL__ # remove property of stream
This is functionally equivalent to the !LoginHistory.SourceIp exclude select above.
Run the sync:
meltano install
meltano run --no-state-update tap-salesforce salesforce-obfuscate-sensitive-data target-postgres
To set NULL values for a property of a stream using meltano-map-transformer, supply the following stream_maps configuration for a named mapping:
meltano.yml
plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginGeo:
            Latitude: None # set `NULL` for property of stream
            Longitude: None
Run the sync:
meltano install
meltano run --no-state-update tap-salesforce salesforce-obfuscate-sensitive-data target-postgres
To set static mock values for a property of a stream using meltano-map-transformer, supply the following stream_maps configuration for a named mapping:
meltano.yml
plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    mappings:
    - name: salesforce-obfuscate-sensitive-data
      config:
        stream_maps:
          LoginGeo:
            City: '"Berkhamsted"' # set mock value for property of stream
            PostalCode: '"HP4 1EH"'
            Subdivision: '"Hertfordshire"'
See the stream maps documentation for how to construct more complex mapping expressions for dynamic mock values.
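As a quick sketch of what a dynamic expression can look like: stream map values are evaluated per record, properties can be referenced by name inside the expression, and md5 is one of the built-in functions available. The snippet below is illustrative and slots into the same stream_maps block as the mappings above, hashing a value rather than removing it.
stream_maps:
  LoginHistory:
    SourceIp: md5(SourceIp) # replace the raw value with its MD5 hash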
Run the sync:
meltano install
meltano run --no-state-update tap-salesforce salesforce-obfuscate-sensitive-data target-postgres
In an ideal world you’d already have access to Matatika’s amazing technology, but sometimes your IT security just hasn’t got up to speed yet or you need to move a bit faster for some other reason. Fear not – it’s very easy to host this same project inside your Azure, AWS or Google Cloud and then move to a Matatika hosted project later. See the conclusion at the end of this post for our opinion why Matatika’s the best option!
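The example project is assumed to ship with a Dockerfile that packages the Meltano project; a minimal sketch based on the official meltano/meltano image might look like this, with the default command mirroring the sync used throughout this guide.
Dockerfile
FROM meltano/meltano:latest

WORKDIR /project

# copy the Meltano project into the image and install its plugins
COPY . .
RUN meltano install

# run the Salesforce-to-Postgres sync by default
ENTRYPOINT ["meltano"]
CMD ["run", "--no-state-update", "tap-salesforce", "salesforce-obfuscate-sensitive-data", "target-postgres"]
With an image defined, build and push it to your container registry: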
docker login "$REGISTRY_SERVER_NAME" \
-u "$REGISTRY_USERNAME" \
-p "$REGISTRY_PASSWORD"
docker build -t "$REGISTRY_SERVER_NAME/salesforce-to-postgres" .
docker push "$REGISTRY_SERVER_NAME/salesforce-to-postgres"
We can create an Azure Container App job to run the sync in Azure. First, create a Container Apps environment for the job:
ENVIRONMENT=<environment name>
RESOURCE_GROUP=<resource group name>
LOCATION='UK South'

az login
az upgrade
az extension add --name containerapp --upgrade
az provider register --namespace Microsoft.App
az provider register --namespace Microsoft.OperationalInsights

az containerapp env create \
  --name "$ENVIRONMENT" \
  --resource-group "$RESOURCE_GROUP" \
  --location "$LOCATION"
Then create the job itself:
ENVIRONMENT=<environment name>
RESOURCE_GROUP=<resource group name>
JOB_NAME='Salesforce to Postgres'

ACR_SERVER_NAME=<Azure container registry server name>
ACR_USERNAME=<Azure container registry username>
ACR_PASSWORD=<Azure container registry password>

IMAGE="$ACR_SERVER_NAME/salesforce-to-postgres:latest"

#TRIGGER_TYPE=Manual
TRIGGER_TYPE=Schedule
SCHEDULE='<cron expression>'

SECRETS="\
  salesforce.password=<password> \
  salesforce.securityToken=<security token> \
  postgres.password=<password>"

ENV_VARS="\
  TAP_SALESFORCE_USERNAME=<username> \
  TAP_SALESFORCE_PASSWORD=secretref:salesforce.password \
  TAP_SALESFORCE_SECURITY_TOKEN=secretref:salesforce.securityToken \
  TARGET_POSTGRES_HOST=<host> \
  TARGET_POSTGRES_PORT=<port> \
  TARGET_POSTGRES_USER=<user> \
  TARGET_POSTGRES_PASSWORD=secretref:postgres.password \
  TARGET_POSTGRES_DATABASE=<database name>"

az containerapp job delete \
  --yes \
  --name "$JOB_NAME" \
  --resource-group "$RESOURCE_GROUP"

az containerapp job create \
  --name "$JOB_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --environment "$ENVIRONMENT" \
  --registry-server "$ACR_SERVER_NAME" \
  --registry-username "$ACR_USERNAME" \
  --registry-password "$ACR_PASSWORD" \
  --image "$IMAGE" \
  --secrets $SECRETS \
  --env-vars $ENV_VARS \
  --trigger-type $TRIGGER_TYPE \
  --cron-expression "$SCHEDULE" \
  --cpu 1 \
  --memory 2Gi \
  --replica-timeout 1800 \
  --replica-retry-limit 0 \
  --replica-completion-count 1 \
  --parallelism 1
You can override the selected streams in the project meltano.yml by providing a JSON array of select patterns for TAP_SALESFORCE__SELECT in ENV_VARS. You can do the same for any Meltano settings.
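For example, adding something like the following to ENV_VARS would reproduce the stream selection from earlier without touching meltano.yml (exact quoting will depend on how you build the ENV_VARS string in your shell):
TAP_SALESFORCE__SELECT='["LoginHistory.*","LoginGeo.*","!LoginHistory.SourceIp"]'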
You can create a release pipeline to automatically build the project image and push to a container registry. Simply set the docker build commands above as the inline script for a Bash task and configure the continuous deployment trigger on push to the main branch.
Release pipelines can also be used to deploy Azure Container App jobs in various configurations, using the container app commands above as the inline script for an Azure CLI task.
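If you would rather define the build as a YAML pipeline than through the classic release pipeline UI, a minimal sketch might look like the following; the variable names ($(registryServerName) and friends) are assumptions and would map onto your own pipeline variables or a variable group.
azure-pipelines.yml
trigger:
- main

pool:
  vmImage: ubuntu-latest

steps:
- task: Bash@3
  displayName: Build and push project image
  inputs:
    targetType: inline
    script: |
      docker login "$REGISTRY_SERVER_NAME" \
        -u "$REGISTRY_USERNAME" \
        -p "$REGISTRY_PASSWORD"
      docker build -t "$REGISTRY_SERVER_NAME/salesforce-to-postgres" .
      docker push "$REGISTRY_SERVER_NAME/salesforce-to-postgres"
  env:
    REGISTRY_SERVER_NAME: $(registryServerName)
    REGISTRY_USERNAME: $(registryUsername)
    REGISTRY_PASSWORD: $(registryPassword)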
Hosting a project that continuously updates your test data for development is a great idea. This way your dashboards come to life and stay alive with real data, but without the risk of any data breach or data escaping during development. Hosting this project yourself is potentially a way forward that keeps IT happy until you have permission to host with the full power of the Matatika Platform, but in our experience you are missing out on a few things.
It takes a lot longer to set up
To set up the test project, you need to build a container, push the container, create cloud roles to create and run the job, and create logging infrastructure if you want any kind of debugging. In Matatika, deployment from an existing Git repository takes moments.
It’s more difficult to configure
Within Matatika, we make pipeline and project settings easy to configure and manage across different environments. We also take care of storing all sensitive settings in fields that are encrypted at rest.
It’s more difficult for others to launch the job
Getting access to the cloud environment becomes yet another request to IT. Fine-grained access control can be a pain to set up, and it is almost impossible to restrict access to just launching the job.
It’s more difficult to debug
We found debugging with Azure Log Analytics particularly painful.
In Matatika, the stream of logs is available directly in our Lab and the whole project can be run locally with ease.
Enhance your test load capabilities with our performance monitoring, automated schedules, and data quality reporting
As part of our platform, we gather statistics on when jobs are run and how much data is flowing through, and we make data observability a breeze with Elementary-based data quality reporting, enabled simply by installing the plugin into our platform.
Interested in learning more? Get in touch and speak with one of our experts to see how we can help your business.
Stay up to date with our insights as they become available.