See the related documents for background.

NOTE: "Demo data" here refers not only to data for demos, but also to data for testing and to performance data for testing against large datasets. All of this is collectively called "demo data".

Demo Data Strategy

Data Generation vs. Loading

There are two aspects to demo data: acquiring data (either by generating it or extracting it from another system) and loading it into a demo/test system. Some approaches handle both parts, while others handle only one and need to be paired with other strategies/tools in order to encompass the whole process.

Revisiting Existing Research on Demo Data

The Demo Data Loading Approach document includes research on possible approaches to demo data. Some thoughts on those options:

Options for Demo Data Generation

Options for Demo Data Loading

Recommended Strategy

For generating data, there are not really multiple options to weigh; we would use Nifi with Mockaroo. Mockaroo can be used to quickly generate "mock data" (data that does not need to carry much meaning), while Nifi can be used to extract more meaningful data from existing systems.
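As a rough illustration of the generation side, here is a minimal Python sketch that pulls rows from Mockaroo's REST API (the generate.json endpoint). The API key environment variable and the saved schema name "demo_facilities" are assumptions for illustration; the exact parameters should be confirmed against Mockaroo's API documentation.

    # Sketch: fetch mock rows from Mockaroo's generate.json endpoint.
    # MOCKAROO_API_KEY and the saved schema name "demo_facilities" are assumptions;
    # adjust to whatever schemas are defined in the Mockaroo account being used.
    import os
    import requests

    MOCKAROO_API_KEY = os.environ["MOCKAROO_API_KEY"]

    def fetch_mock_rows(schema, count=100):
        """Fetch `count` generated rows from a saved Mockaroo schema."""
        resp = requests.get(
            "https://api.mockaroo.com/api/generate.json",
            params={"key": MOCKAROO_API_KEY, "schema": schema, "count": count},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()  # list of dicts, one per generated row

    if __name__ == "__main__":
        rows = fetch_mock_rows("demo_facilities", count=10)
        print(rows[0])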

For loading data, based on performance (and data size) considerations, the recommendation is to use direct SQL loading for very large datasets and perf data loading for all other datasets. This is driven by the performance challenges with the RefData seed tool and with perf data loading, and by the other considerations around Nifi described below.

Nifi Research

This section contains research that is more Nifi-centric than demo-data-oriented.

Nifi seems quite flexible in facilitating the full ETL (extract-transform-load) process for demo data. However, there are two main challenges with it:

  1. Steep learning curve
    1. In how to set up the flows and create templates to get it to do exactly what we want (data ETL to a test server)
    2. In how to use templates we've created and saved in source control to perform an automated deployment of demo data to a server
    3. Additionally, there is little documentation in the way of tutorials and examples
  2. Not easy to start up with a pre-existing configuration for automated deployment
    1. This is related to the first point

For data extraction, Nifi seems to be a useful approach: it can work with Mockaroo to generate new demo data, and it can also work with other data sources (like existing test servers/datastores) to extract data. This does require getting up to speed on Nifi.

For data loading, Nifi has its challenges, and its usefulness here is less clear. The CD process needs to start up a Nifi instance with "embedded" templates and then run them in order to take demo data and load it into a database that Nifi knows about. However, there is no obvious way to start up Nifi with XML template files. There is a templates folder, but it appears to be deprecated as of v1.0; it may still be usable, but this needs to be looked into, and it does not solve how to get templates into the flow graph. All templates and the flow graph seem to live in a gzipped flow XML file, which we would need to replace on Nifi container startup. Any change to the templates or the flow graph would mean recreating that gzipped XML file. This could be done through scripts hitting the Nifi APIs, but that would require getting up to speed on Nifi.
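As a rough sketch of what "scripts hitting the Nifi APIs" could look like, the Python below uploads a saved XML template into a running Nifi instance, instantiates it onto the root process group, and starts it, using Nifi 1.x REST endpoints. The base URL, the template file name, and the assumption that the most recently listed template is the one just uploaded are all illustrative; the endpoints should be verified against the Nifi version we actually deploy.

    # Sketch: push a saved Nifi XML template to a running Nifi instance and start it.
    # NIFI_API and TEMPLATE_FILE are assumptions for illustration.
    import requests

    NIFI_API = "http://localhost:8080/nifi-api"   # assumed Nifi REST base URL
    TEMPLATE_FILE = "demo-data-load.xml"          # hypothetical exported template

    def deploy_template(template_path):
        # 1. Find the root process group id.
        root_id = requests.get(f"{NIFI_API}/process-groups/root").json()["id"]

        # 2. Upload the template XML into the root process group.
        with open(template_path, "rb") as f:
            requests.post(
                f"{NIFI_API}/process-groups/{root_id}/templates/upload",
                files={"template": f},
            ).raise_for_status()

        # 3. Look up the uploaded template's id by listing templates.
        templates = requests.get(f"{NIFI_API}/flow/templates").json()["templates"]
        template_id = templates[-1]["id"]  # naive: assumes the newest upload is listed last

        # 4. Instantiate the template onto the canvas of the root process group.
        requests.post(
            f"{NIFI_API}/process-groups/{root_id}/template-instance",
            json={"templateId": template_id, "originX": 0.0, "originY": 0.0},
        ).raise_for_status()

        # 5. Schedule (start) everything in the root process group.
        requests.put(
            f"{NIFI_API}/flow/process-groups/{root_id}",
            json={"id": root_id, "state": "RUNNING"},
        ).raise_for_status()

    if __name__ == "__main__":
        deploy_template(TEMPLATE_FILE)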

Some spiking has been done on these flows. The resulting Nifi XML template files are in the Ref Distro repository's nifi folder. Further spiking will need to be done to make them more generic and to expand their scope.

Performance Research Details

There seem to be potential performance issues in loading larger performance datasets using the performance data loading mechanism we currently have. Stock data was extracted from UAT2 and a test was done to load it: about 5,000 stock events, 5,000 stock cards, and ~500,000 stock card line items. Loading this data took about 20 minutes on a local install. We will need to recreate this on a test server, and if the problem is still present, there are two options:

  1. Try to use more SQL in our TestDataInitializer to insert (https://stackoverflow.com/questions/12206600/how-to-speed-up-insertion-performance-in-postgresql)
    1. Do things like dropping indices and constraints before insertion and re-adding them afterwards
  2. Drop down to Postgres directly and use its \copy command to load from CSV files (a sketch combining both options follows this list).
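Below is a minimal sketch of option 2 (COPY straight from CSV files into Postgres), combined with option 1's trick of dropping indexes before the load and recreating them afterwards. It uses Python with psycopg2; the connection string, table name, index DDL, and file paths are hypothetical placeholders rather than the real OpenLMIS schema.

    # Sketch: bulk-load CSV extracts with COPY, dropping indexes during the load.
    # DSN, TABLE, INDEXES, and the extracts/ path are assumptions for illustration.
    import glob
    import psycopg2

    DSN = "dbname=open_lmis user=postgres host=localhost"
    TABLE = "stockmanagement.stock_card_line_items"   # hypothetical target table
    INDEXES = {                                       # hypothetical index DDL
        "stock_card_line_items_stockcardid_idx":
            f"CREATE INDEX stock_card_line_items_stockcardid_idx ON {TABLE} (stockcardid)",
    }

    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # Drop indexes so Postgres does not maintain them row-by-row during the load.
        for name in INDEXES:
            cur.execute(f"DROP INDEX IF EXISTS stockmanagement.{name}")

        # COPY each CSV extract via STDIN (equivalent to psql's \copy, streamed from the client).
        for path in sorted(glob.glob("extracts/stock_card_line_items_*.csv")):
            with open(path) as f:
                cur.copy_expert(
                    f"COPY {TABLE} FROM STDIN WITH (FORMAT csv, HEADER)", f)

        # Recreate the indexes once, after all rows are in.
        for ddl in INDEXES.values():
            cur.execute(ddl)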

A quick spike was done locally for option 2 above. It was much faster, inserting 7 CSV files (~2 million rows) in 36 seconds.

No spiking was done on the performance of loading data using the RefData seed tool. Since it uses the OpenLMIS APIs to load data, and those APIs as a whole are currently not performant, it is assumed that loading large datasets with it would be quite slow.

Useful Additional Spikes