Demo Data for fun and profit

See OLMIS-4658 (and OLMIS-4493) for background.

NOTE: When demo data is mentioned, this does not mean only data for demos; it also includes test data and large datasets for performance testing. All of this is referred to as "demo data".

Demo Data Strategy

Data Generation vs. Loading

There are two aspects to demo data: acquiring data (either by generating it or extracting it from another system) and loading it into a demo/test system. Some approaches handle both parts, while others handle only one and need to be paired with other strategies/tools in order to encompass the whole process.

Revisiting Existing Research on Demo Data

The Demo Data Loading Approach document did some research on possible approaches to demo data. Some thoughts:

  • Even though the matrix shows three options, the third option is essentially a much more preferable variant of the first, so there are really only two options.
  • Both viable options (the RefData seed tool and the performance data loading mechanism) handle only the data loading aspect; both assume that data has already been acquired and put into CSV format.
  • Neither option was researched or tested with large demo datasets. Some quick recent research indicates that neither will be performant when loading large datasets:
    • The comments section shows that the seed tool will most likely be slow, since it loads data through the OpenLMIS APIs rather than directly to the database.
    • Recent testing in stock management indicates that loading half a million stock card line items took around 20 minutes against a local instance of stock management and Postgres.

Options for Demo Data Generation

  • Mockaroo - can quickly and fairly easily generate demo data. It is strong at generic data, which is fine for testing but less useful for demo-able data. Our account is limited in how much can be generated manually (100,000 rows at a time), so generating more would require automating generation through the Mockaroo APIs (see the sketch after this list). Output is CSVs stored somewhere for demo data loading.
  • Nifi - the most promising option. It can work with Mockaroo to pull generated data, but can also acquire data from other sources (such as an existing demo system), both in an automated fashion. Output could be CSVs stored somewhere, or no intermediate output at all if Nifi is also used for loading ("one-stop shopping"), in which case Nifi's extraction output becomes its own input for loading into a demo database.
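As a rough sketch of what automated generation might look like, the snippet below pulls CSV batches from Mockaroo's generate endpoint. The API key, saved schema name, and row counts are hypothetical placeholders; this assumes a schema saved in our Mockaroo account and has not been tested against it.

```python
import requests

# Hypothetical placeholders -- substitute our real API key and saved schema name.
MOCKAROO_KEY = "XXXXXXXX"
SCHEMA = "stock_card_line_items"
BATCH = 100_000  # manual generation limit per request on our account

def fetch_batch(count):
    """Fetch one CSV batch from Mockaroo's generate endpoint."""
    resp = requests.get(
        "https://api.mockaroo.com/api/generate.csv",
        params={"key": MOCKAROO_KEY, "schema": SCHEMA, "count": count},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.text

# Accumulate several batches into one CSV file for later loading,
# keeping the header row from the first batch only.
with open("stock_card_line_items.csv", "w") as out:
    for i in range(5):  # 5 x 100,000 rows
        csv_text = fetch_batch(BATCH)
        out.write(csv_text if i == 0 else csv_text.split("\n", 1)[1])
```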

Options for Demo Data Loading

  • RefData seed tool - has potential with some work, but the biggest issue seems to be performance. With large datasets, loading is likely to take too long, since the tool goes through the OpenLMIS APIs to load data.
  • Performance Data loading mechanism - has potential, but based on some tests the biggest issue still seems to be performance. It uses JDBC instead of the APIs, so it should be faster, but the inserts may still take a long time.
  • Nifi - possible, but seems to require some effort, as there is no simple mechanism to load templates and run processors on Nifi startup. It would likely need to be done with scripts that call the Nifi APIs. Some work would also be required to create the templates.
  • SQL direct loading - using Postgres SQL to ingest CSV files directly when deploying. For ref-distro, this could mean extending the Postgres image to include SQL scripts and CSV files. For test servers, this could mean using the run-sql image to execute volume-mounted SQL scripts with CSV files. A sketch of this approach follows this list.
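As a minimal sketch of SQL direct loading, the snippet below streams a CSV file into a table with Postgres' COPY via psycopg2 (the client-side equivalent of psql's \copy). The connection settings, schema, table, and file names are hypothetical; in ref-distro the same COPY statements would instead live in plain SQL scripts baked into the Postgres image or executed by the run-sql image.

```python
import psycopg2

# Hypothetical connection settings and table/file names.
conn = psycopg2.connect("dbname=open_lmis user=postgres host=localhost")

with conn, conn.cursor() as cur, open("orders.csv") as f:
    # copy_expert streams the file from the client, so the CSV does not
    # need to be readable by the Postgres server process itself.
    cur.copy_expert(
        "COPY fulfillment.orders FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
```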

Recommended Strategy

For generating data, there isn't really a choice to make; we would use Nifi together with Mockaroo. Mockaroo can be used to quickly generate "mock data" (data that does not need much meaning), and Nifi can be used to extract more meaningful data.

For loading data, the recommendation is SQL direct loading for very large datasets and the performance data loading mechanism for all other datasets. This is driven by the performance challenges of the RefData seed tool and the performance data loading mechanism at scale, and by the startup/automation considerations for Nifi described below.

Nifi Research

This section covers research that is more Nifi-centric than demo data oriented.

Nifi seems quite flexible in facilitating the full ETL (extract-transform-load) process for demo data. However, there are two main challenges:

  1. Steep learning curve
    1. In how to set up the flows and create templates to get it to do exactly what we want (data ETL to a test server)
    2. In how to use templates we've created and saved in source control to perform an automated deployment of demo data to a server
    3. Additionally, documentation is sparse; there are not many tutorials and examples
  2. Not easy to start up with a pre-existing configuration, for automated deployment
    1. This is related to the first point

For data extraction, Nifi seems to be a useful approach, as it can work with Mockaroo to generate new demo data, but also can work with other data sources to extract data (like existing test servers/datastores). This does require getting up to speed on Nifi.

For data loading, Nifi has its challenges, and its usefulness here is less clear. The CD process would need to start up a Nifi instance with "embedded" templates and then run them in order to take demo data and load it into a database Nifi knows about. However, there is no obvious way to start Nifi with XML template files. There is a templates folder, but it appears to be deprecated as of v1.0; it may still be usable, but that needs to be looked into, and it does not solve how to get templates into the flow graph. All templates and the flow graph seem to live in a gzipped flow XML file, which we would need to replace on Nifi container startup; changing the templates or the flow graph would mean recreating that gzipped file. This could be done through scripts hitting the Nifi APIs (a sketch follows), but that would also require getting up to speed on Nifi.
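A possible shape for such a script is below: after the container is up, upload a saved template into the root process group via the Nifi REST API, instantiate it onto the flow graph, and start the processors. This is an untested sketch assuming Nifi 1.x's REST endpoints and an unsecured local instance; the host, port, and template file name are placeholders.

```python
import requests
import xml.etree.ElementTree as ET

NIFI = "http://localhost:8080/nifi-api"  # hypothetical unsecured local Nifi

# "root" is accepted as an alias for the root process group id.
pg_id = requests.get(f"{NIFI}/process-groups/root").json()["id"]

# Upload a saved template (XML file, e.g. from the ref-distro nifi folder).
with open("load-orders-template.xml", "rb") as f:
    resp = requests.post(
        f"{NIFI}/process-groups/{pg_id}/templates/upload",
        files={"template": f},
    )
resp.raise_for_status()
# The upload response is XML; pull the new template's id out of it.
template_id = ET.fromstring(resp.content).find("template/id").text

# Instantiate the template onto the flow graph at the canvas origin.
requests.post(
    f"{NIFI}/process-groups/{pg_id}/template-instance",
    json={"templateId": template_id, "originX": 0.0, "originY": 0.0},
).raise_for_status()

# Schedule every processor in the group to run.
requests.put(
    f"{NIFI}/flow/process-groups/{pg_id}",
    json={"id": pg_id, "state": "RUNNING"},
).raise_for_status()
```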

Some spiking has been done to:

  • Determine how to load a CSV data file into a database table (only for Fulfillment's Orders table)
  • Determine how to get a list of database tables and extract their data into CSV files (only for Stock Management tables)

These Nifi XML template files are in the Ref Distro repository's nifi folder. Further spiking will need to be done to make these more generic and expand their scope.

Performance Research Details

There seem to be some potential performance issues in loading bigger datasets using the performance data loading mechanism we currently have. Stock data was extracted from UAT2 and a test was done to attempt to load it: about 5,000 stock events, 5,000 stock cards, and ~500,000 stock card line items. The loading took about 20 minutes on a local install. We'll need to recreate this on a test server; if the problem is still present, there are two options:

  1. Try to use more SQL in our TestDataInitializer to insert (https://stackoverflow.com/questions/12206600/how-to-speed-up-insertion-performance-in-postgresql)
    1. Do things like dropping indices and constraints before insertion and re-adding them afterwards
  2. Drop down to Postgres' SQL directly and use its \copy from file commands (a sketch combining both ideas follows this list).
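As a rough sketch of both ideas together, the snippet below drops an index, bulk-loads a CSV with COPY (what \copy does, driven from the client side), and rebuilds the index once all rows are in. The index, column, and table names are hypothetical stand-ins for the real stock management schema.

```python
import psycopg2

conn = psycopg2.connect("dbname=open_lmis user=postgres host=localhost")

with conn, conn.cursor() as cur:
    # Option 1: drop the index so each inserted row is cheaper to write...
    cur.execute("DROP INDEX IF EXISTS stockmanagement.line_items_card_idx")

    # Option 2: ...and bulk-load with COPY instead of row-by-row INSERTs.
    with open("stock_card_line_items.csv") as f:
        cur.copy_expert(
            "COPY stockmanagement.stock_card_line_items "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )

    # Rebuild the index once, after all rows are in place.
    cur.execute(
        "CREATE INDEX line_items_card_idx "
        "ON stockmanagement.stock_card_line_items (stockcardid)"
    )
```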

A quick spike was done locally for option 2 above. It was much faster: inserting 7 CSV files with ~2 million rows took 36 seconds.

No spiking was done on the performance of loading data using the RefData seed tool. Since it loads data through the OpenLMIS APIs, and those APIs as a whole are not currently performant, it is assumed that loading large datasets with it would be quite slow.

Useful Additional Spikes

  • Check if we can provide templates to the Nifi templates folder via a volume and Nifi automatically loads them
  • Check if replacing gzipped flow XML file works for having a pre-configured Nifi server on container startup
  • Confirm on a test server with RDS that loading 500,000 stock card line items into stock management still takes "too long" (several minutes)
