Demo Data Loading Approach

Evaluation matrix for different approaches

Options	No manual UUID generation	Demo data is in ubiquitous export format (CSV, not JSON)	Foreign key validation / manage mappings	Developer coding effort	Non-developer to add/edit demo data	Developer to add/edit demo data (maintenance)	Other things to note
Embed RefData Seed Tool into each microservice	Yes	Yes	Yes	High	Yes	Low/Medium	Because generated UUIDs would be different each time, unknown if any integration or contract tests depend on certain UUIDs
Performance data loading mechanism	Yes	Yes	Yes	Low	Yes, but need to keep it in sync	Low/Medium	There's some maintenance to keep demo data synced in Github and Mockaroo
RefData Seed Tool as new microservice	Yes	Yes	Yes	Low/Medium?	Yes	Low/Medium	This option has not been researched in depth; this is just an initial guess How to wait until services are available before trying to add demo data
Nifi

Problem Statement

Recently designed and implemented new demo data (for EPI/vaccines) to load on startup for each service (see Vaccine Demo Data)
Current demo data loading approach (for each microservice)
- JSON files manually updated
- On gradle task (demoDataSeed), these JSON files are turned into a SQL file that has INSERTs on a (potentially large) JSON blob
- Flyway picks up this SQL file and runs it after all migrations on service startup, if Spring profile demo-data is set
Pain points in current approach
- Needing to generate own UUIDs when adding JSON (uuidgenerator.net)
- Needing to convert from (CSV) format to JSON to add to our demo data files
  - Most tools we use (pgAdmin, psql, Mockaroo), can easily export to CSV, but not to JSON
- Manually need to ensure there are no foreign key violations
- Adding a new JSON file (for a new table), need to make sure loading order is correct
  - In demo-data folder, we have a generator.js file that specifies JSON file order to load into db. Any JSON files that are not specified in this JavaScript file will be loaded into db non-deterministically. Which means that demo data loading might fail on service startup because of foreign key issues.
- Virtually impossible for a non-developer to add/edit demo data because the JSON format is hard for a human to read/parse
In recent adding of new vaccine demo data, it took one developer about 3-4 full days to do, with more than half of that time dealing with the first two pain points

Options for a New Approach

Option 1: Merge RefData Seed Tool into services

Incorporate RefData seed tool into each service that needs demo data
- https://github.com/OpenLMIS/openlmis-refdata-seed
- Is currently a separate Spring Boot application that runs when the OpenLMIS system has finished starting up
- Loads CSV files defined in SourceFile.java to create/update to corresponding resources
- Uses APIs to create/update resources (not tables directly)
- Does data validation as well as data loading, not sure what this validation encompasses
- Flexible in configuring CSV files for import, through mapping files (column names don't have to match database, can use a default value instead of a value in the CSV file, can skip columns)
Advantages over current approach, with estimate of advantage (small, medium, large)
- UUID generation is handled by Java code, since it uses APIs
  - Large - this is very helpful
- Using CSV files to load data
  - Large - this is very helpful
- Updating data is relatively simple, editing the CSV files
  - Medium - this is better than editing JSON
- Data validation, not sure extent of validation
  - Small - Demo data errors can be fixed. Once it is established, it shouldn't need to be constantly validated.
- Supports mapping between resources, which field is configurable (by id, code, etc.)
  - Small/Medium - mapping is useful, but we don't need configurable mapping, as long as we can map data to each other by UUID
- Configurable if resources are allowed to be updated
  - Small - we don't care about updating resources, as demo data is generally assumed to be added to a "fresh" system
- Flexible in configuring CSV files for import, through mapping files
  - Small - we don't need flexible configuration, as long as the CSV files have the exact demo data we want to load into the system
Disadvantages over current approach, with estimate of disadvantage
- Requires access to authentication/authorization and configuration for that, since it uses APIs
  - Large - either need to figure out how to wait until system is fully up to get auth token, or refactor code to not use APIs, which would require getting resource repository beans (and waiting long enough so that the beans exist)
- Adding/removing resources to load would require modifying Java code (SourceFile.java)
  - Small/Medium - would need to figure out how to determine resources at runtime
- UUIDs for demo data would be re-generated each time the service starts again, since we drop the database before reloading, need to refactor tests that assume certain demo data UUIDs exist in the system
  - Small/Medium?
Advantages over the other option
Other notables
- Has unit tests
- This tool is particularly useful for reference data, as most reference data has codes. But since other services do not use codes to identify resources, will need to use not human-readable UUIDs for those

Option 2: Performance Data loading mechanism

Use same data loading mechanism that we use to load performance data into each service
- CommandLineRunner that runs on a Spring profile
- CSV files in the classpath that are referenced by the Runner and inserted using Resource2Db into a specified table
- Bypasses Hibernate, uses JDBC directly
- Uses Mockaroo to generate new data, either new CSV files or new rows to existing CSV files
- CSV files are checked into source control
- Would be more flexible if the Runner just looks for all files in a directory and processes them, assuming the file name matches the table name
Advantages over current approach, with estimate of advantage (small, large)
- UUID generation is handled by Mockaroo
  - Large
- Using CSV files to load data, since Mockaroo can export to CSV
  - Large
- Can handle mapping between demo data, as long as demo datasets are stored in Mockaroo
  - Small/Medium - we don't have to manually manage the mapping
- Updating demo data is relatively simple; just editing the CSV files and keeping in sync with Mockaroo
  - Medium - this is better than editing JSON
Disadvantages over current approach, with estimate of disadvantage
- Need to make sure CSV files in source control and Mockaroo are in sync (Mockaroo needs them in order to be able to generate more data properly)
  - Small
- Still need to modify Java code (the Runner) in order to add/edit demo data
  - Small - it is easy enough to add a new entry to the Java code
Advantages over the other option
- Don't need the full system to be started up (auth, APIs)
  - Large
- Mockaroo is likely more flexible in how you can generate these CSV files, as you can define formulas to generate data (https://www.mockaroo.com/help/formulas)
  - Small/Medium
- If the Runner is flexible enough to just process all CSV files in a directory, then we don't even need to modify the Java
Other notables
- Resource2Db has unit tests
- By tables, not resources, so will need to have entries for mapping/association tables (may potentially be a disadvantage over other option)

Recommendation

If there are no other viable options, the recommendation would be to copy option 2, use the performance data loading mechanism.
Both options have big benefit over the current approach, but option 1 has the problem of requiring auth and the APIs to be present. Also a lot of its configurability is not necessary for loading demo data.
The main drawback to option 2 seems to be keeping the demo data CSV files in sync between source control and Mockaroo.