Demo Data Loading Approach
Evaluation matrix for different approaches
Options | No manual UUID generation | Demo data is in ubiquitous export format (CSV, not JSON) | Foreign key validation / manage mappings | Developer coding effort | Non-developer to add/edit demo data | Developer to add/edit demo data (maintenance) | Other things to note |
|---|---|---|---|---|---|---|---|
Embed RefData Seed Tool into each microservice | Yes | Yes | Yes | High | Yes | Low/Medium | Because generated UUIDs would be different each time, unknown if any integration or contract tests depend on certain UUIDs |
Performance data loading mechanism | Yes | Yes | Yes | Low | Yes, but need to keep it in sync | Low/Medium | There's some maintenance to keep demo data synced in Github and Mockaroo |
RefData Seed Tool as new microservice | Yes | Yes | Yes | Low/Medium? | Yes | Low/Medium | This option has not been researched in depth; this is just an initial guess
|
Nifi |
|
|
|
|
|
|
|
Problem Statement
Recently designed and implemented new demo data (for EPI/vaccines) to load on startup for each service (see Vaccine Demo Data)
Current demo data loading approach (for each microservice)
JSON files manually updated
On gradle task (demoDataSeed), these JSON files are turned into a SQL file that has INSERTs on a (potentially large) JSON blob
Flyway picks up this SQL file and runs it after all migrations on service startup, if Spring profile demo-data is set
Pain points in current approach
Needing to generate own UUIDs when adding JSON (uuidgenerator.net)
Needing to convert from (CSV) format to JSON to add to our demo data files
Most tools we use (pgAdmin, psql, Mockaroo), can easily export to CSV, but not to JSON
Manually need to ensure there are no foreign key violations
Adding a new JSON file (for a new table), need to make sure loading order is correct
In demo-data folder, we have a generator.js file that specifies JSON file order to load into db. Any JSON files that are not specified in this JavaScript file will be loaded into db non-deterministically. Which means that demo data loading might fail on service startup because of foreign key issues.
Virtually impossible for a non-developer to add/edit demo data because the JSON format is hard for a human to read/parse
In recent adding of new vaccine demo data, it took one developer about 3-4 full days to do, with more than half of that time dealing with the first two pain points
Options for a New Approach
Option 1: Merge RefData Seed Tool into services
Incorporate RefData seed tool into each service that needs demo data
Is currently a separate Spring Boot application that runs when the OpenLMIS system has finished starting up
Loads CSV files defined in SourceFile.java to create/update to corresponding resources
Uses APIs to create/update resources (not tables directly)
Does data validation as well as data loading, not sure what this validation encompasses
Flexible in configuring CSV files for import, through mapping files (column names don't have to match database, can use a default value instead of a value in the CSV file, can skip columns)
Advantages over current approach, with estimate of advantage (small, medium, large)
UUID generation is handled by Java code, since it uses APIs
Large - this is very helpful
Using CSV files to load data
Large - this is very helpful
Updating data is relatively simple, editing the CSV files
Medium - this is better than editing JSON
Data validation, not sure extent of validation
Small - Demo data errors can be fixed. Once it is established, it shouldn't need to be constantly validated.
Supports mapping between resources, which field is configurable (by id, code, etc.)
Small/Medium - mapping is useful, but we don't need configurable mapping, as long as we can map data to each other by UUID
Configurable if resources are allowed to be updated
Small - we don't care about updating resources, as demo data is generally assumed to be added to a "fresh" system
Flexible in configuring CSV files for import, through mapping files
Small - we don't need flexible configuration, as long as the CSV files have the exact demo data we want to load into the system
Disadvantages over current approach, with estimate of disadvantage
Requires access to authentication/authorization and configuration for that, since it uses APIs
Large - either need to figure out how to wait until system is fully up to get auth token, or refactor code to not use APIs, which would require getting resource repository beans (and waiting long enough so that the beans exist)
Adding/removing resources to load would require modifying Java code (SourceFile.java)
Small/Medium - would need to figure out how to determine resources at runtime
UUIDs for demo data would be re-generated each time the service starts again, since we drop the database before reloading, need to refactor tests that assume certain demo data UUIDs exist in the system
Small/Medium?
Advantages over the other option
Other notables
Has unit tests
This tool is particularly useful for reference data, as most reference data has codes. But since other services do not use codes to identify resources, will need to use not human-readable UUIDs for those
Option 2: Performance Data loading mechanism
Use same data loading mechanism that we use to load performance data into each service
CommandLineRunner that runs on a Spring profile
CSV files in the classpath that are referenced by the Runner and inserted using Resource2Db into a specified table
Bypasses Hibernate, uses JDBC directly
Uses Mockaroo to generate new data, either new CSV files or new rows to existing CSV files
CSV files are checked into source control
Would be more flexible if the Runner just looks for all files in a directory and processes them, assuming the file name matches the table name
Advantages over current approach, with estimate of advantage (small, large)
UUID generation is handled by Mockaroo
Large
Using CSV files to load data, since Mockaroo can export to CSV
Large
Can handle mapping between demo data, as long as demo datasets are stored in Mockaroo
Small/Medium - we don't have to manually manage the mapping
Updating demo data is relatively simple; just editing the CSV files and keeping in sync with Mockaroo
Medium - this is better than editing JSON
Disadvantages over current approach, with estimate of disadvantage
Need to make sure CSV files in source control and Mockaroo are in sync (Mockaroo needs them in order to be able to generate more data properly)
Small
Still need to modify Java code (the Runner) in order to add/edit demo data
Small - it is easy enough to add a new entry to the Java code
Advantages over the other option
Don't need the full system to be started up (auth, APIs)
Large
Mockaroo is likely more flexible in how you can generate these CSV files, as you can define formulas to generate data (https://www.mockaroo.com/help/formulas)
Small/Medium
If the Runner is flexible enough to just process all CSV files in a directory, then we don't even need to modify the Java
Other notables
Resource2Db has unit tests
By tables, not resources, so will need to have entries for mapping/association tables (may potentially be a disadvantage over other option)
Recommendation
If there are no other viable options, the recommendation would be to copy option 2, use the performance data loading mechanism.
Both options have big benefit over the current approach, but option 1 has the problem of requiring auth and the APIs to be present. Also a lot of its configurability is not necessary for loading demo data.
The main drawback to option 2 seems to be keeping the demo data CSV files in sync between source control and Mockaroo.