2019-08-07 TC Meeting notes
Date
Aug 7, 2019
7am PDT / 4pm CEST
Meeting Link
Attendees
Discussion items
| Time | Item | Who | Notes |
|---|---|---|---|
| 5m | Agenda and action item review | @Josh Zamor (Deactivated) | |
| 10m | Performance tests | @Josh Zamor (Deactivated) | |
| 30m | Reporting stack: improvements to handle bigger amounts of data | @Daniel Serkowski (Unlicensed) & @Mateusz Wedeł (Unlicensed) | |
Notes
Reporting Stack: improvements to handle larger amounts of data
- Generated test data: ~5k requisitions, 180 requisition line items
- Performance problem:
  - Estimate in production: 100k requisitions per year
  - Availability has been an issue (m4.large instance with 50 GB EBS)
  - System resources were maxed out, e.g. disk space was increased to 100 GB
  - Memory was increased to 16 GB RAM
  - With this configuration:
    - Some reports take > 1 min (timeout exceeded)
    - Primary resource hog: Postgres (storage and CPU)
- The data:
  - The username in a certain view is used to filter rows
- A few observations:
  - The data source (the table) is 70 columns wide, for a query that needs only ~4 of those columns. Query time can be optimized by creating a narrower view with only the columns needed (see the sketch after this list).
  - The GROUP BY clauses are the biggest problem; in other words, the duplication of rows is the biggest problem for this query. A data store without these duplicated rows would be a lot faster, as would a table structure where the rows can be filtered out quickly using an index (perhaps on username; see the index sketch after this list).
  - Answering the why behind the query is important. Why GROUP BY four columns without an aggregation? What does the final dashboard/report need to do for the person viewing it?
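
A minimal sketch of the narrower-view idea from the first observation. The view, table, and column names here are hypothetical, since the notes don't name the actual 70-column data source; the point is simply to project only the columns the report needs:

```sql
-- Hypothetical names: replace requisition_line_item_view and these four
-- columns with the real 70-column source and the columns the query uses.
CREATE VIEW report_slim_view AS
SELECT facility_name,
       product_code,
       quantity_requested,
       username
FROM requisition_line_item_view;
```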
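
Likewise, a sketch of the index idea for the username filter; the table name is again an assumption:

```sql
-- Hypothetical table name; a b-tree index on username lets Postgres
-- find one user's rows without scanning the whole wide table.
CREATE INDEX idx_requisition_username ON requisition_line_item (username);
```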