What can we do so that core benefits from some of the fixes and that Angola delivers this on time
Notes
Reporting Stack Improvements for Data
Generated test data ~5k req, 180 req line items
Performance problem:
Estimate in production: 100k req per year
Availability has been an issue (m4.large instance w/ 50GB EBS)
System resources maxed out, e.g. disk space increased to 100GB
Increased memory to 16GB RAM
W/ this config
Some reports > 1min (timeout exceeded)
Primary resource hog: postgres (storage and CPU)
The data:
The username in a certain view is used to filter rows
A few observations:
The data source (the table) is 70 columns wide, for a query that needs ~4 of those. Query time can be optimized by creating a different view with only the columns needed.
The group by's are the biggest problem. IOW the duplication of rows is the biggest problem for this query. A data store without these duplicated rows is going to be a lot faster. Or a table structure where the rows can be filtered out quickly using an index (perhaps on username).
Answering the why behind the query is important. Why group by 4 columns without an aggregation? What does the final dashboard/report need to do for the person viewing it?
Action Items
Daniel Serkowski (Unlicensed): Focusing on query (as well as table structure), if you can share about the use case (the why) in reporting channel more help can be provided (please reach out to Josh Zamor if we want a design review before the next TC).