Health & Information Service
Description
Confluence content
mentioned on
- https://openlmis.atlassian.net/wiki/spaces/OP/pages/114266501/Backlog+Grooming+Sprint+31
- https://openlmis.atlassian.net/wiki/spaces/OP/pages/114699580/Backlog+Grooming+Sprint+32
- https://openlmis.atlassian.net/wiki/pages/viewpage.action?pageId=114700626
- https://openlmis.atlassian.net/wiki/spaces/OP/pages/115209092/Backlog+Grooming+Sprint+34
- https://openlmis.atlassian.net/wiki/spaces/OP/pages/115578212/Backlog+Grooming+Sprint+35
- https://openlmis.atlassian.net/wiki/spaces/OP/pages/126222360/Backlog+Grooming+Sprint+40
Activity

Łukasz Lewczyński November 28, 2017 at 8:21 AM
In my opinion, the situation in the test case was the following (assume the maximum number of db clients is 50). The first 50 requests are handled properly: each creates a db client and sends its call to the database. The 51st request tries to create a db client, but because the maximum number of clients has been reached, the database throws an exception and the request is killed -> the user gets a 5xx error code. In this case the service works fine, because it is available and only a single request was not handled properly. It would be an issue if, after the 51st request, all subsequent requests were handled the same way (db exception).

Josh Zamor November 27, 2017 at 11:06 PM
Apologies, I just got to this.
First clarification is that yes, I agree with how you were sourcing critical, OK, etc. It's right from Consul and it's what we wanted.
Second, could we revisit what was pointed out about the 200 returned when referencedata was down? That doesn't sound correct at all; perhaps referencedata's check needs to be updated? Let's not lose this, as an important immediate use for this is to ensure our automated testing (contract tests) runs AFTER this new endpoint says everything is started and ready.
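A minimal sketch of how a contract-test run could be gated on the health endpoint, as described above. The function name `wait_until_healthy`, its parameters, and the default timeout and interval are all hypothetical; `get_status` stands for any caller-supplied function that returns the HTTP status code of the health endpoint (and should return a non-200 value when the endpoint cannot be reached).

```python
import time
from typing import Callable


def wait_until_healthy(
    get_status: Callable[[], int],
    timeout_s: float = 120.0,   # assumed default, not from the ticket
    interval_s: float = 2.0,    # assumed default, not from the ticket
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status until it returns 200 or the timeout expires.

    Returns True once a 200 is seen, False if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status() == 200:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test runner could call this before starting the contract tests and abort the run if it returns False.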

Łukasz Lewczyński November 23, 2017 at 2:16 PM
Also, I think in this case the service still works correctly and only some requests failed. Because this is controlled by Consul, I think we should not spend too much time on this ticket. If you got two different responses, then I think the service works fine (it asks Consul for data) and we should close the ticket.

Łukasz Lewczyński November 23, 2017 at 1:41 PM
This error is from the database and I think it is not related to Consul. Simply, there are too many threads that want access to the database, and I assume each one creates a new database client. Also, I am not sure how to fix this, because the status is set by Consul, not by the health service.
cc:

Paweł Albecki November 23, 2017 at 1:04 PM (edited)
1. Service is built: https://github.com/OpenLMIS/openlmis-diagnostics
2. Output when service is not working
503
3. Output when service works fine
200
4. Output when there are too many requests (50 runs of `curl -s "http://host/api/facilities?access_token=xyz&[1-10000]"`)
I got in logs
but the /api/health endpoint still returned 200 (I waited about 1 minute)

Services may start, stop, take a variable amount of time to become available, and may at times become overloaded. To support a less fragile testing setup, increase visibility in production systems, and help support future scaling priorities, we should have a simple way to report on the application's health.
AC:
build a health service which queries Consul
queries to Consul should use the health status each Service registers
the HTTP status those services use with Consul is documented in: http://docs.openlmis.org/en/latest/conventions/serviceHealth.html
it should provide a GET operation on a resource named
which returns a JSON for each Consul service registered. It should allow un-authenticated access. HTTP statuses returned should be:
200 if all services have PASSING status
429 if any of the services return WARNING status
503 if any of the services return CRITICAL status
a UI page should be built which prints this information nicely, available somewhere nondescript. (optional)
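The status mapping in the AC can be sketched as below. This is an illustration under stated assumptions, not the implementation in openlmis-diagnostics: the input is assumed to be a list of check entries carrying `ServiceName` and `Status` fields (modeled on what Consul's health API reports), the function names `summarize` and `http_status` are hypothetical, and two rules not stated in the AC are assumed: CRITICAL takes precedence over WARNING, and an unrecognized status is treated as CRITICAL.

```python
from typing import Dict, Iterable, Mapping

# Severity order used to pick the worst status per service (assumption).
_SEVERITY = {"passing": 0, "warning": 1, "critical": 2}


def summarize(checks: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Return the worst reported status for each service name."""
    report: Dict[str, str] = {}
    for check in checks:
        name = check["ServiceName"]
        status = check["Status"].lower()
        if status not in _SEVERITY:
            status = "critical"  # assumption: unknown statuses count as critical
        current = report.get(name)
        if current is None or _SEVERITY[status] > _SEVERITY[current]:
            report[name] = status
    return report


def http_status(report: Mapping[str, str]) -> int:
    """Map the per-service report to the HTTP status listed in the AC."""
    statuses = set(report.values())
    if "critical" in statuses:
        return 503  # any service CRITICAL
    if "warning" in statuses:
        return 429  # any service WARNING, none CRITICAL
    return 200      # all services PASSING


# Example input; "service-b" is a made-up service name.
checks = [
    {"ServiceName": "referencedata", "Status": "passing"},
    {"ServiceName": "service-b", "Status": "warning"},
]
report = summarize(checks)
code = http_status(report)
```

The per-service `report` dictionary is one possible shape for the JSON body, and `code` is the status the endpoint would return alongside it.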