Health & Information Service

Description

Services may start, stop, take a variable amount of time to become available, and at times become overloaded. To support a less fragile testing setup, increase visibility into production systems, and help support future scaling priorities, we should have a simple way to report on the application's health.

AC:

- Build a health service which queries Consul.
- Queries to Consul should use the health status each service registers.
- The HTTP status those services use with Consul is documented at: http://docs.openlmis.org/en/latest/conventions/serviceHealth.html
- It should provide a GET operation on a resource named /api/health which returns JSON for each service registered in Consul. It should allow unauthenticated access. HTTP statuses returned should be:
  - 200 if all services have PASSING status
  - 429 if any of the services returns WARNING status
  - 503 if any of the services returns CRITICAL status
- (Optional) A UI page should be built which prints this information nicely, available somewhere nondescript. (https://openlmis.atlassian.net/browse/OLMIS-3651#icft=OLMIS-3651)
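
The status mapping in the AC could be sketched roughly as below. This is a minimal Python illustration, not the code of the service itself. The function name `aggregate_http_status` is hypothetical, and letting CRITICAL take precedence over WARNING when both are present is an assumption, since the AC does not say which one wins. Returning 200 when no checks are registered is also an assumption.

```python
from typing import Iterable


def aggregate_http_status(check_statuses: Iterable[str]) -> int:
    """Map Consul check statuses to the HTTP status for /api/health.

    Hypothetical helper: the precedence (CRITICAL before WARNING) is an
    assumption, the AC only lists the three cases separately.
    """
    statuses = {status.upper() for status in check_statuses}
    if "CRITICAL" in statuses:
        return 503  # at least one service is CRITICAL
    if "WARNING" in statuses:
        return 429  # no CRITICAL, but at least one WARNING
    # Everything is PASSING (assumption: an empty list also counts as healthy).
    return 200
```

With this ordering, a mix of WARNING and CRITICAL checks reports 503, so a caller that only looks at the status code sees the most severe condition.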

Linked work items

blocks

OLMIS-3651

Health & Information screen

relates to

OLMIS-1650

Service Metrics

OLMIS-3070

Migrate to Docker Swarm

Activity

Łukasz Lewczyński

November 28, 2017 at 8:21 AM

@Josh Zamor @Paweł Albecki In my opinion, the situation in the test case was the following (assume the maximum number of db clients is 50). The first 50 requests are handled properly: each creates a db client and sends a call to the database. The 51st request tries to create a db client, but because the maximum number of clients has been reached, the database throws an exception and the request is killed -> the user gets a 5xx error code. In this case the service works fine, because it is available and only a single request was not handled properly. It would be an issue if, after the 51st request, all subsequent requests were handled the same way (db exception).

Josh Zamor

November 27, 2017 at 11:06 PM

@Łukasz Lewczyński and @Paweł Albecki: Apologies I just got to this.

First clarification is that yes, @Łukasz Lewczyński, I agree with how you were sourcing critical, OK, etc. It's right from Consul and it's what we wanted.

Second, could we revisit what @Paweł Albecki pointed out about the 200 returned when referencedata was down? That doesn't sound correct at all; however, perhaps referencedata's check needs to be updated? Let's not lose this, as an important immediate use for this is to ensure our automated testing (contract tests) runs AFTER this new endpoint says everything is started and ready.
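
The "run contract tests AFTER the endpoint says everything is ready" use case could look something like the sketch below. It is only an illustration under stated assumptions: the helper names `fetch_health_status` and `wait_until_healthy`, the timeout and interval defaults, and the choice to treat an unreachable endpoint as "not ready" are all hypothetical and not taken from the service or the test setup.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_health_status(url: str) -> int:
    """Return the HTTP status of the health endpoint, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 429 or 503 from /api/health
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    get_status: Callable[[], int],
    timeout_s: float = 300.0,   # assumed default, tune for the environment
    interval_s: float = 5.0,    # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A test runner could call `wait_until_healthy(lambda: fetch_health_status("http://host/api/health"))` and start the contract tests only when it returns `True`, where `host` is a placeholder as in the curl example in this ticket.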

Łukasz Lewczyński

November 23, 2017 at 2:16 PM

@Paweł Albecki Also, I think that in this case the service still works correctly and only some requests failed. Because this is controlled by Consul, I think we should not spend too much time on this ticket. If you got two different responses, then I think the service works fine (it asks Consul for data) and we should close the ticket.

Łukasz Lewczyński

November 23, 2017 at 1:41 PM

@Paweł Albecki This error comes from the database and I think it is not related to Consul. Simply put, there are too many threads that want access to the database, and I assume each one creates a new database client. Also, I am not sure how to fix this, because the status is set by Consul, not by the health service.

cc: @Nikodem Graczewski

Paweł Albecki

November 23, 2017 at 1:04 PM

(edited)

1. Service is built https://github.com/OpenLMIS/openlmis-diagnostics

2. Output when service is not working
503

[ {
  "node" : "b221ed9479df",
  "checkId" : "service:8afd5f55-e43a-427a-8756-e949ae7c13db-referencedata",
  "name" : "Service 'referencedata' check",
  "status" : "CRITICAL",
  "notes" : "",
  "output" : "",
  "serviceId" : "8afd5f55-e43a-427a-8756-e949ae7c13db-referencedata",
  "serviceName" : "referencedata"
} ]

3. Output when service works fine
200

[ {
  "node" : "b221ed9479df",
  "checkId" : "service:8afd5f55-e43a-427a-8756-e949ae7c13db-referencedata",
  "name" : "Service 'referencedata' check",
  "status" : "PASSING",
  "notes" : "",
  "output" : "HTTP GET http://172.18.0.12:8080/health: 200  Output: {\n  \"status\" : \"UP\"\n}",
  "serviceId" : "8afd5f55-e43a-427a-8756-e949ae7c13db-referencedata",
  "serviceName" : "referencedata"
} ]

4. Output under too many requests (50 runs of `curl -s "http://host/api/facilities?access_token=xyz&[1-10000]"`)
In the logs I got

referencedata_1          | org.postgresql.util.PSQLException: FATAL: sorry, too many clients already

but the /api/health endpoint still returned 200 (I waited about 1 minute)
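
For anyone consuming these responses from a script, a minimal sketch of picking out the non-passing services from the JSON body is shown below. It relies only on the `serviceName` and `status` fields visible in the outputs above; the function name `failing_services` is hypothetical.

```python
import json
from typing import Dict


def failing_services(body: str) -> Dict[str, str]:
    """Return {serviceName: status} for every check that is not PASSING."""
    checks = json.loads(body)
    return {
        check["serviceName"]: check["status"]
        for check in checks
        if check["status"] != "PASSING"
    }
```

Applied to the body from item 2 it would report referencedata as CRITICAL, and for the body from item 3 it would return an empty mapping.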

Done

Details

Assignee

Łukasz Lewczyński (Deactivated)

Reporter

Josh Zamor (Deactivated)

Story Points

Original estimate

Time tracking

3d 4h logged

Components

Architecture

Priority

Minor

Created January 12, 2017 at 5:47 PM

Updated November 28, 2017 at 8:21 AM

Resolved November 23, 2017 at 3:03 PM