Reporting and Data Warehouse Strategy

Proposed Reporting / Data Warehouse Strategy

To help OpenLMIS users make better sense of their supply chain data we propose developing a modern big data warehouse setup for reporting in OpenLMIS. To simplify the creation of indicators on an ad-hoc basis, we will create a unified data pipeline that will make it possible to move data from heterogeneous data sources (initially just OpenLMIS data) into a data warehouse where indicators can be generated and visualized in real-time, no matter the size of the data store. By implementing a data warehouse approach, we hope to establish a standardized way for storing health data from OpenLMIS, on top of which common tools (APIs and dashboards) for analysis can be built.   We hope to build it in a way that other platforms like OpenSRP, DHIS2, OpenMRS, etc. can share the same data warehouse and reporting infrastructure.  This would allow for consolidation of reporting across systems and enable the creation of composite indicators from across systems.

OpenLMIS Data Warehouse Overview

OpenLMIS teams need to be able to generate up-to-date indicators from system generated data. The queries generating these indicators may be complex, e.g. containing grouping, filtering, and aggregation clauses, and the datasets may be large, e.g. hundreds of millions of rows across historical data. In order to use OpenLMIS data in on-demand dashboards, we must be able to retrieve the results of these complex queries in ~100ms. We would also like to build in the flexibility to expand the system over time to handle complex event processing and machine learning tasks. To handle these requirements, we propose building a system composed of the following:

  1. Data ingestion using Apache NiFi;
  2. Data storage and querying using PostgreSQL;
  3. Visualization using Apache Superset.

Data Ingestion and Coordination Using NiFi

To coordinate the task of retrieving data from OpenLMIS and doing basic transformations, we will use Apache NiFi - an open source tool that makes it easy to reliably process and distribute data.

Data Storage and Querying Using PostgreSQL

NiFi will write all data from OpenLMIS APIs into a PostgreSQL database. We opted for PostgreSQL due to its usability and ability to perform well at the scale of a typical OpenLMIS implementation.

Visualizing Data in Superset

Superset is an open source dashboard platform, initially developed by Airbnb, that provides an intuitive interface to explore and visualize datasets. Superset can visualize data from different data sources, but it is optimized to work with Druid. Superset provides a variety of rich data visualization types (charts, mapping, etc) out of the box, but can be expanded due to its nature of being an open source project. Superset supports robust user permissions, making it possible to generate different dashboards for health systems users depending on their level of access.

Functional Scope - Prototype Reporting Phase (Oct 2017-Jan 2018)

Below is the initial proposed scope of this reporting engagement; it will be continually reviewed, re-prioritized, and adjusted based on guidance of the OpenLMIS Product Owner during the engagement. Note the below text makes reference to systems that are no longer in the current reporting stack – specifically Kafka, HDFS, and Druid. HDFS and Druid have been replaced by PostgreSQL, and Kafka was not required for the requisitions microservice.

Phase 1 - Data warehouse infrastructure setup

The initial phase of work will consist of setting up the initial data warehousing infrastructure to support a pipeline from data ingestion to visual reporting.  This will consist initially of NiFi for data ingestion and transformation, Kafka for data routing, Hadoop (HDFS) for data storage and the OLAP database Druid to enable generation of real-time indicators and Superset for visualization.  The system will be setup with Ansible scripts to run on Ona’s hosted data warehouse instance or on OpenLMIS AWS servers.

Creating packaged Docker instances are outside the scope of this initial phase for work.  While an end goal, the packaging of these different tools will require a lot of work due to the devops complexity to ensure they run in a stable, reliable and scalable way.  The initial focus will be setting up data infrastructure to allow demonstrating initial working reporting prototypes.

Phase 2 - Initial Data Reporting Prototypes

The initial focus will be to demonstrate the ability to generate a select number of DISC indicators and/or priority reports (Reporting and Analyticsand then visualize them in a dashboard reporting interface.  We will start with 4-5 DISC indicators/reports identified by the vacccine requirements.  These could include for example: on-time & in-full delivery, order & item fill rate, closed vial wasted, full stock availability.   To do this, we will ingest static data provided in CSV/JSON format.  This data should reflect the structure and type of data that OpenLMIS will generate. The OpenLMIS team will be responsible for helping to generate this mock data.   The ingestion will be done by NiFi to allow data to be pulled from a static hosted CSV or JSON resource.  The data will then be transformed if necessary (potentially joined with other data sets) and then type mapped into a suitable format for OLAP indexing in Druid.  Once the data is in Druid, Superset will be used visualize the indicators per the DISC guidelines.  In the process, we will seek to identify if aspects of Superset need to be customized or adapted for OpenLMIS use and explore other visualization platforms if needed.  We will also prototype creating interactive map visuals using Ona’s open source Gisida mapping platform.

Phase 3 - Data Ingestion from OpenLMIS

Once we have demonstrated the ability to generate and visualize DISC indicators from static data, we will then begin to work to support ingesting data directly from OpenLMIS APIs.  We will start initially by supporting 2-3 API endpoints from OpenLMIS for data ingestion using Ni-Fi into the data warehouse infrastructure.  In the process, we will help map out what data transformations/joins are needed.  This is a potentially complex process that will require support from the OenLMIS team.  The OpenLMIS team will be responsible for providing a running instance of OpenLMIS to access and ensuring its populated with relevant data that can be used for reporting purposes.  If time allows, we will continue to work to support additional API data endpoints.  The goal of this this final initial phase is to be able to demonstrate an end-to-end working reporting solution.