Proposed Reporting / Data Warehouse Strategy

To help OpenLMIS users make better sense of their supply chain data, we propose developing a modern big data warehouse for reporting in OpenLMIS. To simplify the creation of indicators on an ad-hoc basis, we will create a unified data pipeline that makes it possible to move data from heterogeneous data sources (initially just OpenLMIS data) into a data warehouse where indicators can be generated and visualized in real time, no matter the size of the data store. By implementing a data warehouse approach, we hope to establish a standardized way of storing health data from OpenLMIS, on top of which common tools (APIs and dashboards) for analysis can be built. We hope to build it in a way that allows other platforms such as OpenSRP, DHIS2, and OpenMRS to share the same data warehouse and reporting infrastructure. This would allow for consolidation of reporting across systems and enable the creation of composite indicators that draw on data from multiple systems.

OpenLMIS Data Warehouse Overview

OpenLMIS teams need to be able to generate up-to-date indicators from system generated data. The queries generating these indicators may be complex, e.g. containing grouping, filtering, and aggregation clauses, and the datasets may be large, e.g. hundreds of millions of rows across historical data. In order to use OpenLMIS data in on-demand dashboards, we must be able to retrieve the results of these complex queries in ~100ms. We would also like to build in the flexibility to expand the system over time to handle complex event processing and machine learning tasks. To handle these requirements, we propose building a system composed of the following:

  1. Data ingestion using Apache NiFi;
  2. Data warehousing in the Hadoop File System (HDFS);
  3. Stream processing using Apache Kafka;
  4. OLAP database storage using Druid; and
  5. Visualization using Apache Superset.

Data Ingestion and Coordination Using NiFi

To coordinate the task of retrieving data from OpenLMIS and doing basic transformations, we will use Apache NiFi, an open source tool that makes it easy to reliably process and distribute data.
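
As an illustration of the kind of work such a flow performs, the minimal Python sketch below flattens a nested OpenLMIS-style requisition into flat rows ready for downstream indexing. In practice this step would be expressed as NiFi processors rather than code, and the record shape and field names here are assumptions rather than the actual OpenLMIS schema.

    # Minimal sketch of the kind of transformation a NiFi flow would apply to an
    # OpenLMIS record before handing it downstream. The record shape and field
    # names below are hypothetical, not the actual OpenLMIS schema.

    def flatten_requisition(requisition):
        """Flatten a nested requisition document into one row per line item."""
        rows = []
        for item in requisition.get("requisitionLineItems", []):
            rows.append({
                "requisition_id": requisition.get("id"),
                "facility_id": requisition.get("facilityId"),
                "period": requisition.get("processingPeriodId"),
                "product_code": item.get("productCode"),
                "quantity_requested": item.get("requestedQuantity"),
                "quantity_approved": item.get("approvedQuantity"),
            })
        return rows

    if __name__ == "__main__":
        example = {
            "id": "req-001",
            "facilityId": "fac-123",
            "processingPeriodId": "2017-10",
            "requisitionLineItems": [
                {"productCode": "BCG", "requestedQuantity": 100, "approvedQuantity": 90},
            ],
        }
        print(flatten_requisition(example))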

Real-time Stream Processing using Kafka

Apache Kafka is a real-time stream processor that uses the publish-subscribe messaging pattern. We will use Kafka to receive incoming messages (data) and publish them to topic-based queues that our OLAP store, Druid, will subscribe to and ingest into its database. Using Kafka, we will also write the data collected from OpenLMIS to a Hadoop data store (HDFS) for long-term storage, facilitating the generation of historical insights.
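
As a sketch of how data would enter the pipeline, the short Python example below publishes a flattened OpenLMIS row to a Kafka topic using the kafka-python client. The broker address and topic name are placeholder assumptions for illustration.

    # Minimal sketch of publishing flattened OpenLMIS rows to a Kafka topic that
    # Druid (and an HDFS sink) would consume. The broker address and topic name
    # are assumptions; requires the kafka-python package.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    row = {
        "timestamp": "2017-10-01T00:00:00Z",
        "facility_id": "fac-123",
        "product_code": "BCG",
        "quantity_requested": 100,
        "quantity_approved": 90,
    }

    producer.send("openlmis-requisitions", value=row)  # hypothetical topic name
    producer.flush()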

High performance queries with Druid

Druid is an open source analytics data store designed for business intelligence (OLAP) queries on event data. Druid provides a rich query language that supports generation of complex indicators in near real-time (under ~100ms) on datasets with billions of rows. Once the data is indexed (stored) in Druid, we can use Superset for exploratory analysis and to define the standardized JSON queries that can be run against the Druid API to build external dashboards and reporting tools.
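
The minimal Python sketch below shows the kind of native JSON query an external dashboard could issue against the Druid broker over HTTP. The broker URL, datasource name, and column names are assumptions for illustration only.

    # Minimal sketch of a native Druid query issued over HTTP, of the kind
    # Superset or an external dashboard would run. The broker URL, datasource
    # name, and column names are assumptions.
    import json
    import requests

    DRUID_BROKER = "http://localhost:8082/druid/v2"  # assumed broker endpoint

    query = {
        "queryType": "timeseries",
        "dataSource": "openlmis_requisitions",       # hypothetical datasource
        "granularity": "month",
        "intervals": ["2017-01-01/2018-01-01"],
        "aggregations": [
            {"type": "longSum", "name": "requested", "fieldName": "quantity_requested"},
            {"type": "longSum", "name": "approved", "fieldName": "quantity_approved"},
        ],
    }

    response = requests.post(
        DRUID_BROKER,
        data=json.dumps(query),
        headers={"Content-Type": "application/json"},
    )
    for bucket in response.json():
        print(bucket["timestamp"], bucket["result"])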


Visualizing Data in Superset

Superset is an open source dashboard platform, initially developed by Airbnb, that provides an intuitive interface for exploring and visualizing datasets. Superset can visualize data from different data sources, but it is optimized to work with Druid. Superset provides a variety of rich visualization types (charts, maps, etc.) out of the box and, being open source, can be extended with additional types. Superset supports robust user permissions, making it possible to generate different dashboards for health system users depending on their level of access.

Functional Scope - Prototype Reporting Phase (Oct 2017-Jan 2018)

Below is the initial proposed scope of this reporting engagement; it will be continually reviewed, re-prioritized, and adjusted based on guidance from the OpenLMIS Product Owner during the engagement.


Phase 1 - Data warehouse infrastructure setup

The initial phase of work will consist of setting up the data warehousing infrastructure to support a pipeline from data ingestion to visual reporting.  This will initially consist of NiFi for data ingestion and transformation, Kafka for data routing, Hadoop (HDFS) for data storage, the OLAP database Druid to enable generation of real-time indicators, and Superset for visualization.  The system will be set up with Ansible scripts to run on Ona’s hosted data warehouse instance or on OpenLMIS AWS servers.

Creating packaged Docker instances is outside the scope of this initial phase of work.  While packaging these tools is an end goal, it will require significant DevOps work to ensure they run in a stable, reliable, and scalable way.  The initial focus will be setting up the data infrastructure needed to demonstrate working reporting prototypes.

Phase 2 - Initial Data Reporting Prototypes

The initial focus will be to demonstrate the ability to generate a select number of DISC indicators and/or priority reports (Reporting and Analytics) and then visualize them in a dashboard reporting interface.  We will start with 4-5 DISC indicators/reports identified by the vaccine requirements.  These could include, for example: on-time & in-full delivery, order & item fill rate, closed vial wastage, and full stock availability.  To do this, we will ingest static data provided in CSV/JSON format.  This data should reflect the structure and type of data that OpenLMIS will generate; the OpenLMIS team will be responsible for helping to generate this mock data.  The ingestion will be done by NiFi, pulling data from a statically hosted CSV or JSON resource.  The data will then be transformed if necessary (potentially joined with other data sets) and type mapped into a suitable format for OLAP indexing in Druid.  Once the data is in Druid, Superset will be used to visualize the indicators per the DISC guidelines.  In the process, we will seek to identify whether aspects of Superset need to be customized or adapted for OpenLMIS use, and explore other visualization platforms if needed.  We will also prototype creating interactive map visuals using Ona’s open source Gisida mapping platform.
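
To illustrate the shape of one such indicator calculation, the minimal pandas sketch below computes an order fill rate from a hypothetical mock CSV. The file name and column names are assumptions that would be aligned with the mock data the OpenLMIS team provides, and the production calculation would ultimately run as a Druid query rather than in pandas.

    # Minimal sketch of a DISC-style indicator (order fill rate) computed from
    # mock CSV data. File name and column names are hypothetical placeholders.
    import pandas as pd

    orders = pd.read_csv("mock_orders.csv")  # assumed mock data file

    # Sum shipped and ordered quantities per facility and period, then express
    # the order fill rate as a percentage of quantity ordered.
    grouped = (
        orders.groupby(["facility_id", "period"])[["quantity_shipped", "quantity_ordered"]]
              .sum()
    )
    grouped["order_fill_rate_pct"] = (
        100.0 * grouped["quantity_shipped"] / grouped["quantity_ordered"]
    )

    print(grouped.reset_index().head())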

Phase 3 - Data Ingestion from OpenLMIS

Once we have demonstrated the ability to generate and visualize DISC indicators from static data, we will begin work to support ingesting data directly from the OpenLMIS APIs.  We will start by supporting 2-3 OpenLMIS API endpoints for data ingestion into the data warehouse infrastructure using NiFi.  In the process, we will help map out what data transformations/joins are needed.  This is a potentially complex process that will require support from the OpenLMIS team.  The OpenLMIS team will be responsible for providing a running instance of OpenLMIS to access and ensuring it’s populated with relevant data that can be used for reporting purposes.  If time allows, we will continue to work to support additional API data endpoints.  The goal of this final phase is to be able to demonstrate an end-to-end working reporting solution.
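
As a sketch of what this ingestion step might look like, the Python example below obtains a token and pulls records from an assumed OpenLMIS requisitions endpoint. The base URL, endpoint paths, and credentials are illustrative assumptions to be confirmed against the actual OpenLMIS APIs during this phase.

    # Minimal sketch of pulling data from an OpenLMIS API endpoint before handing
    # it to the NiFi/Kafka pipeline. The base URL, endpoint paths, and credentials
    # are assumptions, not confirmed OpenLMIS API details.
    import requests

    BASE_URL = "https://demo.openlmis.example"  # assumed OpenLMIS instance

    def get_token():
        """Fetch an OAuth2 access token (assumed auth endpoint and credentials)."""
        resp = requests.post(
            f"{BASE_URL}/api/oauth/token",
            params={"grant_type": "password", "username": "admin", "password": "secret"},
            auth=("reporting-client", "client-secret"),  # assumed client id/secret
        )
        return resp.json()["access_token"]

    def fetch_requisitions(token):
        """Pull a page of requisitions from an assumed search endpoint."""
        resp = requests.get(
            f"{BASE_URL}/api/requisitions/search",
            headers={"Authorization": f"Bearer {token}"},
            params={"page": 0, "size": 100},
        )
        return resp.json().get("content", [])

    if __name__ == "__main__":
        token = get_token()
        for requisition in fetch_requisitions(token):
            print(requisition.get("id"))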