
Notes from email exchange with Ben

  • The documentation is crappy :(. Their website is a work in progress (at best). You can post questions on their GitHub, although they rarely reply, and there's a Gitter channel that is somewhat more helpful. Ona will be working on documentation for Superset as we start to develop for it, and I can share those docs once we have them put together. This is a mid-term priority for us currently
  • I'm not sure about overlays, CSS, and branding, but I can follow up on it for you
  • Superset is the visualization layer that sits on top of Druid. Technically, when we write metrics (which I can show), you're writing Druid metrics in Superset. The two are intentionally heavily integrated. The data flow we'll be using is this:
    • APIs --> NiFi --> Kafka --> Druid / Superset
    • We will also have NiFi pass the data to HDFS (S3) for historical indexing
    • So all of the data goes into both Hadoop and Druid: Druid is for OLAP processing, and HDFS is for historical data
  • Right now the servers all contain data from multiple clients as we're building out POCs of the platform, so we can't share access to the server directly. Are there particular questions you have about the configuration?
  • A majority of our time has been spent on data ingestion (NiFi) and some on aggregations (Druid / Superset). To date, we haven't done any custom development in Superset
  • Drawing data from PostgreSQL would work, and we've done that with some deployments, although the performance / load times in Superset with a PG back-end are noticeably slower. That would be OK for internal-only work, but I would recommend against that approach in a production (PRD) instance. Also note that you'll need to rebuild dashboards when moving from a PG back-end to a Druid back-end, as PG relies on SQL whereas Druid queries rely on JSON (see the sketch below). I can show you this on the call
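
To make the SQL-vs-JSON distinction concrete, here is a minimal sketch of the same daily-total question asked both ways. Only the stock_card_line_items datasource name comes from these notes; the broker URL, the occurred/quantity field names, and the date interval are assumptions for illustration.

    import json
    import requests

    # Hypothetical Druid broker endpoint -- adjust host/port for your deployment.
    DRUID_BROKER = "http://localhost:8082/druid/v2/"

    # Against a PostgreSQL back-end, Superset issues SQL along these lines:
    PG_SQL = """
        SELECT date_trunc('day', occurred) AS day, SUM(quantity)
        FROM stock_card_line_items
        GROUP BY 1
    """

    # Against Druid, the equivalent question is a native JSON query
    # POSTed to the broker.
    druid_query = {
        "queryType": "timeseries",
        "dataSource": "stock_card_line_items",
        "granularity": "day",
        "intervals": ["2017-01-01/2017-07-01"],
        "aggregations": [
            {"type": "longSum", "name": "total_quantity", "fieldName": "quantity"}
        ],
    }

    response = requests.post(DRUID_BROKER, json=druid_query)
    print(json.dumps(response.json(), indent=2))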


Druid queries and JSON queries in Superset are almost one and the same.

NiFi pulls every minute (though eventually OpenLMIS will push) and publishes directly to Kafka and Hadoop/Druid. Kafka passes the data to Druid.
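
As a rough illustration of the Kafka hand-off, here is a minimal sketch (using the kafka-python package) of publishing one stock-card event to the topic Druid would index from. The broker address, topic name, and event fields are all assumptions; in the real pipeline NiFi performs this step.

    import json
    from kafka import KafkaProducer  # kafka-python package

    # Hypothetical broker address; in the actual pipeline NiFi, not
    # hand-written code, does this publishing step.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # A made-up stock-card event, standing in for what NiFi pulls from the APIs.
    event = {
        "occurred": "2017-06-01T12:00:00Z",
        "facility": "facility-123",
        "product": "product-456",
        "quantity": 10,
    }

    producer.send("stock_card_line_items", event)  # assumed topic name
    producer.flush()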

Outcome: needs further articulation.


Superset instructions

Sources → Druid Datasources includes stock_card_line_items, which is what all the visualizations are based on right now.

Think of a "data source" as a table.

Each slice refers to a single data source. The same applies to filters.

A "slice" is a singe component (ie: visualization) within a dashboard.

A filter will modify all the slices on a dashboard that use the filter's datasource.

        It's thus ideal, if possible, to have a single table per dashboard. That way, you can rely on just a single filter within the dashboard. (Otherwise, because you need a filter per data source, you can end up with filters which seem redundant to the user.)

Clay will look into whether we can embed Superset's charts in other pages (e.g., within OpenLMIS).

Metrics vs. dimensions:
       Dimension: the data you want to report on or group by.
       Metric: the numbers/calculations associated with your dimensions.
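
To make the split concrete: in the hypothetical Druid groupBy query below, facility is the dimension being grouped by, and total_quantity (a longSum aggregation) is the metric. The field names and interval are assumptions; only the datasource name comes from these notes.

    # Hypothetical groupBy query: "facility" is the dimension we group by;
    # total_quantity, a longSum over "quantity", is the metric.
    group_by_query = {
        "queryType": "groupBy",
        "dataSource": "stock_card_line_items",
        "granularity": "all",
        "intervals": ["2017-01-01/2017-07-01"],
        "dimensions": ["facility"],
        "aggregations": [
            {"type": "longSum", "name": "total_quantity", "fieldName": "quantity"}
        ],
    }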

When creating a graph, using a coarser time granularity, such as a month (rather than the default of a day), can make the line smoother.
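
In the native-query terms sketched earlier, the smoothing comes from the granularity field. A hypothetical example, with the same assumed field names as above:

    # Bucketing the timeseries by month instead of the default day;
    # coarser buckets smooth out day-to-day noise in the line chart.
    monthly_query = {
        "queryType": "timeseries",
        "dataSource": "stock_card_line_items",
        "granularity": "month",
        "intervals": ["2017-01-01/2018-01-01"],
        "aggregations": [
            {"type": "longSum", "name": "total_quantity", "fieldName": "quantity"}
        ],
    }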

TODO:

  • Clay Crosby will look into how to apply CSS and branding to Superset dashboards, as well as whether arbitrary HTML elements (e.g., labels, instructions) can be added to them.
  • Clay Crosby will look into whether slices may be embedded directly within external pages (e.g., within OpenLMIS) and, if so, how authorization would be handled.


