fb

A simple setup for people just starting out

In the previous chapter, we spoke about a typical analytics setup in an organization.

After reading it, you might wonder: three big steps (load, transform, visualize) and investment in a data-warehouse seems overkill to me at the start, can I start with a simpler setup?

Yes you can.

Querying from your production DB

In fact, if you are in a very early stage and fits certain criteria below, you can probably skips some steps.

  • If your data comes from only one source (most likely your production database) then you can skip data loading process.
  • If you are running a simple, low-traffic website and having an additional analytics workload on your production database will not make a big impact on your application performance, then you can skip the data warehouse.
  • If your raw data is simple enough to be visualized out of the box, or your reporting needs are simple with no complex transformations, then you can skip the data transform and modeling process.

In short, your initial analytics setup can be very simple: just hook a BI tool to the production database of your application.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/a3db8d63-5127-4985-a84d-c142b87a71b5/Untitled.png

When you interact with dashboards in your BI tool, the data will be queried live from your application database. This way, in theory, you will get the data in real-time.

Querying from a replica DB

The simple setup above, of course, is not all sunshine and roses.

The highest risk you will face with the above setup is performance. Besides your normal production workload, your database now will also take on an additional analytics workloads, and may degrade your users’ experience.

The easiest solution here is to set up a replication database (replica for short) of your production database.

Check with your dev team to see if they have a replica that you can connect to. There’s a good chance that your dev team has already set something up.

Then, instead of connecting to production, point your BI tool to your replica instead.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/82158347-0a43-4ef8-b96f-57a72f215b5e/Untitled.png

This way, your production database will not be burdened by your analytical workload, while the data you receive is still relatively fresh (depending, of course, on how your dev team configures the replication interval)

If your app runs on a NoSQL database

If you run on NoSQL database (for instance, MongoDB or Cassandra), the above setup will not work for two reasons:

  1. Limited choice of reporting tool.

    Since SQL has become the defacto standard for analytics, most BI tools are designed to work with SQL, so it limits the choice of BI tools you may pick.

    Also, there is no standardized analytics interface across different NoSQL databases, so you will end up looking for a specialized solution that is designed specifically for your brand of NoSQL database.

  2. Limited analytics capability.

    Most NoSQL databases do not have strong support for analytics, both in terms of processing engine and querying interface. Try and write an aggregation query on a big MongoDB collection, and you will quickly understand why we say this. The query will be difficult to write, and the wait for results would be horrific.

    You might even get a surprise visit from your dev team for hogging up their production database.

At this stage, the recommended approach is again getting an SQL data warehouse, and then loading data from your NoSQL app into this data warehouse. This moves us towards the direction that we laid out in What does a typical analytics setup look like?.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/43b313e1-9219-4926-95f6-1467a63abd67/Untitled.png