Making Josys data analytics platform more manageable

In this article, I would like to discuss how our data analytics team at Josys has evolved this year, and outline the steps we took towards a more manageable data analytics platform.

Originally a department within Raksul Inc., Josys Inc. has now split off as an independent company, offering the SaaS “Josys”, whose goal is to facilitate IT administrators’ day-to-day tasks. The product was launched in September of last year, and we are still growing as a company. The data analytics team in particular was only put together in April of this year.

Prior to this, there was only one data engineer (not me) managing all the data, such as KPI measurements and product usage, and all analyses were done on Redash with a little help from R scripts here and there. The consumers of this data, i.e. the sales team, had many manual tasks, such as exporting CSV files from Redash and importing them into their own spreadsheets to identify which customers they needed to call that day.

While Redash is a versatile BI tool that we continue to use today, it is not very good at managing many queries over a long period of time. As many other data engineers and analysts in our company have commented, once there are enough queries, we can no longer tell which queries are actually used in business reports and which are safe to delete. Reviewing queries is not easy either, and each team has to come up with its own solution.

Set up a repository to store SQL

To address this, the first step we took at the Josys data analytics team was to create a GitHub repository to store all our SQL queries. At the moment, this is the directory structure in our repository:

├── README.md
├── bigquery
│   ├── sql
│   │   ├── gcp_project_1
│   │   │   ├── mart_foo
│   │   │   │   ├── some_query.sql
│   │   ├── ...
│   ├── ...
├── redash
│   ├── dashboard
│   │   ├── feature_1
│   │   │   ├── some_measurement.sql
│   │   │   └── another_measurement.sql
│   │   ├── some_other_query.sql
│   │   ├── ...
└── ...

Once all queries were migrated, both reviewing and maintaining them became much easier. For reviewing, all we now needed to do was raise a pull request on GitHub and assign a reviewer. All the review comments and changes were visible on a single page, which may seem obvious to an application developer, but was still a drastic change from when we were going back and forth between Slack threads and the Redash SQL editor.

Query maintenance also became much more efficient, since we now had the option to globally search queries for a specific column or table name. Before, when the application development team modified the schema, we needed to 1) identify all queries that referenced the column or table, 2) navigate to the query editor for each page, and finally 3) make the necessary changes. Obviously, this process is prone to mistakes, such as forgetting to update one or more queries, and is also time consuming. After the GitHub migration, searching and editing queries could be done much faster, and these changes could also be reviewed by another person.
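As a minimal sketch of what this kind of global search looks like in practice (the function and its behavior are illustrative, not our actual tooling — a plain grep over the repository works just as well):

```python
from pathlib import Path


def find_references(repo_root: str, name: str) -> list[str]:
    """Return the paths of all .sql files under repo_root that mention `name`.

    Illustrative helper: a simple substring search is usually enough to
    locate every query affected by a schema change.
    """
    matches = []
    for path in sorted(Path(repo_root).rglob("*.sql")):
        if name in path.read_text(encoding="utf-8"):
            matches.append(str(path))
    return matches
```

With the queries in one repository, `find_references("bigquery/sql", "campus_id")` (or an equivalent `grep -rl`) immediately lists every file that needs attention, instead of clicking through query pages one by one.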

Agree on a coding format

The next step after our migration to GitHub was to agree on an SQL coding format. After reviewing past discussions among other data analytics teams in the company, we decided to go with this format:

select
    students.name
    , students.code
    , campus.name
from students
    join campus on students.campus_id = campus.id
where students.enrolled
    and campus.name is not null

The key features of this format are:

  1. use lowercase for everything
  2. commas and operators should be placed at the beginning of the next line
  3. join operators should be indented relative to the from clause

This article from cookpad (in Japanese) explains the intentions behind the rules extensively, but to summarize,

  1. syntax highlighting is enough to show which words are reserved keywords (e.g. concat)
  2. leading commas and operators help readers see at a glance how many columns/conditions there are in the query
  3. the from clause represents the result of all the joined tables, and since join is not a clause by itself, it should be indented

While we have not yet installed a linter, SQLFluff is a good candidate for linting and formatting this style of SQL (you can try it online here).
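If we do adopt SQLFluff, a `.sqlfluff` configuration along these lines should capture most of the rules above. This is a sketch based on SQLFluff's documented 2.x option names, not a config we have battle-tested ourselves:

```ini
[sqlfluff]
dialect = bigquery

# Rule 1: lowercase keywords and identifiers
[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = lower

# Rule 2: commas at the beginning of the next line
[sqlfluff:layout:type:comma]
line_position = leading
```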

Write a script to automate deployment to BigQuery

As part of establishing the Josys data analytics platform as a team, we have also set up BigQuery as our data warehouse (I have also written about this process here). All analyses on BigQuery are managed as "views" rather than project-level queries, and our repository stores the queries that define these views. On top of this, we added a Python script to automate updating these views with the queries on our local machines.

To illustrate this, let’s say we have

  • a dataset mart_customers, and
  • a view onboarding_status in the dataset, which belongs to
  • josys-analytics GCP project.

In this case, we would have a corresponding file in our repository bigquery/sql/josys-analytics/mart_customers/onboarding_status.sql.

Now, if we didn’t have the script and we made some changes to the SQL file, we would have to:

  • open BigQuery in our browser,
  • navigate to the view in question,
  • paste the SQL from the local file to the browser editor,
  • and finally, save the updated SQL.

This is quite the hassle, and feels like too much work just to fix a small typo. But with our script, we can simply run

python update_views.py

and handle updating the views in one command. Internally, the script uses the Python BigQuery client and a pre-defined list of SQL files for each dataset. It calls the [client.query method](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client#google_cloud_bigquery_client_Client_query) to run create or replace view DDL statements. Needless to say, this script has made updating the BigQuery views very simple.
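The core idea can be sketched roughly like this. The function names and the exact directory-walking logic below are illustrative, not our actual script; only the `create or replace view` DDL via `client.query` matches what the text describes:

```python
from pathlib import Path


def build_ddl(project: str, dataset: str, view: str, sql: str) -> str:
    """Wrap a view's SELECT statement in a create-or-replace DDL statement."""
    return f"create or replace view `{project}.{dataset}.{view}` as\n{sql}"


def update_views(repo_root: str = "bigquery/sql") -> None:
    # Imported lazily so build_ddl stays usable without GCP credentials.
    from google.cloud import bigquery

    for path in sorted(Path(repo_root).rglob("*.sql")):
        # Layout convention: bigquery/sql/<project>/<dataset>/<view>.sql
        project, dataset = path.parent.parts[-2], path.parent.parts[-1]
        ddl = build_ddl(project, dataset, path.stem,
                        path.read_text(encoding="utf-8"))
        client = bigquery.Client(project=project)
        client.query(ddl).result()  # .result() blocks until the DDL finishes


if __name__ == "__main__":
    update_views()
```

Because the views are defined by `create or replace view`, re-running the script is idempotent: an unchanged SQL file simply re-creates the same view.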

Final words

To summarize, the Josys analytics team has

  • set up a repository to store our queries,
  • agreed on a coding style for our SQL, and
  • written a script to automate deployment to our data warehouse.

These steps have vastly simplified our workflow, and helped increase the quality of the data we provide to stakeholders.

Lastly, I would like to acknowledge the data analytics team at Raksul, whose repository we heavily referenced when setting up our own.

Thanks for reading, and happy holidays!