Tue 06 September 2016

Writing queries for Django models with JSONField

I've been writing a lot of python and Django lately and decided to write about the Postgres' jsonb type, Django and JSONField.

One interesting feature introduced in Django 1.9 is the JSONField. This field adds support for jsonb, a data type introduced in Postgres 9.4. In case you haven't read/hear about it before, jsonb allows you to store JSON data (it actually stores a binary representation) in a table column. It was already possible to store JSON objects as text, but the jsonb type enforces data to comply with the JSON format rules and of course, it gives you SQL syntax to make queries over the JSON's attributes.

Creating queries with JSONField seems to be easy according to the Django documentation, but once you move into more complex situations, where queries require relations and aggregate functions, things could easily get frustrating.

Let's say we have a table named users_stats, that stores daily interactions completed by the users of your application. Each record of the table has three columns: an id column that works as incremental primary key (PK), a date timestamp column named date and a jsonb column named data. A JSON stored in the data column looks like this:

 {
   "clicked_links": 50,
   "visited_pages": 12,
   "viewed_products": 130,
   "sold_products": 8
 }

Let's support you want the daily stats summary in a period of time. The following query will retrieve that summary using Django Queryset:

Using Django ORM

 UserStats.objects.filter(date__range(since, until))
     .annotate(l=RawSQL("((data->>'clicked_links')::int)", (0,)),
               p=RawSQL("((data->>'visited_pages')::int)", (0,)),
               v=RawSQL("((data->>'viewed_products')::int)", (0,)),
               s=RawSQL("((data->>'sold_products')::int)", (0,))) \
     .aggregate(clicked=Sum('l'),
                visited=Sum('p'),
                viewed=Sum('v'),
                sold=Sum('s'))

Unfortunately, the query above feels more like a hack than a solution; it's harder to understand than the produced SQL and you have to use RawSQL expressions to define each of the aggregate functions (annotations).

The equivalent SQL for this query would be:

 SELECT date, SUM((data->>'clicked_links')::int), SUM((data->>'visited_pages')::int),
              SUM((data->>'viewed_products')::int), SUM((data->>'sold_products')::int)
 FROM user_stats
 WHERE date BETWEEN 'YYYY-MM-DD' AND 'Y1Y1Y1-M1M1M1-D1D1'
 GROUP BY date;

Using the above query with Django is straightforward; we just need a Django connection object to execute the query:

 from django.db import connection


 query = """SELECT date, SUM((data->>'clicked_links')::int), 
                         SUM((data->>'visited_pages')::int),
                         SUM((data->>'viewed_products')::int),
                         SUM((data->>'sold_products')::int)
            FROM user_stats
            WHERE date BETWEEN 'YYYY-MM-DD' AND 'Y1Y1Y1-M1M1M1-D1D1'
            GROUP BY date;"""

 cursor = connection.cursor()
 cursor.execute(query)
 return cursor.fetchall()

Now, let's move that boilerplate code into a function:

Using SQL

 from django.db import connection


 def execute_raw_query(query):
     """Execute a RawQuery in the Database"""
     cursor = connection.cursor()
     cursor.execute(query)
     return cursor.fetchall()


 query = """SELECT date, SUM((data->>'clicked_links')::int),
                         SUM((data->>'visited_pages')::int),
                         SUM((data->>'viewed_products')::int),
                         SUM((data->>'sold_products')::int)
            FROM user_stats
            WHERE date BETWEEN 'YYYY-MM-DD' AND 'Y1Y1Y1-M1M1M1-D1D1'
            GROUP BY date;"""

 rows = execute_raw_query(query)

Finally, we should create a custom manager for the user_stats model and add a function that calculates the daily summary properly:

 # myapp/models.py
 from django.db import models
 from myapp.utils import execute_raw_query

 class UserStatsManager(models.Manager):

     def get_daily_user_stats(self, since, to):
         """Calculates user stats summary

         :param since: retrieves stats from this date
         :param to: retrieves stats until this date
         """

         # formatting dates to iso 8601
         since_str = date.strftime('%Y-%m-%d')
         to_str = date.strftime('%Y-%m-%d')

         query = """SELECT date, SUM((data->>'clicked_links')::int),
                                 SUM((data->>'visited_pages')::int),
                                 SUM((data->>'viewed_products')::int),
                                 SUM((data->>'sold_products')::int)
                    FROM user_stats
                    WHERE date BETWEEN '{}' AND '{}'
                    GROUP BY date;""".format(since_str, to_str)

         rows = execute_raw_query(query)
         return rows


  class UserStats(models.Model):
      ...

      objects = UserStatsManager()

Both queries do not return Django model objects, but a list of dictionaries containing the date and the aggregation parameters.

Conclusion

This post doesn't intent to declare that raw queries are the best choice while working with JsonField, but show that while support of Postgres' jsonb in Django is new, and there's still a lot of room for improvement, you can write complex SQL queries using SQL (as usual). If you find yourself struggling to create a query with JsonField through Django ORM or event worse, you hesitate about using JsonField, because you'll require harder queries than those examples at the Django docs, stop losing time and write some SQL.

References

I spent half a hour looking for references on how to create aggregation queries using Django ORM, and found this Stackoverflow question that seems to be the more accurate Django ORM's approach.