Saturday, December 15, 2012

(JSON) Serializing (SQLAlchemy) Objects

A common task when building a web application (or REST API) is to take some data from a database and then ship it over the wire in some serialized format. Although the concepts in this post apply to pretty much any sort of serialization task, I am going to use Python and SQLAlchemy to illustrate my current preferred solution.

The first thing that I tried for this was to add a to_json() method to all of my SQLAlchemy models. So, as an example, a User model may look something like this:
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'  # declarative models need a table name

    id = Column(Integer, primary_key=True)
    first_name = Column(String, nullable=False)
    last_name = Column(String, nullable=False)

    def to_json(self):
        return dict(id=self.id,
                    first_name=self.first_name,
                    last_name=self.last_name)
And for simple situations this works perfectly. However, let's add a little twist. Now let's assume that our IDs are not auto-incrementing integers, but some binary value (e.g. a UUID of sorts), and that we also have a field that contains the user's date of birth (dob). The problem we face now is that we can't just return the binary value for the ID and the DateTime object for the date of birth, because Python's JSONEncoder doesn't know what to do with those. So, now, we have a class that looks something like this:
import uuid

from sqlalchemy import Column, DateTime, String
from sqlalchemy.types import BINARY
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'  # declarative models need a table name

    id = Column(BINARY(16), primary_key=True, autoincrement=False)
    first_name = Column(String, nullable=False)
    last_name = Column(String, nullable=False)
    dob = Column(DateTime)

    def to_json(self):
        return dict(id=uuid.UUID(bytes=self.id).hex,
                    first_name=self.first_name,
                    last_name=self.last_name,
                    # dob is nullable, so guard against None
                    dob=self.dob.isoformat() if self.dob else None)
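The encoding problem, and what the conversions in to_json() buy us, can be checked with the standard library alone. This is a minimal sketch for Python 3; the sample ID and date are made up, and plain variables stand in for a database row.

```python
import datetime
import json
import uuid

# Hypothetical raw column values: a 16-byte ID and a datetime.
raw_id = uuid.UUID("12345678123456781234567812345678").bytes
raw_dob = datetime.datetime(1980, 1, 2, 3, 4, 5)

raw = dict(id=raw_id, dob=raw_dob)

# json.dumps() raises TypeError for bytes and datetime values.
try:
    json.dumps(raw)
    raw_failed = False
except TypeError:
    raw_failed = True

# Converting each value first, as to_json() does, yields plain strings.
converted = dict(id=uuid.UUID(bytes=raw_id).hex, dob=raw_dob.isoformat())
encoded = json.dumps(converted, sort_keys=True)
```

The two conversions are the same calls that to_json() makes in the model above.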
Will this work? Assuming there are no bugs in my code, then yes, this will most definitely work. However, as I see it, there are two main flaws with this solution:
  1. In most decent-sized projects you will end up with quite a few models, and many of them will have far more attributes (database table columns) than the aforementioned User with only 4 fields. Thus, you end up writing a to_json() method that lists every attribute over and over again, which wastes time and increases the chance of a bug.
  2. If you want to change the format of any of the values (say, you moved from UUID4s to UUID1s for the IDs of all your models), you have to go through every to_json() method and make the appropriate changes. Again, a waste of time and highly error-prone.
As such, here is the solution that I have come up with, which is working well so far. It is based on these two threads (1, 2) on Stack Overflow, but instead of using mixins, I create a separate Serializer class that takes the object to serialize as a parameter. You'll see why I do this shortly.

First, let's define the serializer. Don't be scared, this serializer only has to be created once (and you can just copy and paste it), and after that serializing any object becomes a piece of cake! Trust me!
import uuid

import dateutil.parser

class JsonSerializer(object):
    """A serializer that provides methods to serialize and deserialize JSON 
    dictionaries.

    Note, one of the assumptions this serializer makes is that all objects that
    it is used to deserialize have a constructor that can take all of the
    attribute arguments. I.e. If you have an object with 3 attributes, the
    constructor needs to take those three attributes as keyword arguments.
    """

    __attributes__ = None
    """The attributes to be serialized by the serializer.
    The implementor needs to provide these."""

    __required__ = None
    """The attributes that are required when deserializing.
    The implementor needs to provide these."""

    __attribute_serializer__ = None
    """The serializer to use for a specified attribute. If an attribute is not
    included here, no special serializer will be used.
    The implementor needs to provide these."""

    __object_class__ = None
    """The class that the deserializer should generate.
    The implementor needs to provide these."""

    serializers = dict(
                        id=dict(
                            serialize=lambda x: uuid.UUID(bytes=x).hex,
                            deserialize=lambda x: uuid.UUID(hex=x).bytes
                        ),
                        date=dict(
                            serialize=lambda x: x.isoformat(),
                            deserialize=lambda x: dateutil.parser.parse(x)
                        )
                    )

    def deserialize(self, json, **kwargs):
        """Deserialize a JSON dictionary and return a populated object.

        This takes the JSON data, and deserializes it appropriately and then calls
        the constructor of the object to be created with all of the attributes.

        Args:
            json: The JSON dict with all of the data
            **kwargs: Optional values that can be used as defaults if they are not
                present in the JSON data
        Returns:
            The deserialized object.
        Raises:
            ValueError: If any of the required attributes are not present
        """
        d = dict()
        for attr in self.__attributes__:
            if attr in json:
                val = json[attr]
            elif attr in kwargs:
                # Fall back to the caller-supplied default
                val = kwargs[attr]
            elif attr in self.__required__:
                raise ValueError("{} must be set".format(attr))
            else:
                # Optional attribute with no value anywhere: skip it
                continue

            serializer = self.__attribute_serializer__.get(attr)
            if serializer:
                d[attr] = self.serializers[serializer]['deserialize'](val)
            else:
                d[attr] = val

        return self.__object_class__(**d)

    def serialize(self, obj):
        """Serialize an object to a dictionary.

        Take all of the attributes defined in self.__attributes__ and create
        a dictionary containing those values.

        Args:
            obj: The object to serialize
        Returns:
            A dictionary containing all of the serialized data from the object.
        """
        d = dict()
        for attr in self.__attributes__:
            val = getattr(obj, attr)
            if val is None:
                continue
            serializer = self.__attribute_serializer__.get(attr)
            if serializer:
                d[attr] = self.serializers[serializer]['serialize'](val)
            else:
                d[attr] = val

        return d
Now, assuming there are no bugs in the code above from when I adapted it from our production code, you can create a serializer for your User object by simply doing something like this:
class UserJsonSerializer(JsonSerializer):
    __attributes__ = ['id', 'first_name', 'last_name', 'dob']
    __required__ = ['id', 'first_name', 'last_name']
    __attribute_serializer__ = dict(id='id', dob='date')
    __object_class__ = User
The best part is, for any new object that you create, all you have to do is create one of these serializers and you are good to go. No more writing of to_json() in each model. And to get it to do some serialization, just do:
my_json = UserJsonSerializer().serialize(user)
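To make the round trip concrete, here is a condensed, function-based sketch of the same table-driven idea. It is not the class above: the plain User class, the CONVERTERS table, and convert() are hypothetical stand-ins, and the standard library's datetime.fromisoformat (Python 3.7+) is assumed in place of dateutil so the example has no third-party dependencies.

```python
import datetime
import json
import uuid


class User(object):
    """Hypothetical plain stand-in for the SQLAlchemy model."""

    def __init__(self, id, first_name, last_name, dob=None):
        self.id = id
        self.first_name = first_name
        self.last_name = last_name
        self.dob = dob


# Named converters, mirroring the `serializers` table above.
CONVERTERS = dict(
    id=dict(serialize=lambda x: uuid.UUID(bytes=x).hex,
            deserialize=lambda x: uuid.UUID(hex=x).bytes),
    # Assumption: the stdlib parser stands in for dateutil here.
    date=dict(serialize=lambda x: x.isoformat(),
              deserialize=datetime.datetime.fromisoformat),
)

ATTRIBUTES = ['id', 'first_name', 'last_name', 'dob']
ATTRIBUTE_CONVERTER = dict(id='id', dob='date')


def convert(values, direction):
    """Apply the named converter, if any, to each attribute that is set."""
    out = dict()
    for attr in ATTRIBUTES:
        val = values.get(attr)
        if val is None:
            continue
        name = ATTRIBUTE_CONVERTER.get(attr)
        out[attr] = CONVERTERS[name][direction](val) if name else val
    return out


user = User(id=uuid.UUID(int=1).bytes, first_name='Ada',
            last_name='Lovelace', dob=datetime.datetime(1815, 12, 10))

as_dict = convert(vars(user), 'serialize')   # plain strings only
wire = json.dumps(as_dict)                   # safe to send over the wire
restored = User(**convert(json.loads(wire), 'deserialize'))
```

Note that serializing produces a dictionary, not a JSON string; json.dumps() is still the step that turns it into text.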
As it currently stands, this can be used as a mixin: we could add JsonSerializer as one of the parent classes of our User model. The trouble with going that route is that you can't pass arguments to the serializer class. For example, in our system we store all dates in UTC, but need to convert them to the local timezone of the current user. As the serializer currently stands, there is no way to pass it a timezone parameter. To solve this, our JsonSerializer has a constructor that takes a timezone parameter that is then used in the serialization of dates. So, for example:
class JsonSerializer(object):
    
    ... all code that was here before ...

    def __init__(self, timezone):
        self.tz = timezone
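How the stored timezone feeds into date serialization is not shown here, so the following is only one possible sketch. DateSerializer and _serialize_date are hypothetical names, naive datetimes are assumed to be UTC, and a fixed UTC-5 offset stands in for the current user's timezone.

```python
import datetime


class DateSerializer(object):
    """Hypothetical, trimmed-down serializer that only handles dates."""

    def __init__(self, timezone):
        self.tz = timezone
        # Built per instance so the converter can use self.tz.
        self.serializers = dict(
            date=dict(serialize=self._serialize_date)
        )

    def _serialize_date(self, value):
        # Assumption: naive datetimes from the database are in UTC.
        if value.tzinfo is None:
            value = value.replace(tzinfo=datetime.timezone.utc)
        return value.astimezone(self.tz).isoformat()


# A fixed UTC-5 offset stands in for the current user's timezone.
user_tz = datetime.timezone(datetime.timedelta(hours=-5))
stored = datetime.datetime(2012, 12, 15, 18, 30)  # naive UTC value

local = DateSerializer(user_tz).serializers['date']['serialize'](stored)
```

Building the date converter inside the constructor is what lets it see the timezone; a lambda in a class-level table has no access to the instance.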
Make sense? As an added benefit, we can also add more serializers to our default list of serializers in the constructor. For example, let's say our User object references a list of Email objects and we want to serialize those as well. First we'd create an EmailJsonSerializer just like we did for the User, and then add this email serializer to the serializers in UserJsonSerializer. OK, that was a bit convoluted, so here is what I mean:
class EmailJsonSerializer(JsonSerializer):
    __attributes__ = ['user_id', 'email']
    __required__ = ['user_id', 'email']
    __attribute_serializer__ = dict(user_id='id')
    __object_class__ = Email


class UserJsonSerializer(JsonSerializer):
    __attributes__ = ['id', 'first_name', 'last_name', 'dob', 'emails']
    __required__ = ['id', 'first_name', 'last_name']
    __attribute_serializer__ = dict(id='id', dob='date', emails='emails')
    __object_class__ = User

    def __init__(self, timezone):
        super(UserJsonSerializer, self).__init__(timezone)
        # Copy first so the table shared by every serializer class is not mutated
        self.serializers = dict(self.serializers)
        self.serializers['emails'] = dict(
            serialize=lambda x:
                [EmailJsonSerializer(timezone).serialize(xx) for xx in x],
            deserialize=lambda x:
                [EmailJsonSerializer(timezone).deserialize(xx) for xx in x]
        )
Now when we call the serializer, it will not only serialize the contents of the User object, but also the contents of any and all Email objects associated with it (assuming you set up the one-to-many relationship properly in your SQLAlchemy models).
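The nested case can be tried without a database. In this minimal sketch, the plain User and Email classes and the MiniSerializer family are hypothetical stand-ins: IDs are plain integers, only serialize() is implemented, and timezone handling is left out to keep it short.

```python
import json


class Email(object):
    """Hypothetical stand-in for the Email model."""
    def __init__(self, user_id, email):
        self.user_id = user_id
        self.email = email


class User(object):
    """Hypothetical stand-in for the User model and its relationship."""
    def __init__(self, id, first_name, emails):
        self.id = id
        self.first_name = first_name
        self.emails = emails


class MiniSerializer(object):
    """Trimmed-down serializer: serialize() only, no timezone handling."""
    __attributes__ = None
    __attribute_serializer__ = None

    def __init__(self):
        # Per-instance table, so subclasses can add entries safely.
        self.serializers = dict()

    def serialize(self, obj):
        d = dict()
        for attr in self.__attributes__:
            val = getattr(obj, attr)
            if val is None:
                continue
            name = self.__attribute_serializer__.get(attr)
            d[attr] = self.serializers[name]['serialize'](val) if name else val
        return d


class EmailSerializer(MiniSerializer):
    __attributes__ = ['user_id', 'email']
    __attribute_serializer__ = dict()


class UserSerializer(MiniSerializer):
    __attributes__ = ['id', 'first_name', 'emails']
    __attribute_serializer__ = dict(emails='emails')

    def __init__(self):
        super(UserSerializer, self).__init__()
        # Delegate each related object to its own serializer.
        self.serializers['emails'] = dict(
            serialize=lambda xs: [EmailSerializer().serialize(x) for x in xs]
        )


user = User(id=7, first_name='Ada', emails=[Email(7, 'ada@example.com')])
payload = json.dumps(UserSerializer().serialize(user), sort_keys=True)
```

Each related object is handed to its own serializer, so the Email format is still defined in exactly one place.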

Again, while I used SQLAlchemy models to illustrate this pattern, it can work for pretty much any object going to and from any type of serialized data. Happy coding!
