Illustration by Dan Freeman

Practicality Beats Purity - Pure SQL vs ORMs

2018-11-01 python engineering databases Cristian Medina

I’ve been using some form of a database throughout the entirety of my career. Sometimes single-file databases, sometimes full servers. Sometimes testing them, sometimes designing them, but a lot of times I was optimizing them. Even when learning programming with my dad, most of the apps I built were about storing and managing some type of data.

With all those years of experience, I definitely understand enough to know that I’m by no means an expert at any of it. There are several (an understatement) mechanisms by which folks make the best use of their database servers, almost all of them are tradeoffs in memory usage, space, look-up times, results retrieval, backup mechanisms, etc.

As expected, each database mechanism has their own quirks and optimizations, but the common theme is the language which you use to retrieve information: Structured Query Language (SQL). Different database engines implement different extensions to this language, some of which add powerful functionality, some of which just add confusion. But in general, SQL has been very successful in standardization across the industry.

This next chapter in the Practicality Beats Purity series covers the tradeoffs when using direct SQL queries to a database vs programming language abstractions that do it for you, like ORMs.

What’s an ORM?

Given the nature of data, interfacing with it can be complex. This leads to a number of constructs that help write optimum queries. In the object-oriented world, the most common concept is the Object Relational Mapper (ORM).

ORMs allow you to map a class written in a language like Python or Java to its corresponding SQL table structure in a given database. They take care of formulating queries for you, as well as setting up table relationships, indexes and other configurations. They can also help manage query transactions and database server connections.

With these systems, most queries map directly into class methods. For example, given a User class that models its equivalent table, accessing it over an ORM is as simple as executing Users.query.all(). This gets all the users in the table as a list of User class instances in code. Each property in the class corresponds to a table column.

From here, it’s easy to pick an instance and change its properties through an assignment operation like user.name = 'ABC', and then commit the changes to the database with session.commit(). It’s not necessary to know how to perform the correct UPDATE query to the database, the ORM does that.

Python has several ORM packages, the most famous one being SQLAlchemy, but there are others like PeeWee ORM that do similar functions.

Common reasons for using ORMs

There are many things that attract folks to using ORMs, here’s a short list of the most common:

Familiarity with object oriented concepts.
No need to learn SQL.
Your data is accessible as class objects, as opposed to tuples that you later coerce into objects.
The interface is abstracted above the database engine, so if for some reason you need to switch from SQLite to MySQL to PostgreSQL, your application will keep functioning without code changes (barring some corner cases of course).
There are migration tools built on these systems that help update the structure of your database over time, by applying or reverting schema changes.

What happens when you get into specifics?

There’s only so much that an abstraction layer can assume about your data. At some point you’ll need to perform manual optimizations. But doing so, requires deeper knowledge of the internal design and the APIs that makeup the ORM package. This means that instead of learning SQL, you have to learn ORM. And since ORMs differ in how they handle things, the whole purpose of having a common language is now out the window.

By default, ORMs will only retrieve the minimal amount of information needed to satisfy your query. Given a User class that references a Team class through a foreign key, listing all users will not include all columns from the Team table, only it’s identifier. You have to ask for the join explicitly.

Who cares? If you’re implementing an interface or endpoint which provides just a little bit of team information along with it (like perhaps the team name), you’ll find that default ORM behavior gets the list of all users in one query, but executes a separate query for the team name for each user. If you have 100 users in 30 teams, you could make up to 31 round trips to your database. It’s not obvious that this is happening unless you read the docs in detail, or consciously enable the ORM to show the queries it executes.

How do you fix it? Define the team reference in the User class to be an eager join instead of a lazy join (the default). This means you’ll have all the info in 1 query. The specifics of how to make this happen will vary depending on the ORM package that you are using.

Now that we added eager joins everywhere they are required, let’s look at something else. Assume the User and Team are needed in an interface that lists all tests executed by them across all products for the last month. It includes the users that executed them, the teams they’re in and the teams that own the test cases. Since we added eager joins, now we have a different problem: the query to the database is a larger join that retrieves duplicate information because the tests are owned by teams, and the users belong to teams.

Yes, we can optimize that too, but at this point you’re replacing learning SQL with learning how the ORM APIs allow you to improve this type of query. If you’re like me and you already know SQL, you’ll spend a lot of time trying to figure out which methods to call with which arguments in order to produce the SQL string that you know is optimal for this situation. ORMs differ in their implementations and while most do have detailed documentation, it’s not always easy to follow.

Scaling

Depending on what you’re trying to accomplish, it’s entirely possible that an ORM will not scale for your needs. Consider the example below.

We’re building a single-page web application with the idea to front-load as much data as possible in order to deliver a responsive interface. With the server in China (or behind a high latency firewall), it’s acceptable to wait several seconds to load 20k records because half the time goes into connecting to the server.

Adding a record paging mechanism will make for a worse experience. Whether you load one record or 100, you still incur a high penalty just to reach your server (several seconds per request). Trust me, this is more common than you think.

Let’s put together the endpoint function. Assuming we’re only working on User records which are part of a larger Team, we should retrieve 20k user-team entries. Seems simple enough, any ORM could do it, and any DB engine will hardly blink at retrieving the info. Returning jsonify(User.query.all()) from the function is all we need, right? That was easy!

Once you try this out, it becomes obvious that something isn’t right. One minute goes by and the function still hasn’t returned a response. Let’s work through what we did:

Function: Ask the ORM to get a list of users.
ORM: Translate the command to SELECT ... FROM users INNER JOIN teams ...
ORM: Ask underlying DB engine module to send the command.
DB server: receive SQL command, get data and send it over the network.
DB engine: Receive data and pass it to the ORM.
ORM: For every user in the list of all users, instantiate a User class, set its properties, instantiate a Team class, set its properties, append to a list.
ORM: Return list of User instances to the endpoint function.
Function: For every user in the list of all all User instances, turn each of those objects into a dict that’s JSON serializable.
Function: Return a list dictionaries to the REST framework of choice.
REST: Serialize the dictionaries and return the JSON response over HTTP.

Anything seem odd? We asked a DB for some stuff, it gave you the stuff in a list of tuples of strings (and maybe ints), for which you then instantiated 2 classes per tuple (where each class contains its own dict of properties), then did nothing with those classes other than instantiate 2 extra dicts per class, which are then serialized back into JSON and output on the network. 80% of the compute time went into making class instances and breaking them apart to serialize them.

Some ORMs have a way to control class instantiation, some don’t. You’ll have to learn how to use the ORM properly and understand your specific situation. Most folks may not run into this problem, but if you do, it can be a pain to figure out how to get the base response that avoids wasting so much compute. It may be easier to ditch the ORM entirely and just query through the database module.

Summarizing

We’re reminded once again that it’s always important to evaluate the problems we’re trying to solve before choosing off-the-shelf solutions.

Yes, SQL can be a pain to manage, but so can the ORM if you don’t know how to properly formulate queries, or how it works internally. ORMs will know how to optimize a query, but they will not know what you’re going to do with it. You’re essentially moving the problem from one domain into another. In doing so, don’t forget to ask whether the code is actually simpler and more maintainable, and whether its easy to tell where to find the implicit abstractions that get you into trouble.

For most cases, using an ORM is well worth the trouble. But don’t forget to budget time for learning how to use it.

python practicality orm sqlalchemy sql