Python Dataclass best practices (and why you should use them)

Date: 2023-10-29 | create | python |

I prefer static types > dynamic types, as I believe they make code more explicit and expressive, thus making software simpler to build and maintain long term. (IMO types are a Simple Scalable System (3S) for managing complexity at scale.)

Alas, you don't always get to choose what technologies you use in your day-to-day work. Across the past 5 years and 3 companies of my Software Engineering career, Python has been a large component of our systems. Python is a dynamic language at its core, though its type annotations have gotten quite good recently.

  • Instagram - Python runs IG's primary backend monolith (see: Instagram's tech stack will surprise you)
  • Reddit - Reddit's primary backend monolith is built with Python, currently running 2.x(!)
  • Rippling - Backend monolith written in Python

This leads to the question we'll explore in this post:

Q: How can we wrangle complexity at scale in a dynamic language like Python?

Answer

TL;DR - Use types. Further, use dataclasses.

Python allows types via annotations. They're not as powerful or strict as those of most statically typed languages, but they do move many errors to "build" time (i.e. when you run your type checker), which captures much of the value: faster feedback and better correctness. We can use these annotations to narrow the possibilities throughout our system, making our business logic more explicit, more precise, and thus more correct.
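As a quick sketch of what that "build" time checking buys you (assuming mypy or a similar checker; fahrenheit_to_celsius is a made-up helper, not something we'll use later):

def fahrenheit_to_celsius(degrees_f: float) -> float:
  return (degrees_f - 32) * 5 / 9

# A type checker rejects this call before the code ever runs:
# we're passing a str where the annotation promises a float.
fahrenheit_to_celsius("98.6")

At runtime that call would blow up with a TypeError deep inside the arithmetic; the annotation surfaces the mistake at the call site instead.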

Now you might say that a large emphasis on types is non-Pythonic and gets away from the dynamic roots / power of the language. You might be right but I don't really care.

I believe the goal of any programming language / ecosystem is to be a good, useful tool for accomplishing your goals - largely building 3S systems to solve problems. In my experience non-explicit, non-precise languages (cough - dynamic - cough) always become a bottleneck at scale because the mental overhead required to understand what's going on increases exponentially.

Some clear examples:

  • Instagram - Started as an untyped Python monolith -> now a fully-typed Python monolith, moving large / complex workloads to the fully-typed Hacklang monolith (a PHP fork)
  • Reddit - Ran as an untyped Python monolith for over a decade -> declared bankruptcy on the monolith (way too hard to change) and is now rewriting everything in typed Golang
  • Rippling - Untyped Python monolith -> moving to fully-typed Python

In the rest of this post, we're going to explore how to best leverage dataclasses to make your code more precise, explicit, and easier to understand.

Note: All examples provided here are also available in Replit if you want to play around with them.

Python Dataclasses

The generic term for the pattern Python's dataclasses fill is a record. Records are typically lean, data-only structures containing a fixed set of named, typed fields.

"""
Example 1: Simple dataclasses
* Temperature, Unit pairs
"""
print("Example 1: Simple dataclasses")


class TemperatureUnit(Enum):
  CELSIUS = 1
  FARENHEIGHT = 2


@dataclass
class Temperature:
  TemperatureMagnitude: float
  Unit: TemperatureUnit


temperature_tuple = (50.0, TemperatureUnit.CELSIUS)
print(f"TemperatureTuple: {temperature_tuple[0]}") # prints - TemperatureTuple: 50.0

temperature_dc = Temperature(50.0, TemperatureUnit.CELSIUS)
print(f"TemperatureDC: {temperature_dc.TemperatureMagnitude}") # prints - TemperatureDC: 50.0

Why they're useful:

  • We often want to group data
  • We want that data group to be explicit, easy to understand, and hard to get wrong -> use fields and types to constrain
  • Tuples quickly become cumbersome (usually at 3+ values)

Dataclasses are great, lightweight constructs for explicitly constraining possibilities and lowering mental overhead.

As a rule of thumb: if you're about to return / pass around a group of values and are tempted to reach for a tuple, consider a dataclass instead.
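For example, a small sketch building on Example 1 above (parse_reading is a hypothetical helper, not a real API): returning a dataclass instead of a bare tuple gives callers named, typed fields.

def parse_reading(raw: str) -> Temperature:
  # e.g. "50.0 CELSIUS" -> Temperature(TemperatureMagnitude=50.0, Unit=TemperatureUnit.CELSIUS)
  magnitude, unit_name = raw.split()
  return Temperature(float(magnitude), TemperatureUnit[unit_name])

reading = parse_reading("50.0 CELSIUS")
print(reading.Unit)  # named field access, instead of reading[1] on a tuple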

Immutable Dataclasses

A common refrain in programming is that mutability is the root of all evil (or at least a major cause of a lot of complex, hard-to-maintain, fragile software). But the truth is that all software must contain mutability somewhere, otherwise it's probably not very useful.

So the real takeaway is that poorly constrained mutability makes systems incredibly hard to reason about and thus to build. So constrain your mutability.

Vanilla dataclasses are mutable out-of-the-box. This makes sense, as Python is mutable by default as well - so at least it's consistent. But for a good (3S) system we almost never want this, as it's so easy to get wrong. Instead, we want to build a pit of success: constrain mutability by enforcing immutability.

We can make our dataclasses immutable by utilizing the frozen parameter. This will throw if anyone tries to mutate the object at runtime, and most type checkers can pick up on it and warn / error at build time, making more error cases visible earlier in the dev cycle.

"""
Example 2: Immutable dataclasses
* Show a mutable thing you shouldn't do probably
"""
print("Example 2: Immutable dataclasses")

print("Example: MutableDC")
@dataclass
class MutableDC:
  AnInt: int

def bad_mutation(dc: MutableDC) -> None:
  dc.AnInt = 0

mutable_dc = MutableDC(1)
print(f"Before mutation: {mutable_dc}") # prints - Before mutation: MutableDC(AnInt=1)

bad_mutation(mutable_dc)
print(f"After mutation: {mutable_dc}") # prints - After mutation: MutableDC(AnInt=0)

print("Example: ImmutableDC")
@dataclass(frozen=True)
class ImmutableDC:
  AnInt: int

def bad_mutation_blocked(dc: ImmutableDC) -> None:
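  # Type checkers that understand frozen dataclasses flag this assignment at "build" time;
  # at runtime it raises dataclasses.FrozenInstanceError.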
  dc.AnInt = 0

immutable_dc = ImmutableDC(1)
print(f"Before mutation: {immutable_dc}") # prints - Before mutation: ImmutableDC(AnInt=1)

# Fails! dataclasses.FrozenInstanceError: cannot assign to field 'AnInt'
bad_mutation_blocked(immutable_dc)
print(f"After mutation: {immutable_dc}") # (never runs)

Now you might be saying - okay, this makes sense, but the example is contrived; no one would ever write code that mutates the passed-in parameter. And I'd say "ho ho, very funny! Tell that to past me, who spent a whole day last week debugging weird values only to find that params were being mutated 5 levels down!"

Do yourself a favor and build the pit of success to make it impossible to do bad things.

Explicit Dataclasses

So the whole idea of types and dataclasses is to make things explicit. Dataclasses certainly help over alternatives like unnamed tuples, but we'll still commonly run into cases where they become similarly implicit, with high mental overhead. I've most commonly seen this happen when modeling a record with many values (typically 3+).

Dataclasses come with a built-in constructor that's based on position. With fewer than 3 values this can be sufficient, as it's not too hard to remember what the properties are via their order and type.

However, as soon as we get over that amount or have several fields of the same type, this becomes exceedingly confusing. It's easy to forget which position maps to which field, which means it's easy to set the wrong field to the wrong value!

Another insidious issue with order-based constructors: if someone refactors the order of the properties, the order of the constructor parameters changes too. That could mean a lot of fields are now being initialized with the wrong values, and if you don't get type errors (i.e. the swapped fields have compatible types) or have tests covering this - you may never know until runtime!
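To make that refactoring hazard concrete, here's a minimal sketch (PriceRange is a made-up dataclass, not part of the examples below):

@dataclass(frozen=True)
class PriceRange:
  low: int
  high: int

quote = PriceRange(10, 99)  # today this means low=10, high=99

# If someone later reorders the fields...
#
#   class PriceRange:
#     high: int
#     low: int
#
# ...the same call site silently becomes high=10, low=99. Both fields are ints,
# so the type checker stays quiet, and nothing complains until some downstream
# logic misbehaves at runtime.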

The easiest way I've found to solve this is to utilize the kw_only flag (available since Python 3.10), which makes it so that every field<>value must be explicitly set when constructing the dataclass. Of course you can still set a field<>value incorrectly, but at least you're forced to read the field you're setting before doing so.

This solves both problems:

  • Easy to read what field is set to what value
  • Hard(er) to set wrong field to wrong value
"""
Example 3: Explicit dataclasses
* Show dataclasses at scale
"""
print("Example 3: Explicit dataclasses")

@dataclass(frozen=True)
class NoKeywordDC:
  aThousand: int       # should always be 1000
  aTwelveHundred: int  # should always be 1200
  aElevenHundred: int  # should always be 1100

many_dcs = [
  NoKeywordDC(
    1000,
    1200,
    1100
  ),
  NoKeywordDC(
    1000,
    1200,
    1100
  ),
  NoKeywordDC(
    1000,
    1200,
    1100
  ),
  NoKeywordDC(
    1000,
    1100,
    1200
  ),
  NoKeywordDC(
    1000,
    1200,
    1100
  )
]
# Q: Which one is set wrong?
# A: The 4th DC is different

@dataclass(frozen=True, kw_only=True)
class KeywordDC:
  aThousand: int       # should always be 1000
  aTwelveHundred: int  # should always be 1200
  aElevenHundred: int  # should always be 1100

many_dcs = [
  KeywordDC(
    aThousand=1000,
    aTwelveHundred=1200,
    aElevenHundred=1100
  ),
  KeywordDC(
    aThousand=1000,
    aTwelveHundred=1200,
    aElevenHundred=1100
  ),
  KeywordDC(
    aThousand=1000,
    aTwelveHundred=1200,
    aElevenHundred=1100
  ),
  KeywordDC(
    aThousand=1000,
    aTwelveHundred=1100,
    aElevenHundred=1200
  ),
  KeywordDC(
    aThousand=1000,
    aTwelveHundred=1200,
    aElevenHundred=1100
  ),
]
# Q: Which one is set wrong?
# A: The 4th DC again - but now the mismatch between field names and values is easy to spot

Next

When I get to choose my tech stack, I typically reach for languages with great ergonomics for writing explicit code - currently F# and TypeScript. But you don't always get to choose your stack, so hopefully this helps you wrangle the implicit complexity in your own Python projects.

If you're interested in learning more about F#, check out:

Want more like this?

The best / easiest way to support my work is by subscribing for future updates and sharing with your network.