Data code model part II: Python Typing
Using python type constructs to structure your python code.
This is part 2 of a series on data code modelling for data engineering. In part 1, we went over abstract base classes in python and how these can be used to design software contracts between python classes:
Data Code Modelling: Abstract Base Classes
In this article, we will go over Python typing. A well-designed type system is the foundation of a robust code base. Python has rich support for creating custom types and applying them at various locations throughout your code via the typing module.
💡 Python typing is an EXTENSIVE topic. Therefore, this article will only cover some essentials on this topic.
Why care about python types
There are a few main reasons why it pays dividends to lay out your python typing more carefully:
Types are an efficient way to express intent. It is a lot easier to read and write code with proper type annotations than to rely on piles of documentation.
Types can be used by static code analysers like mypy to provide insights on potential code issues BEFORE it is run.
Types can be used by python libraries to enforce specific class behavior. For example, dataclasses and pydantic can be used to create a data model class which uses types to enforce specific outputs (more on that in part 3).
Python type hints
A type hint is an annotation that can be added to Python code to declare the types of objects and attributes. Type hinting can be done at a few important locations in your code:
Class attributes
Function arguments
Function outputs
Variables
To illustrate this, view the example below:
class DataFrame:
    name: str
    rows: int

    def __init__(self, name: str, rows: int):
        self.name = name
        self.rows = rows

def process_dataframe(df: DataFrame) -> str:
    return f"Processing {df.name} with {df.rows} rows."

df_name: str = "user_data"
df_rows: int = 1000000
df: DataFrame = DataFrame(df_name, df_rows)
result: str = process_dataframe(df)
You will notice we are using the built-in types str and int, as well as the custom class DataFrame, at several places in our code. This shows a powerful feature of typing: all Python data types and user-defined classes can be used as type hints.
This means that it’s also possible to use constructs from third-party libraries such as pandas or numpy as type hints. Thus, a type hint like df: pd.DataFrame is completely acceptable!
Why does this work? Python type hints do NOT impact your code at runtime. They can be used by other libraries to create additional behavior, but in the “vanilla” implementation they don’t affect how your code runs, and this is intended by design.
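To make this concrete, here is a minimal sketch: the annotations below claim int arguments, yet passing strings still runs fine, because vanilla Python ignores hints at runtime. A static checker like mypy would flag the second call.

```python
def add(a: int, b: int) -> int:
    return a + b

print(add(2, 3))      # 5
# Hints are not enforced at runtime: this call still succeeds,
# since str also supports the + operator. mypy would report it as an error.
print(add("2", "3"))  # 23 (string concatenation)
```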
Generic types
A subset of Python types are called generic types. A generic type is characterised by syntax like type[…], where … can be any other type. This is generally reserved for types that can be constructed as containers of other types. For example, a list of integers can be type hinted as list[int]. This is pretty powerful, as it allows a more precise definition of these sorts of types in our code. It can also nest arbitrarily deep, so list[list[list[str]]] is an acceptable type hint too.
Usually it’s rather obvious that a type is generic. For collections like list, dict or set, it makes sense that they are generics. As of this writing, it is not possible to determine at runtime that a type is generic simply by checking the bare base type (like list), but it is possible to confirm a generic once it’s parameterised (e.g. list[int]). Generally this can be confirmed like so:
from typing import get_origin, get_args

# Define some generic and non-generic types
generic_type = list[int]
non_generic_type = int
generic_type_without_params = list

# Check if a type is a generic type
def is_generic_type(tp):
    return get_origin(tp) is not None or (hasattr(tp, '__origin__') and tp.__origin__ is not None)

# Check if a type is a parameterized generic type
def is_parameterized_generic_type(tp):
    return get_origin(tp) is not None and len(get_args(tp)) > 0

if __name__ == "__main__":
    print(f'Is {generic_type} a generic type? {is_generic_type(generic_type)}')  # True
    print(f'Is {non_generic_type} a generic type? {is_generic_type(non_generic_type)}')  # False
    print(f'Is {generic_type_without_params} a generic type? {is_generic_type(generic_type_without_params)}')  # False: a bare list cannot be identified as generic at runtime
    print(f'Is {generic_type} a parameterized generic type? {is_parameterized_generic_type(generic_type)}')  # True
    print(f'Is {generic_type_without_params} a parameterized generic type? {is_parameterized_generic_type(generic_type_without_params)}')  # False
This is all just to show that, although generics are widely used, they don’t share a single inheritance hierarchy; they are rather objects with related implementations.
There are a number of generic types defined across the Python standard library that help a user implement and understand these types correctly. A non-exhaustive list: the built-in collections list, tuple, dict, set and frozenset; the abstract collections in collections.abc such as Sequence, Mapping, Iterable, Iterator and Callable; and the aliases re-exported from typing such as List, Dict and Tuple.
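Beyond the built-in collections, you can also define your own generic classes via typing.Generic. A minimal sketch with a hypothetical Batch container:

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class Batch(Generic[T]):
    # A user-defined generic container: Batch[int], Batch[str], etc.
    def __init__(self, items: list[T]) -> None:
        self.items = items

    def first(self) -> T:
        return self.items[0]

int_batch: Batch[int] = Batch([1, 2, 3])
print(int_batch.first())  # 1
```

A checker now knows that int_batch.first() returns an int, just as it knows what a list[int] yields.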
Annotated types
An annotated type can be seen as a custom type definition in Python: you define a type and associate some metadata with it. Generally, the metadata from these annotated types can be used to add extra instructions, for example for data validation. We will get back to leveraging that when we get into pydantic, but for now, just know you can define an annotated type like so:
from typing import Annotated

# Define an annotated type with metadata
Age = Annotated[int, "Must be a non-negative integer"]

def check_age(age: Age):
    if age < 0:
        raise ValueError("Age must be a non-negative integer")
    print(f"Age is valid: {age}")

# Usage examples
try:
    check_age(25)  # Valid
    check_age(-5)  # Invalid, will raise ValueError
except ValueError as e:
    print(e)
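The metadata attached by Annotated can be read back at runtime with get_type_hints(..., include_extras=True) and get_args; this is essentially the mechanism validation libraries build on. A small sketch:

```python
from typing import Annotated, get_args, get_type_hints

Age = Annotated[int, "Must be a non-negative integer"]

def check_age(age: Age) -> None:
    pass

# include_extras=True preserves the Annotated wrapper instead of stripping it
hints = get_type_hints(check_age, include_extras=True)
base, *metadata = get_args(hints["age"])
print(base)      # <class 'int'>
print(metadata)  # ['Must be a non-negative integer']
```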
Functions or other callable objects can be annotated using collections.abc.Callable or typing.Callable. For example, Callable[[int], str] indicates a function that takes an int and returns a str. An example is:
from collections.abc import Callable, Awaitable

# Function with callable parameter
def feeder(get_next_item: Callable[[], str]) -> None:
    ...  # Body

# Function with multiple callable parameters
def async_query(on_success: Callable[[int], None],
                on_error: Callable[[int, Exception], None]) -> None:
    ...  # Body

# Async function example
async def on_update(value: str) -> None:
    ...  # Body

callback: Callable[[str], Awaitable[None]] = on_update
The subscription syntax for Callable requires two values: the argument list and the return type. Using an ellipsis (...) as the argument list indicates that any parameter list is acceptable:
from collections.abc import Callable

def concat(x: str, y: str) -> str:
    return x + y

x: Callable[..., str]
x = str     # OK
x = concat  # Also OK
Special types and forms
Self
Self can be used to refer to the instance type of the class currently being defined (available in typing since Python 3.11). We worked with classes like this a lot in part 1 of this series; an example annotated with Self:
from typing import Self

class SQLDataSource:
    def __init__(self) -> None:
        self.connection_params: dict = {}

    def configure(self, config: dict) -> Self:
        # Returning Self tells checkers the concrete (sub)class comes back
        self.connection_params = config
        return self

    def execute_query(self, query: str) -> str:
        return f"Executing SQL query: '{query}' with parameters {self.connection_params}"
This is useful for building out functionality between different components of your class: Self always resolves to the concrete class a method is called on, which makes it well suited for method chaining and subclassing.
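One common use is a fluent, builder-style interface, where each method returns the instance so calls can be chained. A sketch with a hypothetical QueryBuilder (typing.Self requires Python 3.11; typing_extensions provides a backport for older versions):

```python
try:
    from typing import Self            # Python 3.11+
except ImportError:
    from typing_extensions import Self  # backport for older interpreters

class QueryBuilder:
    def __init__(self) -> None:
        self.parts: list[str] = []

    def select(self, cols: str) -> Self:
        self.parts.append(f"SELECT {cols}")
        return self

    def from_table(self, table: str) -> Self:
        self.parts.append(f"FROM {table}")
        return self

    def build(self) -> str:
        return " ".join(self.parts)

print(QueryBuilder().select("*").from_table("users").build())
# SELECT * FROM users
```

Because the methods return Self rather than QueryBuilder, a subclass chaining these calls keeps its own type in the eyes of a checker.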
Any
The Any type simply indicates an object can be of any type. As such, it conveys no information beyond telling static checkers that this variable can hold a value of any type.
from typing import Any

def process_item(item: Any) -> None:
    print(f"Processing item: {item}")

# Usage examples
process_item(123)        # Processing an integer
process_item("Hello")    # Processing a string
process_item([1, 2, 3])  # Processing a list
Use this construct sparingly: heavy use of Any usually indicates you are not clear on how your code functions. If you need a generic function that works across various types, use a TypeVar instead:
from typing import TypeVar

T = TypeVar('T')

def duplicate(item: T) -> list[T]:
    return [item, item]

# Usage examples
print(duplicate(123))        # [123, 123]
print(duplicate("Hello"))    # ['Hello', 'Hello']
print(duplicate([1, 2, 3]))  # [[1, 2, 3], [1, 2, 3]]
Generally, TypeVar is the better choice because:
Type Safety: TypeVar helps maintain type consistency, which means the function or class can work with any type while ensuring that the same type is used throughout. This prevents type-related errors that could occur if different types are inadvertently mixed.
Code Readability and Intent: Using TypeVar clearly communicates the intent that a function or class is designed to be generic and work with any type, but consistently. This makes the code easier to understand and maintain.
Static Type Checking: Type checkers (like mypy) can provide better error checking and autocompletion support when TypeVar is used. This can catch potential bugs at development time instead of runtime.
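To illustrate the consistency point: with Any a checker loses track of the return type, while a TypeVar lets it flow through. A minimal sketch:

```python
from typing import Any, TypeVar

T = TypeVar("T")

def first_any(items: Any) -> Any:
    return items[0]

def first(items: list[T]) -> T:
    return items[0]

n = first([10, 20, 30])   # a checker infers n as int
s = first(["a", "b"])     # ...and s as str
x = first_any([10, 20])   # x is Any: the int-ness is invisible to the checker
print(n, s, x)            # 10 a 10
```

Both functions behave identically at runtime; the difference is entirely in what a static checker can prove about the results.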
Union
Union is a typing construct that indicates a value can be one of several types.
For example, Union[str, int] declares a value as either a string or an integer. Do note that when you use a Union to parameterize a generic type, as in list[Union[str, int]], the container will accept all strings, all ints, or a combination of the two.
from typing import Union

def process_record(record_id: Union[str, int]) -> str:
    if isinstance(record_id, str):
        return f"Processing record with string ID: {record_id}"
    elif isinstance(record_id, int):
        return f"Processing record with integer ID: {record_id}"
    else:
        return "Unsupported record ID type"

print(process_record("abc123"))  # Output: Processing record with string ID: abc123
print(process_record(456789))    # Output: Processing record with integer ID: 456789
Please be aware of this distinction: for a single value, Union is an either/or. A value of type Union[str, int] is a str or an int, never a mix of both. Only when the Union parameterizes a collection can elements of both types appear together.
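The distinction can be seen side by side in a small sketch:

```python
from typing import Union

# A single value: either a str or an int, never both at once
record_id: Union[str, int] = "abc123"

# A parameterized collection: elements of both types may mix freely
ids: list[Union[str, int]] = ["abc123", 456789, "def456"]

print(record_id)
print(ids)
```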
For dictionaries specifically, it is worth exploring TypedDict, which allows you to declare explicit types for specific keys and values.
from typing import TypedDict

class DataPipelineConfig(TypedDict):
    name: str
    batch_size: int
    is_active: bool

def display_pipeline_config(config: DataPipelineConfig) -> str:
    status = 'active' if config['is_active'] else 'inactive'
    return f"Pipeline {config['name']} with batch size {config['batch_size']} is currently {status}."

pipeline_config: DataPipelineConfig = {
    "name": "User ETL",
    "batch_size": 1000,
    "is_active": True
}

print(display_pipeline_config(pipeline_config))
# Output: Pipeline User ETL with batch size 1000 is currently active.
Generally however, if you need to be this elaborate with typing, it usually makes sense to define a dataclass or pydantic model instead (more on that in part 3).
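For comparison, the same configuration expressed as a dataclass — a sketch of the alternative mentioned above (pydantic models look similar and are covered in part 3):

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    name: str
    batch_size: int
    is_active: bool = True  # defaults come for free with dataclasses

cfg = PipelineConfig(name="User ETL", batch_size=1000)
status = "active" if cfg.is_active else "inactive"
print(f"Pipeline {cfg.name} with batch size {cfg.batch_size} is currently {status}.")
# Pipeline User ETL with batch size 1000 is currently active.
```

Unlike a TypedDict, a dataclass gives you attribute access, defaults, and a generated __init__ and __repr__, at the cost of no longer being a plain dict.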
Optional
As the name implies, Optional indicates a value may be absent: Optional[X] is equivalent to Union[X, None]. This is most often seen with arguments or attributes that are only used within specific contexts.
from typing import Optional

def process_data(file_path: str, delimiter: Optional[str] = None) -> str:
    delimiter_info = f" with delimiter '{delimiter}'" if delimiter else ""
    return f"Processing file at {file_path}{delimiter_info}"

print(process_data("/path/to/file.csv"))       # Output: Processing file at /path/to/file.csv
print(process_data("/path/to/file.csv", ","))  # Output: Processing file at /path/to/file.csv with delimiter ','
Literal
The Literal type can be used to define a literal instance of a type. For example, type hinting with Literal[42] indicates this object should always have the integer value 42. This is useful for configuration-like objects that have a set of hard-coded values.
from typing import Literal

def set_environment(env: Literal["development", "staging", "production"]) -> str:
    return f"Environment set to {env}"

print(set_environment("development"))  # Output: Environment set to development
print(set_environment("production"))   # Output: Environment set to production

# The following would raise a type error with a type checker like mypy, as "test" is not a valid literal
# print(set_environment("test"))
If a larger number of objects depend on this configuration, it can be beneficial to define it as an enum instead.
from enum import Enum

class Environment(Enum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    PRODUCTION = "production"

def set_environment(env: Environment) -> str:
    return f"Environment set to {env.value}"

print(set_environment(Environment.DEVELOPMENT))  # Output: Environment set to development
print(set_environment(Environment.PRODUCTION))   # Output: Environment set to production
Inspecting types in classes
Often, when you are not too familiar with the inner workings of objects in a code base or external library, it can be difficult to determine the right type to use when laying out your typing system. In the past, I would have resorted to reading a bunch of source code to figure this out. It is also possible to determine this at runtime by inspecting classes and variables.
Using vars
The vars function returns the __dict__ attribute of the given object, which contains all the attributes of the object.
class DataPipeline:
    def __init__(self, name: str, batch_size: int):
        self.name = name
        self.batch_size = batch_size
        self.status = "inactive"

pipeline = DataPipeline(name="ETL Pipeline", batch_size=500)
print(vars(pipeline))
# Output: {'name': 'ETL Pipeline', 'batch_size': 500, 'status': 'inactive'}
Using dir
The dir function returns a list of all the attributes and methods of the given object.
print(dir(pipeline))
# Output: ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'batch_size', 'name', 'status']
Using type
The type function returns the type of the given object.
print(type(pipeline))
# Output: <class '__main__.DataPipeline'>
print(type(pipeline.name))
# Output: <class 'str'>
print(type(pipeline.batch_size))
# Output: <class 'int'>
Using the inspect library
The inspect library provides several useful functions to get information about live objects, such as modules, classes, methods, functions, tracebacks, and code objects.
import inspect

# Get the members of the object
print(inspect.getmembers(pipeline))
# Output: [('__class__', <class '__main__.DataPipeline'>), ('__delattr__', <method-wrapper '__delattr__' of DataPipeline object at 0x7f8b6f3b1f10>), ('__dict__', {'name': 'ETL Pipeline', 'batch_size': 500, 'status': 'inactive'}), ...]

# Get the signature of the __init__ method
print(inspect.signature(DataPipeline.__init__))
# Output: (self, name: str, batch_size: int)
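Closely related, typing.get_type_hints resolves the declared annotations of a class or function directly, which is often the quickest way to recover the intended types. A small sketch using a class shaped like the DataPipeline above (redefined here with class-level annotations so the snippet is self-contained):

```python
from typing import get_type_hints

class DataPipeline:
    name: str
    batch_size: int

    def __init__(self, name: str, batch_size: int) -> None:
        self.name = name
        self.batch_size = batch_size

# Resolves annotations to actual type objects, including string forward refs
print(get_type_hints(DataPipeline))
# {'name': <class 'str'>, 'batch_size': <class 'int'>}
```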
These methods and functions can help you inspect and understand the types used in classes and objects, making it easier to implement type hinting and understand the structure of unfamiliar codebases.
With that, we’re going to conclude this section on python typing. Do note again that this is a massive topic in python and we’ve just scratched the surface here.
Moving on, you may wonder (or know) that typing and type hints in and of themselves don’t enforce anything in your code. This is the natural design of python as a dynamically typed language.
However, how can we take these typing constructs and actually do something in our code other than document things? This is where a framework like pydantic comes into play. In the following section, we will describe how to use pydantic to create data models in your code that leverage our previous knowledge on abstract base classes and typing to create a much more powerful data code model.