Feb. 19, 2022

Refactoring a Python Codebase with LibCST

published by	Adam Stepinski
in blog	Instawork Engineering

How can an engineering team maintain consistent conventions in a growing codebase? When working on new code, engineers add new patterns to improve productivity. But usually there’s no time to refactor older code to use the new pattern. Soon, there are multiple approaches to the same problem. This actually harms productivity as engineers debate the merits of the different patterns, spend time reimplementing their features, etc.

So what’s the solution? Use a codemod. A codemod is simply a script that refactors your source code automatically. The most basic codemod can be a “find & replace” script, but a more advanced codemod operates on the parsed syntactic structure of the code rather than on its raw text.

It takes some effort to write a codemod, but once you have it, you can find all instances of old patterns in the code and automatically update them to the new ones! This ensures we don’t have a mix of old and new patterns in the codebase, so old patterns won’t be propagated. On the off-chance an engineer commits code with an old pattern, you can simply run the codemod again to refactor those instances.

Writing Codemods with LibCST

At Instawork, our refactoring efforts fall into two major categories:

1. Updating our code to use a new library or function. For example, we’re adopting more custom matchers for the expects library in our unit tests to improve readability.
2. Moving classes to different Python modules. In our oldest repo, our largest file had over 23k lines (!), and we’ve successfully broken it down into a one-class-per-file structure.

We started looking for a codemod library to handle these two cases, and quickly settled on LibCST. LibCST has a strong pedigree as an open-source project from the Instagram engineering team. Instagram famously maintains one of the largest Python codebases in the world, and parent company Meta has a deep culture of using codemods. Additionally, LibCST supports all the latest Python 3 features, makes extensive use of type annotations, and comes with good documentation and unit tests. We felt optimistic that LibCST could fit our code-modding needs.

LibCST works by building up a CST, or concrete syntax tree. A CST represents a piece of code as a tree data structure. The nodes of the tree represent syntactic constructs such as expressions, statements, and function calls. Whitespace, newlines, and comments are also represented as nodes. Since a CST is a tree data structure, we can traverse it and modify it by adding, deleting, or changing nodes. Then, we can render the modified CST back to code, while preserving all of the formatting and comments of the original. The resulting code diff looks like a precise change made manually by a developer, but the process is fully automated.

Manipulating the CST is done with Visitors and Transformers. A Visitor traverses the CST without changing it. This allows us to explore the structure of the code, collect some metadata, or identify nodes that need to be changed. A Transformer is like a Visitor, but it can return replacement nodes, which changes the final output. (LibCST nodes are immutable, so a Transformer swaps in new nodes rather than editing existing ones.) The best way to understand how Visitors and Transformers work together is with an example.

Example: Updating Unit Test Assertions

Years ago, Instawork adopted the expects library to write more expressive unit tests. Unfortunately, we couldn’t use the library for all assertions due to lack of support for mocks. So we ended up with two different styles of assertions in our unit tests. This led to confusion for new engineers. Eventually we added a custom matcher for mocks, so I had the opportunity to convert all of our existing code to the new style:

https://medium.com/media/4744e8dc6697dfc0425be7788f033008/href

Since the function arguments can be any Python expression, a simple find & replace with a regular expression won’t suffice. This was the perfect opportunity to write a LibCST codemod to handle the refactor.
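A rough illustration of the problem, using only the standard library. The pattern is a deliberately naive attempt, and `have_been_called_with` is a hypothetical name for the custom matcher, which this post does not spell out.

```python
import re

# Naive pattern: capture the mock name, then "everything up to the next ')'".
NAIVE = re.compile(r"(\w+)\.assert_called_with\((.*?)\)")


def naive_rewrite(line: str) -> str:
    # have_been_called_with is a hypothetical matcher name.
    return NAIVE.sub(r"expect(\1).to(have_been_called_with(\2))", line)


simple = "mock_call.assert_called_with(value1, value2)"
nested = "mock_call.assert_called_with(build(1), value2)"

print(naive_rewrite(simple))
print(naive_rewrite(nested))
```

The first line converts as intended. In the second, the pattern stops at the closing parenthesis of `build(1)`, so `value2` ends up as a second argument to `.to(...)`. The result still parses as Python, which makes the mistake easy to miss.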

Before writing the codemod, I found it helped to visualize the original CST and the desired CST. This is the CST of the old format for mock assertions:

And this is the CST of the desired new format:

By visualizing the CST, I could formulate a plan for the codemod:

1. Identify a Call node where the function is an Attribute with Name “assert_called_with”.
2. Grab the mock’s name (“mock_call” in the example) from the Call node’s func.
3. Grab the assertion args from the Call node’s args (“value1”, “value2” in the example).
4. Build a new CST by plugging the variable name and assertion args into the new structure.

Step 1 is crucial to make sure we only apply the changes to mock assertions. I identified our target Call nodes with a visitor:

https://medium.com/media/4ae9942dd276a8d98042e169a5e89b86/href

I could now use this visitor within a Transformer. The transformer visits every Call node in the file and uses the visitor to check whether it matches our pattern. If it does, the transformer constructs the new CST from nodes and returns it:

https://medium.com/media/ac1f7dc313cdfc12ab17bad6fa1a94c6/href

That’s it! I could now use the LibCST command-line tools to execute this codemod against all of our Python test files to make the change across the entire codebase. It only took a couple of minutes, and I could be sure the new code was correct and free of syntax errors.

Conclusion

Writing a codemod with LibCST can be tricky at first. It took us a while to get the hang of it. It’s easy to get lost in the layers of abstraction when writing code that manipulates other code. I found the following process helps break down the task into more manageable steps:

1. Visualize the CST before and after the changes. Write a unit test and print out the CST, or add a debugger breakpoint to interact with the tree. This will give you a feel for the shape of the tree, and help you spot the target nodes for modification.
2. Identify the signature of the target nodes. Perhaps you’re looking for all function calls with a given name. Or maybe it’s the second argument in that function call. You want to be specific enough to avoid false positives and false negatives.
3. Use Visitors to tag the identified nodes. Once you know how to identify the target nodes, you will need to write visitors to collect references to these nodes. I find it’s good to use several small visitors, each with a small job. One visitor uses another visitor to extract references from a small part of the tree. The output of these visitors is a set of target nodes and any other information needed for the codemod.
4. Use a final Transformer to modify the identified nodes. Once you have a set of target nodes, you can perform the codemod using a Transformer. When your transformer visits one of the target nodes, return the updated node with your desired changes. The transformer can call your visitors in the initialization step to have the references up-front.
5. Write unit tests to catch edge cases. LibCST makes it easy to write unit tests comparing a code snippet “before and after” running the codemod. I found it’s easy to write dozens of tests to make sure the codemod handles all edge cases of formatting and nested expressions.

We’re relying on codemods more and more to bring consistency to our growing Python codebase. As our team scales, that consistency makes it easier for new engineers to be productive from day 1. Our hope is that all codebase-wide changes will be done with codemods to ensure we avoid the pitfall of “competing standards”.

Do you see opportunities to use codemods (and LibCST) at your company? Let us know in the comments and we can suggest which approaches will work best!

Refactoring a Python Codebase with LibCST was originally published in Instawork Engineering on Medium.