published by | Adam Stepinski |
---|---|
in blog | Instawork Engineering |
original entry | Refactoring a Python Codebase with LibCST |
How can an engineering team maintain consistent conventions in a growing codebase? When working on new code, engineers add new patterns to improve productivity. But usually there’s no time to refactor older code to use the new pattern. Soon, there are multiple approaches to the same problem. This actually harms productivity as engineers debate the merits of the different patterns, spend time reimplementing their features, etc.
So what’s the solution? Use a codemod. A codemod is simply a script that refactors your source code automatically. The most basic codemod can be a “find & replace” script, but a more advanced codemod operates on the semantic structure of the code.
It takes some effort to write a codemod, but once you have it, you can find all instances of old patterns in the code and automatically update them to the new ones! This ensures we don’t have a mix of old and new patterns in the codebase, so old patterns won’t be propagated. On the off-chance an engineer commits code with an old pattern, you can simply run the codemod again to refactor those instances.
At Instawork, our refactoring efforts fall into two major categories:
We started looking for a codemod library to handle these two cases, and quickly settled on LibCST. LibCST has a strong pedigree as an open-source project from the Instagram engineering team. Instagram famously maintains one of the largest Python codebases in the world, and parent company Meta has a deep culture of using codemods. Additionally, LibCST supports all the latest Python 3 features, makes extensive use of type annotations, and comes with good documentation and unit tests. We felt optimistic that LibCST could fit our code-modding needs.
LibCST works by building up a CST, or concrete syntax tree. A CST represents a piece of code as a tree data structure. The nodes of the tree represent semantic language concepts such as expressions, statements, function calls, etc. Whitespace, newlines, and comments are also represented as nodes. Since a CST is a tree data structure, we can traverse it and modify it by adding, deleting, or changing nodes. Then, we can render the modified CST back to code, while preserving all of the formatting and comments of the original. The resulting code diff looks like a precise change made manually by a developer, but the process is fully automated.
Manipulating the CST is done with Visitors and Transformers. A Visitor traverses the CST without changing it. This allows us to explore the structure of the code, collect some metadata, or identify nodes that need to be changed. A Transformer is like a Visitor, but it can also replace nodes to change the final output; LibCST nodes are immutable, so a Transformer returns updated copies rather than mutating the tree in place. The best way to understand how Visitors and Transformers work together is with an example.
Years ago, Instawork adopted the expects library to write more expressive unit tests. Unfortunately, we couldn’t use the library for all assertions because it lacked support for mocks, so we ended up with two different styles of assertions in our unit tests. This led to confusion for new engineers. Eventually we added a custom matcher for mocks, so I had the opportunity to convert all of our existing code to the new style:
https://medium.com/media/4744e8dc6697dfc0425be7788f033008/href
Since the function arguments can be any Python expression, a simple find & replace with a regular expression won’t suffice. This was the perfect opportunity to write a LibCST codemod to handle the refactor.
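To make the regular-expression problem concrete, here is a small sketch. The old and new assertion shapes (`mock.assert_called_with(...)` and `expect(mock).to(have_been_called_with(...))`) are assumptions for illustration, since the embedded snippet is not reproduced here. A naive pattern stops at the first closing parenthesis, so a nested call in the arguments produces code that still parses but means something else:

```python
import ast
import re

# Hypothetical old-style assertion; the nested call is what trips up the regex.
old_line = "mock_send.assert_called_with(build(user), retries=3)"

# Naive pattern: capture the arguments up to the first closing parenthesis.
pattern = re.compile(r"(\w+)\.assert_called_with\((.*?)\)")
naive = pattern.sub(r"expect(\1).to(have_been_called_with(\2))", old_line)

# What a correct rewrite would look like under the assumed new style.
intended = "expect(mock_send).to(have_been_called_with(build(user), retries=3))"

# The naive result is valid Python, so no syntax error flags the mistake,
# yet `retries=3` has ended up as an argument to `.to(...)`.
ast.parse(naive)
print(naive)
print(naive == intended)
```

A tree-based codemod avoids this because the arguments are already parsed as nodes instead of being matched as text.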
Before writing the codemod, I found it helped to visualize the original CST and the desired CST. This is the CST of the old format for mock assertions:
And this is the CST of the desired new format:
By visualizing the CST, I could formulate a plan for the codemod:
Step 1 is crucial to make sure we only apply the changes to mock assertions. I identified our target Call nodes with a visitor:
https://medium.com/media/4ae9942dd276a8d98042e169a5e89b86/href
I could now use this visitor within a Transformer. The transformer visits all call nodes in the file, and uses the visitor to see if it matches our pattern. If so, it constructs the new CST from nodes and returns it:
https://medium.com/media/ac1f7dc313cdfc12ab17bad6fa1a94c6/href
That’s it! I could now use the LibCST command-line tools to execute this codemod against all of our Python test files to make the change across the entire codebase. It only took a couple of minutes, and I could be sure the new code was correct and free of syntax errors.
Writing a codemod with LibCST can be tricky at first. It took us a while to get the hang of it. It’s easy to get lost in the layers of abstraction when writing code that manipulates other code. I found the following process helps break down the task into more manageable steps:
We’re relying on codemods more and more to bring consistency to our growing Python codebase. As our team scales, that consistency makes it easier for new engineers to be productive from day 1. Our hope is that all codebase-wide changes will be done with codemods to ensure we avoid the pitfall of “competing standards”.
Do you see opportunities to use codemods (and LibCST) at your company? Let us know in the comments and we can suggest which approaches will work best!
Refactoring a Python Codebase with LibCST was originally published in Instawork Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.