Xuanda Yang
Zhejiang University, China
th3charlie at gmail dot com
GitHub Profile
GSOC Project Page
This is the summary page of my GSOC 2020 project: Generalize mypyc IR to Make non-C Backends Possible, advised by Jukka Lehtosalo and Michael Sullivan at the mypy organization.
Mypyc is a compiler that compiles mypy-annotated, typed Python code into CPython extensions to improve performance. It was originally written as a tool to accelerate mypy at Dropbox, and it is now becoming a generic compiler that can speed up any typed Python codebase.
Beyond the C backend that mypyc has now, the mypyc team would like to experiment with other backends to try out different features. However, a problem with the current design is that mypyc embeds a lot of backend-specific information within the IR. This makes it hard to generalize to new backends, since under this design the implementer of a new backend must supply that backend-specific information in the IR all over again.
For example, PrimitiveOp is an almighty IR element that represents calls to functions, low-level operations, and even some exception handling. It relies heavily on a custom callback function to generate the corresponding C code. This was simple and straightforward when only a C backend was needed, since every new operation could be added with one new callback, but it is not suited to multiple backends.
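To make the coupling concrete, here is a minimal sketch of a callback-driven op description. The class CallbackOp, the function emit_list_len, and the callback signature are hypothetical simplifications for illustration, not mypyc's actual definitions; only PyList_GET_SIZE is a real CPython macro.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical, simplified stand-in for a callback-driven op description.
@dataclass
class CallbackOp:
    name: str
    # The callback receives the argument registers and a destination register
    # and returns a line of C source text.
    emit: Callable[[List[str], str], str]


def emit_list_len(args: List[str], dest: str) -> str:
    # All C-specific knowledge (here, a CPython macro) lives inside the callback.
    return f"{dest} = PyList_GET_SIZE({args[0]});"


list_len_op = CallbackOp(name="builtins.len", emit=emit_list_len)

# The IR only holds an opaque callable, so the only thing it can do is emit C.
c_line = list_len_op.emit(["r0"], "r1")
print(c_line)
```

In a design like this, the IR cannot say what the op does without running the callback, so a second backend would need a second callback for every op.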
In this GSOC project, we expect to modify the current IR design and add new IR structs to improve the overall expressiveness of mypyc IR and eventually make non-C backends possible.
In the following sections, we will discuss the key points throughout the project and how we implement them. We will start by extracting the function call semantics from the almighty PrimitiveOp. Then we will focus on low-level operations. Finally, we will summarize other miscellaneous work during this project. For each topic, we will provide links to GitHub PRs so interested readers can access them directly.
We will also discuss the remaining work and future directions based on this project and we will end with a short summary.
The large majority of PrimitiveOp's uses represent function calls to the CPython C library and mypyc's C runtime library. Because these calls vary so much, the old design chose the callback function approach. To decouple the backend-specific codegen information from the function call itself, we design a simple IR element, CallC, to represent calls to the C library. Since mypyc does not reimplement a virtual machine from scratch, we can safely expect non-C backends to continue to call the CPython C API. Therefore, this new IR element can be easily reused across backends. We support all kinds of calls present in mypyc.
Being able to support function calls at a higher level with the new IR is not enough; we also need to be able to represent all the old calls, and each call may deviate in some way from the most general form. Therefore, we further improve the expressiveness with three features: a variable number of arguments, to match the C language feature; argument reordering, so that the final call can take its arguments in a different order from the mypy AST; and a truncated type, so that the value seen at the Python level can have a different type from the native function's return type. We also support new error kinds, which use the mypyc exception transform pass to handle CPython exceptions properly.
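The features above can be sketched as a declarative call description. The class CallDescription, its field names, the type and error-kind strings, and the helper functions are all hypothetical simplifications for illustration, not mypyc's actual CallC definition. PySequence_Contains and PyTuple_Pack are real CPython C API functions, used here only as examples of a reordered call and a variadic call.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical declarative description of a C call (not mypyc's real class).
@dataclass
class CallDescription:
    c_function_name: str
    arg_types: List[str]                   # types of the fixed arguments
    return_type: str                       # type returned by the C function
    var_arg_type: Optional[str] = None     # type of trailing variadic args, if any
    ordering: Optional[List[int]] = None   # ordering[i] = source index for C position i
    truncated_type: Optional[str] = None   # narrower type seen on the Python side
    error_kind: str = "never_fails"        # illustrative label, not a real constant


def build_call_args(desc: CallDescription, args: List[str]) -> List[str]:
    """Check the argument count and apply the optional reordering."""
    fixed = len(desc.arg_types)
    if desc.var_arg_type is None:
        if len(args) != fixed:
            raise ValueError("wrong number of arguments")
    elif len(args) < fixed:
        raise ValueError("too few arguments")
    if desc.ordering is not None:
        if desc.var_arg_type is not None:
            raise ValueError("this sketch does not combine reordering with variadic calls")
        return [args[i] for i in desc.ordering]
    return list(args)


def render_call(desc: CallDescription, args: List[str]) -> str:
    """Render the call as C-like text; another backend could consume desc differently."""
    return f"{desc.c_function_name}({', '.join(build_call_args(desc, args))})"


# `item in container` supplies (item, container), but the C function takes
# (container, item) and returns an int that is truncated to a bool.
contains = CallDescription(
    c_function_name="PySequence_Contains",
    arg_types=["object", "object"],
    return_type="c_int",
    ordering=[1, 0],
    truncated_type="bool",
    error_kind="negative_result",
)
contains_call = render_call(contains, ["item", "container"])

# A variadic call: one fixed size argument followed by any number of objects.
pack = CallDescription(
    c_function_name="PyTuple_Pack",
    arg_types=["ssize_t"],
    return_type="object",
    var_arg_type="object",
    error_kind="null_result",
)
pack_call = render_call(pack, ["2", "a", "b"])
```

Because the description is plain data rather than a callback, a different backend can read the same fields and produce its own calling sequence.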
Originally, mypyc relied on inlined low-level integer/pointer operations and macros. However, inline functions and macros are not available in every backend; an x86 assembly backend, for example, has neither. Inlined functions and macros also limit our ability to do aggressive analysis and optimization on the IR, since what they actually do is only known during final code generation. Finally, some existing function calls require integers or pointers as part of their arguments or exception handling. Therefore, implementing these low-level concepts and operations is a must in this project.
We start by working on low-level integers. Unlike Python integers, which have variable length, low-level integers have a fixed size on modern platforms. They are used to represent the return status of a function, predefined CPython constants, offsets in pointer arithmetic, and so on. We introduce int32 and int64 and define aliases corresponding to CPython's Py_ssize_t. Based on these types, we implement BinaryIntOp and ComparisonOp to represent arithmetic and comparisons between them. We spent a significant amount of time here building the integer comparison operations with the new IR and optimizing them.
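As a small illustration of why fixed-size integers need their own types and operations, the sketch below models 32-bit and 64-bit arithmetic on top of Python's unbounded integers. The names FixedInt, wrap_signed, add, and less_than are hypothetical and not mypyc's API, and the two's-complement wrap-around is an assumption of this sketch rather than a statement about what mypyc or C guarantees.

```python
from dataclasses import dataclass


# Hypothetical model of a fixed-width integer type.
@dataclass(frozen=True)
class FixedInt:
    bits: int


INT32 = FixedInt(32)
INT64 = FixedInt(64)


def to_unsigned(value: int, t: FixedInt) -> int:
    """Keep only the low t.bits bits of value."""
    return value & ((1 << t.bits) - 1)


def wrap_signed(value: int, t: FixedInt) -> int:
    """Interpret the low t.bits bits of value as a two's-complement number."""
    value = to_unsigned(value, t)
    if value >= 1 << (t.bits - 1):
        value -= 1 << t.bits
    return value


def add(lhs: int, rhs: int, t: FixedInt) -> int:
    """Fixed-width addition with wrap-around (an assumption of this sketch)."""
    return wrap_signed(lhs + rhs, t)


def less_than(lhs: int, rhs: int, t: FixedInt, signed: bool) -> bool:
    """The same bit patterns compare differently as signed or unsigned values."""
    if signed:
        return wrap_signed(lhs, t) < wrap_signed(rhs, t)
    return to_unsigned(lhs, t) < to_unsigned(rhs, t)


overflowed = add(2**31 - 1, 1, INT32)            # wraps around in 32 bits
fits = add(2**31 - 1, 1, INT64)                  # fits comfortably in 64 bits
signed_lt = less_than(-1, 1, INT32, signed=True)
unsigned_lt = less_than(-1, 1, INT32, signed=False)
```

The comparison example shows one reason an integer comparison op has to carry more than the operator itself: the bit pattern of -1 is the largest unsigned value, so the result depends on the signedness chosen.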
Pointer operations are also important. We need to be able to read and write a specific memory address, take the address of a given value, get a field from a struct, and do arbitrary pointer access with pointer arithmetic. For this functionality, we introduce corresponding IR elements: LoadMem, SetMem, LoadAddress, and GetElementPtr. In order to represent CPython structs within the IR, we design a new RStruct type that carries enough information for the IR to work on the related structs.
A good example demonstrating the power of these low-level operations is [mypyc] Implement builtins.len primitive for list, where an op that originally relied on a highly customized callback and a macro is instead built directly in the IR building phase out of generic IR elements.
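The idea behind such a lowering can be simulated in a few lines: compute the address of the size field from a struct layout, then load the value stored there. Everything below is a hypothetical model. StructLayout and the three helper functions only borrow the names of the concepts, the byte buffer stands in for real memory, and the layout assumes a 64-bit build in which the three header fields of a variable-size object are 8 bytes each.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical struct description: an ordered list of (field name, size in bytes).
@dataclass
class StructLayout:
    name: str
    fields: List[Tuple[str, int]]

    def offset_of(self, field: str) -> int:
        """Sum the sizes of the fields that precede the requested one."""
        offset = 0
        for field_name, size in self.fields:
            if field_name == field:
                return offset
            offset += size
        raise KeyError(field)


# Assumed 64-bit layout of a variable-size object header.
VAR_OBJECT = StructLayout(
    "PyVarObject", [("ob_refcnt", 8), ("ob_type", 8), ("ob_size", 8)]
)


def get_element_ptr(base: int, layout: StructLayout, field: str) -> int:
    """Address of a field, given the address of the struct."""
    return base + layout.offset_of(field)


def set_mem(memory: bytearray, address: int, value: int) -> None:
    """Store a signed 64-bit little-endian integer at the address."""
    struct.pack_into("<q", memory, address, value)


def load_mem(memory: bytearray, address: int) -> int:
    """Load a signed 64-bit little-endian integer from the address."""
    return struct.unpack_from("<q", memory, address)[0]


memory = bytearray(64)   # stand-in for process memory
list_address = 16        # pretend a list object lives at this address

# Pretend the runtime created a list with three items.
set_mem(memory, get_element_ptr(list_address, VAR_OBJECT, "ob_size"), 3)

# len(list) expressed as "address of the size field" followed by "load".
size_address = get_element_ptr(list_address, VAR_OBJECT, "ob_size")
length = load_mem(memory, size_address)
```

Once the operation is split into an address computation and a load, nothing in it depends on a C macro, which is what lets other backends and IR-level optimizations see what the op does.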
As our new IR is getting more and more fine-grained and low-level, the IR output is also getting more verbose. Several changes have been made regarding this topic to either improve the IR output to better match with the final generated code or remove redundant verbosity.
Our new IR greatly improves the expressiveness of mypyc IR so that we can easily build complex and even fast operations with them. The following PRs demonstrate the idea with a performance boost on specific operations.
A large part of our work is to migrate all the supported operations from the old-style IR to our new IR. Since we designed the new IR elements carefully and provided a working example for each of them, migrating the rest of the ops only needs some time and patience. We do not list all the related PRs here.
You can find all the commits during the GSOC project here: All Commits
As this year's GSOC comes to an end, a few leftovers remain to be finished after the project. Some remaining old-style ops require extra care and tweaks; we recorded them in Remaining old style primitive ops. These ops should be doable within a short amount of time after the project ends.
We believe this project eliminates some major blockers on the way to implementing new backends, and we expect to see new experimental backends based on it. As I write this post, one core member of the mypyc team is already starting to write an x86 assembly backend. We are looking forward to seeing more such experiments and to using the feedback from them to further improve the mypyc IR.
This project focuses mainly on the almighty PrimitiveOp. There are still other components in the IR that are to some degree tightly bound to the C backend, for example, the Box and Unbox operations. Another direction for future work would be to further improve the expressiveness of the IR by removing the tightly coupled design in the remaining parts of the IR system.
In this GSOC project, we redesign part of the mypyc IR system and improve the IR's expressiveness, turning experimentation with new backends from nearly impossible into something practical.
I'd like to express my sincere gratitude to every person and organization that made my GSOC 2020 adventure possible. I'd like to first thank Google and mypy for giving me this opportunity.
In particular, I'd like to thank my mentors Jukka Lehtosalo and Michael Sullivan. We had daily syncs on Gitter and weekly/monthly video meetings via Zoom. In every discussion, they helped me clarify my thoughts and find the best approach to meet our goals. They responded quickly to my PRs, giving high-quality review comments and suggestions. They mentored me with patience and passion, and I felt connected even though we were several timezones and continents apart. Their guidance went beyond the scope of the project and helped me build good software engineering skills along the way.
I would also like to thank my parents for supporting my work on this project, and Yaozhu Sun from the University of Hong Kong, who lit my passion for the field of compilers and programming languages two years ago. Finally, I'd like to thank Kanemura Miku from Hinatazaka 46 for all the mental support during this special summer.