Using BAP as a Research and Engineering Toolkit

Probably the best feature of BAP is its module design and the language facilities that come from its being written in OCaml, although it is hard to separate the effect of the language from the talent of the developers. That's not to start a language war, just to highlight why tools that assist development are so critical. To start, the module system that BAP is built around allows maximal code reuse with minimal effort spent wrangling the language. Contrast this with Java, for example, where it can be difficult to express what you need when everything at the type level is an object or a class. In OCaml, functions are the mechanism of composition, and they are given first class treatment by the module system. If you have developed in a variety of languages, you may have noticed that Gang of Four design patterns often arise from the difficulty of expressing concisely what the compiler should do on your behalf. It is much easier to write and maintain good code when these needs can be expressed cleanly.
The module system is fantastic for authoring abstract interfaces that other components can then implement. Merlin is superb for interrogating the compiler about what it sees. Source rewriting and code generation are possible with ppx, which is wonderful when you want to derive some facility for a given type using annotations. Other languages have look-alikes for these features, but they are only beginning to catch up to the functional paradigm from which many advanced language features originated. I've used BAP in my own time to develop custom disassemblers and static analysis tools for binaries, and I have plans for custom interpreters and compiler components, all while keeping an eye on code quality and correctness.
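To make the point about abstract interfaces a little more concrete, here is a minimal sketch in plain OCaml with no BAP dependency. The names ADDR, Make_visited, Int_addr and Visited are hypothetical and invented for this illustration; they are not part of BAP. The idea is that a component is written once against an interface, and any module that satisfies the interface can fill it in.

```ocaml
(* Hypothetical interface: anything address-like that can be ordered
   and printed. Not a BAP signature; invented for this sketch. *)
module type ADDR = sig
  type t
  val compare : t -> t -> int
  val to_string : t -> string
end

(* A functor: given any ADDR, build a "visited set" component.
   The component never needs to know how addresses are represented. *)
module Make_visited (A : ADDR) = struct
  module S = Set.Make (A)
  let empty = S.empty
  let visit addr visited = S.add addr visited
  let seen addr visited = S.mem addr visited
  let to_string visited =
    S.elements visited |> List.map A.to_string |> String.concat ", "
end

(* One possible implementation of the interface: plain integers. *)
module Int_addr = struct
  type t = int
  let compare (a : int) (b : int) = compare a b
  let to_string a = Printf.sprintf "0x%x" a
end

module Visited = Make_visited (Int_addr)

(* Visiting the same address twice leaves a single entry. *)
let example =
  Visited.(empty |> visit 0x400010 |> visit 0x400000 |> visit 0x400010)

let () = print_endline (Visited.to_string example)
(* prints: 0x400000, 0x400010 *)
```

Swapping Int_addr for a 64-bit or bitvector-backed module would require no change to Make_visited, which is the kind of reuse described above.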
With BAP and OCaml, I was able to implement a phased, custom disassembler that targets multiple architectures and supports multiple platforms, all while maintaining a well-groomed interface that lets both consumers and my tests use or compose exactly the functionality they need. In addition, the custom static analyses and interpreter implementations that I will be building would take substantially longer to write without the base library facilities that BAP makes freely available. To show why the features highlighted above are so valuable, I'll describe which components of BAP my existing utilities consume, what they provide, and what my future utilities will use.
In particular, the phased disassembler first uses some powerful algorithmic heuristics to identify what in the superset of disassembly cannot be a valid instruction, and then back propagates this information to rule out any path that leads into a violation of assembler invariants that no compiler would produce. The final result is a superset of disassembly with some structural properties regarding the noise that remains when the bytes are interpreted as assembler. I managed this with a handful of new insights about disassembly and about control flow graph structure and topology, which are discussed at length in my white paper. To do any of this in practice, however, I needed functionality for computing over at least addresses, memory, executable images, assembler, graphs and semantics, and data structures that interoperate with each of these were absolutely critical. Those data structures include trees, hash sets and maps, along with a multitude of typical operations on these types that would not be possible without OCaml's polymorphic typing and module features.
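The two phases described above can be illustrated with a toy model. This is only a sketch under loud assumptions and not the algorithm from the white paper: the three-instruction set is invented, "invalid" simply means "does not decode", and running off the end of the buffer is treated as a violation. Every name here (insn, decode, successors, prune, surviving) is hypothetical, and nothing in it uses BAP.

```ocaml
(* Toy instruction set invented for this sketch (loosely x86-flavoured). *)
type insn =
  | Nop          (* 0x90: 1 byte, falls through to the next offset *)
  | Ret          (* 0xC3: 1 byte, no successor *)
  | Jmp of int   (* 0xEB rel8: 2 bytes, carries its absolute target *)

(* Decode at [off]; None means the bytes there are not an instruction. *)
let decode (code : string) (off : int) : insn option =
  let n = String.length code in
  if off < 0 || off >= n then None
  else
    match Char.code code.[off] with
    | 0x90 -> Some Nop
    | 0xC3 -> Some Ret
    | 0xEB when off + 1 < n ->
      let b = Char.code code.[off + 1] in
      let rel = if b >= 128 then b - 256 else b in  (* sign-extend rel8 *)
      Some (Jmp (off + 2 + rel))
    | _ -> None

let successors off = function
  | Nop -> [ off + 1 ]
  | Ret -> []
  | Jmp target -> [ target ]

(* Phase 1: decode at every offset (the superset).
   Phase 2: back propagate invalidity to a fixpoint: an instruction
   whose successor is invalid or out of bounds is itself invalid. *)
let prune (code : string) : bool array =
  let n = String.length code in
  let insns = Array.init n (decode code) in
  let valid = Array.map (fun i -> i <> None) insns in
  let ok t = t >= 0 && t < n && valid.(t) in
  let changed = ref true in
  while !changed do
    changed := false;
    Array.iteri
      (fun off insn ->
         match insn with
         | Some i when valid.(off) ->
           if not (List.for_all ok (successors off i)) then begin
             valid.(off) <- false;
             changed := true
           end
         | _ -> ())
      insns
  done;
  valid

(* Offsets that survive both phases. *)
let surviving code =
  prune code |> Array.to_list
  |> List.mapi (fun off v -> (off, v))
  |> List.filter snd |> List.map fst

(* nop; nop; <bad>; jmp +1; <bad>; ret *)
let example = "\x90\x90\xFF\xEB\x01\xFF\xC3"

let () =
  surviving example |> List.map string_of_int
  |> String.concat " " |> print_endline
(* prints: 3 6 *)
```

In the example, the two leading nops decode fine on their own but are discarded because they can only fall through into a byte that does not decode, while the jump at offset 3 survives because its target at offset 6 is a valid ret.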
Additionally, I plan to feed this disassembly superset into our static analysis to provide guarantees on accuracy while at the same time enabling feedback between the disassembly and the analysis that improves the accuracy of each. In our specific case, it helps to provide a versioned analysis output that depends on the results of each possible interpretation decision. This way, the plugin architecture lets symbiotic analyses mutually reinforce one another! The static analysis itself is even hungrier for BAP's features, this time both consuming and customizing capabilities ranging from the intermediate representation (structure) to the intermediate language (assembler semantics), static single assignment, and more. BAP is therefore critical to our success. One fantastic bonus is that, under the hood, you get to take advantage of the power of LLVM, so if you ever need to dip into compilation capabilities or algorithms, they are there.
In my case, writing the non-compiler components in OCaml helped a lot because it allowed me to distill the disassembly algorithm into just the mathematical components about bytes and graph structures, which have nothing to do with the specifics of architectures or platforms. The discipline that BAP introduces, in letting you focus on just the algorithmic goodies, gives me comfort as a researcher, because I can make progress on the things I care most about and not worry about implementation errata. It's very exciting to think I can get so many features without much effort, since I personally like the idea of people being able to use my code as much as possible. Satisfaction is pretty much guaranteed.
Python is admittedly a very popular favorite in the reverse engineering and hacking community. It suits exactly what the name "hacker" suggests: quickly hacking together a solution.
But OCaml has an interactive interpreter as well, and with the size and scope of BAP, it's hard to imagine maintaining anything in which you cannot establish lasting confidence until you run it (when it blows up over type errors). Python has its place, but BAP's ambition is enormous, its objectives are long term, and its capabilities are robust. I enjoy using Python when I can, but I much prefer to know ahead of time when code that certainly wouldn't work must be changed. Nothing against Python.
