Friday, July 30, 2021

Breaking Protocol (Buffers): Reverse Engineering gRPC Binaries

by Ethan Shackelford

The Basics

gRPC is an open-source RPC framework from Google which leverages automatic code generation to allow easy integration to a number of languages. Architecturally, it follows the standard seen in many other RPC frameworks: services are defined which determine the available RPCs. It uses HTTP version 2 as its transport, and supports plain HTTP as well as HTTPS for secure communication. Services and messages, which act as the structures passed to and returned by defined RPCs, are defined as protocol buffers. Protocol buffers are a common serialization solution, also designed by Google.

Protocol Buffers

Serialization using protobufs is accomplished by definining services and messages in .proto files, which are then used by the protoc protocol buffer compiler to generate boilerplate code in whatever language you're working in. An example .proto file might look like the following:

// Declares which syntax version is to follow; read by protoc
syntax = "proto3";

// package name allows for namespacing to avoid conflicts
// between message types. Will also determine namespace in C++
package stringmanipulation;


// The Service definition: this specifies what RPCs are offered
// by the service
service StringManipulation {

    // First RPC. RPC definitions are like function prototypes:
    // RPC name, argument types, and return type is specified.
    rpc reverseString (StringRequest) returns (StringReply) {}

    // Second RPC. There can be arbitrarily many defined for
    // a service.
    rpc uppercaseString (StringRequest) returns (StringReply) {}
}

// Example of a message definition, containing only scalar values.
// Each message field has a defined type, a name, and a field number.
message innerMessage {
    int32 some_val = 1;
    string some_string = 2;
}

// It is also possible to specify an enum type. This can
// be used as a member of other messages.
enum testEnumeration {
    ZERO = 0;
    ONE = 1;
    TWO = 2;
    THREE = 3;
    FOUR = 4;
    FIVE = 5;
}

// messages can contain other messages as field types.
message complexMessage {
    innerMessage some_message = 1;
    testEnumeration innerEnum = 2;
}

// This message is the type used as the input to both defined RPCs.
// Messages can be arbitrarily nested, and contain arbitrarily complex types.
message StringRequest {
    complexMessage cm = 1;
    string original = 2;
    int64 timestamp = 3;
    int32 testval = 4;
    int32 testval2 = 5;
    int32 testval3 = 6;
}

// This message is the type for the return value of both defined RPCs.
message StringReply {
    string result = 4;
    int64 timestamp = 2;
    complexMessage cm = 3;
}

There is a lot more to protocol buffers and the available options, if you're interested Google has a very good language guide.

gRPC

gRPC is an RPC implementation designed to use protobufs to take care of all boilerplating necessary for implementation, as well as provided functions to manage the connection between the RPC server and its clients. The majority of compiled code in a gRPC server binary will likely be either gRPC library code and autogenerated classes, stubs etc. created with protoc. Only the actual implementation of RPCs is required of the developer and accomplished by extending the base Service class generated by protoc based on the definitions in .proto files..

Transport

gRPC uses HTTP2 for transport, which can either be on top of a TLS connection, or in the clear. gRPC also supports mTLS out of the box. What type of channel is used is configured by the developer while setting up the server/client.

Authentication

As mentioned above, gRPC support mTLS, wherein both the server and the client are identified based on exchanged TLS certificates. This appears to be the most common authentication mechanism seen in the wild (though "no authentication" is also popular). gRPC also supports Google's weird ALTS which I've never seen actually being used, as well as token-based authentication.

It is also possible that the built-in authentication mechanisms will be eschewed for a custom authentication mechanism. Such a custom implementation is of particular interest from a security perspective, as the need for a custom mechanism suggests a more complex (and thus more error prone) authentication requirement.

gRPC Server Implementation

The following will be an overview of the major parts of a gRPC server implementation in C++. A compiled gRPC server binary can be extremely difficult to follow, thanks to the extensive automatically generated code and heavy use of gRPC library functions. Understanding the rough structure that any such server will follow (important function calls and their arguments) will greatly improve your ability to make sense of things and identify relevant sections of code which may present an attack surface.

Server Setup

The following is the setup boilerplate for a simple gRPC server. While a real implementation will likely be more complex, the function calls seen here will be the ones to look for in unraveling the code.

void RunServer() {
    std::string listen = "127.0.0.1:50006";
    // This is the class defined to implement RPCs, will be covered later
    StringManipulationImpl service;

    ServerBuilder builder;

    builder.AddListeningPort(listen, grpc::InsecureServerCredentials());
    builder.RegisterService(&service);

    std::unique_ptr<grpc::Server> server(builder.BuildAndStart());
    std::cout << "Server listening on port: " << listen << "\n";
    server->Wait();
}
  • builder.AddListeningPort: This function sets up the listening socket as well as handling the transport setup for the channel.
    • arg1: addr_uri: a string composed of the IP address and port to listen on, separated by a colon. i.e. "127.0.0.1:50001"
    • arg2: creds: The credentials associated with the server. The function call used here to generate credentials will indicate what kind of transport is being used, as follows:
      • InsecureServerCredentials: No encryption; plain HTTP2
      • SslServerCredentials: TLS is in use, meaning the client can verify the server and communication will be encrypted. If client authentication (mTLS) is to be used, relevant options will be passed to this function call. For example, setting opts.client_certificate_request to GRPC_SSL_REQUEST_AND_REQUIRE_CLIENT_CERTIFICATE_AND_VERIFY will require the client supply a valid certificate. Any potential vulnerabilities at this point will be in the options passed to the SslServerCredentials constructor, and will be familiar to any consultant. Do they verify the client certificate? Are self-signed certificates allowed? etc., standard TLS issues.
  • builder.RegisterService: This crucial function is what determines what services (and thereby what RPC calls) are available to a connecting client. This function is called as many times as there are services. The argument to the function is an instance of the class which actually implements the logic for each of the RPCs -- custom code. This is the main point of interest for any gRPC server code review or static analysis, as it will contain the clients own implementation, where the likelihood of mistakes and errors will be higher.
RPC Implementation

The following is the implementation of the StringManipulationImpl instance passed to RegisterService above.

class StringManipulationImpl : public stringmanipulation::StringManipulation::Service {
    Status reverseString(ServerContext *context, 
                         const StringRequest *request, 
                         StringReply *reply) {


        std::string original = request->original();
        std::string working_copy = original;
        std::reverse(working_copy.begin(), working_copy.end());
        reply->set_result(working_copy);

        struct timeval tv;
        gettimeofday(&tv, NULL);

        printf("[%ld|%s] reverseString(\"%s\") -> \"%s\"\n", 
                tv.tv_sec, 
                context->peer().c_str(), 
                request->original().c_str(), 
                working_copy.c_str());

        return Status::OK;
    }

    Status uppercaseString(ServerContext *context, 
                           const StringRequest *request, 
                           StringReply *reply) {

        std::string working_copy = request->original();
        for (auto &c: working_copy) c = toupper(c);
        reply->set_result(working_copy.c_str());

        struct timeval tv;
        gettimeofday(&tv, NULL);

        printf("[%ld|%s] uppercaseString(\"%s\") -> \"%s\"\n", 
                tv.tv_sec, 
                context->peer().c_str(), 
                request->original().c_str(), 
                working_copy.c_str());

        return Status::OK;

    }
};

Here we see the implementation for each of the two defined RPCs for the StringManipulation service. This is accomplished by extending the base service class generated by protoc. gRPC implementation code like this will often follow this naming scheme, or something like it -- the service name, appended by "Impl," "Implementation," etc.

Static Analysis

Finding Interesting Logic

These functions, generally, are among the most interesting targets in any test of a gRPC service. The bulk of the logic baked into a gRPC binary will be library code, and these functions which will actually be parsing and handling the data transmitted via the gRPC link. These functions can be located/categorized by looking for calls to builder.RegisterService.

 

Here we see just one call, because the example is simple, but in a more complex implementation there may be many calls to this function. Each one represents a particular service being made available, and will allow for the tracking down of the implementations of each RPC for those services. Navigating to the cross reference address, we see that an object is being passed to this function. Keep in mind this binary has been pre-annotated for clarity and the initial output of the reverse engineering tool will likely be less clear. However the function calls we care about should be clear enough to follow without much effort. 

 

We see that before being passed to RegisterService, the stringManipulationImplInstance (name added by me) is being passed to a function, StringManipulationImpl::StringManipulationImpl. Based both on the context and the demangled name, this is a constructor for whatever class this is. We can see the constructor itself is very simple: 


The function calls another constructor (the base class constructor) on the passed object, then sets the value at object offset 0. In C++, this offset is usually (and in this case) reserved for the class's vtable. Navigating to that address, we can see it:

 

Because this binary is not stripped, the actual names of the functions (matching the RPCs) are displayed. With a stripped binary, this is not the case, however an important quirk of the gRPC implementation results in the vtables for service implementations always being structured in a particular way, as follows.

  • The first two entries in the vtable are constructor/destructors.
  • Each subsequent entry is one of the custom RPC implementations, in the order that they appear in the .proto file. This means that if you are in possession of the .proto file for a particular service, even if a binary is stripped, you can quickly identify which implementation corresponds to which RPC. And if you don't have the .proto file, but do have the binary, there is tooling available which is very effective at recovering .proto files from gRPC binaries, which will be covered later. This is helpful not only because you may get a hint at what the RPC does based on its name, but also because you will know the exact types of each of the arguments.

Anatomy of an RPC

There are a few details which will be common to all RPC implementations which will aid greatly in reverse engineering these functions. The first are the arguments to the functions:

  • Argument 1: Return value, usually of type grpc::Status. This is a C++ ABI thing, see section 3.1.3.1 of the Itanium C++ ABI Spec. Tracking sections of the code which write to this argument may be helpful in understanding authorization logic which may be baked into the function, for example if a function is called, and depending on its return value, arg1 is set to either grpc::Status::OK or grpc::Status::CANCELLED, that function may have something to do with access controls.

  • Argument 2: The this pointer. Points to the instance of whatever service class the RPC is a method on.
  • Argument 3: ServerContext. From the gRPC documentation:

    A ServerContext or CallbackServerContext allows the code implementing a service handler to:

    • Add custom initial and trailing metadata key-value pairs that will propagated to the client side.
    • Control call settings such as compression and authentication.
    • Access metadata coming from the client.
    • Get performance metrics (ie, census).

    We can see in this function that the context is being accessed in a call to ServerContextBase::peer, which retrieves metadata containing the client's IP and port. For the purposes of reverse engineering, that means that accesses of this argument (or method calls on it) can be used to access metadata and/or authentication information associated with the client calling the RPC. So, it may be of interest regarding authentication/authorization auditing. Additionally, if metadata is being parsed, look for data parsing/memory corruption etc. issues there.

  • Argument 4: RPC call argument object. This object will be of the input type specified by the .proto file for a given RPC. So in this example, this argument would be of type stringmanipulation::StringRequest. Generally, this is the data that the RPC will be parsing and manipulating, so any logic associated with handling this data is important to review for data parsing issues or similar that may lead to vulnerabilities.

  • Argument 5: RPC call return object. This object will be of the return type specified by the .proto file for a given RPC. So in this example, this argument would be of type stringmanipulation::StringReply. This is the object which is manipulated prior to return to the client.

Note: In addition to unary RPCs (a single request object and single response object), gRPC also supports streaming RPCs. In the case of unidirectional streams, i.e. where only one of the request or response is a stream, the number of arguments and order is the same, and only the type of one of the arguments will differ. For client-side streaming (i.e. the request is streamed) Argument 4 will be wrapped with a ServerReader, so in this example it will be of type ServerReader<StringRequest>. For Server side streaming (streamed response), it will be wrapped with a ServerWriter, so ServerWriter<StringReply>.

For bidirectional streams, where both the request and the response are streamed, the number of arguments differ. Rather than a separate argument for request and response, the function only has four arguments, with the forth being a ServerReaderWriter wrapping both types. In this example, ServerReaderWriter<StringRequest, StringReply>. See the gRPC documentation for more information on these wrappers. The C++ Basics Tutorial has some good examples.

Protobuf Member Accesses in C++

The classes generated by protoc for each of the input/output types defined in the .proto file are fairly simple. Scalar typed members are stored by value as member variables inside the class instance. Non-scalar values are stored as pointers to the member. The class includes (among other things) the following functions for getting and setting members:

  • .<member>(): get the value of the field with name <member>. This is applicable to all types, and will return the value itself for scalar types and a pointer to the member for complex/allocated types.
  • .set_<member>(value_to_set): set the value for a type which does not require allocation. This includes scalar fields and enums.
  • .set_allocated_<member>(value_to_set): set the value for a complex type, which requires allocation and setting of its own member values prior to setting in the request or reply. This is for composite/nested types.

The actual implementation for these functions is fairly uncomplicated, even for allocated types, and basically boils down to accessing the value of a pointer at some offset into the object whose member is being retrieved or set. These functions will not be named in a stripped binary, but are easy to spot.

The getters take the request message (in this example, request) as the sole argument, pass it through a couple of nested function calls, and eventually make an access to some offset into the message. Based on the offset, you can determine which field is being accessed, (with the help of the generated pb.h files, generation of which is covered later) and can thus identify the function and its return value.

 




 

The implementation for complex types is similar, adding a small amount of extra code to account for allocation issues.



 

Setter functions follow an almost identical structure, with the only difference being that they take the response message (in this example, reply) as the first argument and the value to set the field to as the second argument. 





And again, the only difference for complex type setters is a bit of extra logic to handle allocation when necessary.

 

Reconstructing Types

The huge amount of automatically generated code used by gRPC is a great annoyance to a prospective reverse engineer, but it can also be a great ally. Because the manner in which the .proto files are integrated into the final binary is uniform, and because the binary must include this information in some form to correctly deserialize incoming messages, it is possible in most cases to extract a complete reconstruction of the original .proto file from any software which uses gRPC for communication, whether that be a client or server.

This can be done manually with some studying up on protobuf Filedescriptors, but more than likely this will not be necessary -- someone has probably already written something to do it for you. For this guide the Protobuf Toolkit (pbtk) will be used, but a more extensive list of available software for extracting .proto structures from gRPC clients and servers will be included in the Tooling section.

Generating .proto Files

By feeding the server binary we are working with into pbtk, the following .proto file is generated.

syntax = "proto3";

package stringmanipulation;

service StringManipulation {
    rpc reverseString(StringRequest) returns (StringReply);
    rpc uppercaseString(StringRequest) returns (StringReply);
}

message innerMessage {
    int32 some_val = 1;
    string some_string = 2;
}

message complexMessage {
    innerMessage some_message = 1;
    testEnumeration innerEnum = 2;
}

message StringRequest {
    complexMessage cm = 1;
    string original = 2;
    int64 timestamp = 3;
    bool testval = 4;
    bool testval2 = 5;
    bool testval3 = 6;
}

message StringReply {
    string result = 4;
    int64 timestamp = 2;
    complexMessage cm = 3;
}

enum testEnumeration {
    ZERO = 0;
    ONE = 1;
    TWO = 2;
    THREE = 3;
    FOUR = 4;
    FIVE = 5;
}

Referring back to the original .proto example at the beginning, we can see this is a perfect match, even preserving order of RPC declarations and message fields. This is important because we can now begin to correlate vtable members with RPCs by name and argument types. However, while we know the types of arguments being passed to each RPC, we do not know how each field is ordered inside the c++ object for each type. Annoyingly, the order of member variables for the generated class for a given type appears to be correlated neither to the order of definition in the .proto file, nor to the field numbers specified.

However, auto-generated code comes to the rescue again. While the order of member variables doe not appear to be tied to the .proto file at all, it does appear to be deterministic, based on analysis of numerous gRPC binaries. protoc uses some consistent metric for ordering the fields when generating the .pb.h header files, which are the source of truth for class/structure layout for the final binary. And conveniently, now that we have possession of a .proto file, we can generate these headers.

Defining Message Structures

The command protoc --cpp_out=. <your_generated_proto_file>.proto will compile the .proto file into the corresponding pb.cc and pb.h files. Here we're interested in the headers. There is quite a bit of cruft to sift through in these files, but the general structure is easy to follow. Each type defined in the .proto file gets defined as a class, which includes all methods and member variables. The member variables are what we are interested in, since we need to know their order and C++ type in order to map out structures for each of them while reverse engineering.

The member variable declarations can be found at the very bottom of the class declaration, under a comment which reads @@protoc_insertion_point(class_scope:<package>.<type name>)

  // @@protoc_insertion_point(class_scope:stringmanipulation.StringRequest)
 private:
  class _Internal;

  template <typename T> friend class ::PROTOBUF_NAMESPACE_ID::Arena::InternalHelper;
  typedef void InternalArenaConstructable_;
  typedef void DestructorSkippable_;
  ::PROTOBUF_NAMESPACE_ID::internal::ArenaStringPtr original_;
  ::stringmanipulation::complexMessage* cm_;
  ::PROTOBUF_NAMESPACE_ID::int64 timestamp_;
  bool testval_;
  bool testval2_;
  bool testval3_;
  mutable ::PROTOBUF_NAMESPACE_ID::internal::CachedSize _cached_size_;
  friend struct ::TableStruct_stringmanipulation_2eproto;

The member fields defined in the .proto file will always start at offset sizeof(size_t) * 2 bytes from the class object, so 8 bytes for 32 bit, and 16 bytes for 64 bit. Thus, for the above class (StringRequest), we can define the following struct for static analysis:

// assuming 64bit architecture, if 32bit pointer sizes will differ
struct StringRequest __packed {
    0x00: uint8_t dontcare[0x10];
    0x10: void *original_string; 
    0x18: struct complexMessage *cm; // This will also need to be defined, 
                                     // the same technique inspecting the pb.h file applies
    0x20: int64_t timestamp;
    0x28: uint8_t testval;
    0x29: uint8_t testval2;
    0x2a: uint8_t testval3;
};

Note: protobuf classes are packed, meaning there is no padding added between members to ensure 4 or 8 byte alignment. For example, in the above structure, the three bools will be found one after another at offsets 0x28, 0x29, and 0x2a, rather than at 0x28, 0x2c, and 0x30 as would be the case with 4 bit aligned padding. Ensure that your reverse engineering tool knows this when defining structs.

Once structures have been correctly defined for each of the types, it becomes quite easy to determine what each function and variable is. Take the first example for the Protobuf Member Accesses section, now updated to accept an argument of type StringRequest:

Its clear now that this function is the getter for the StringRequest.original, a string. Applying this technique to the rest of the RPC, changing function and variable names as necessary, produces fairly easy to follow decomplication:



From here, it is as simple as standard static analysis to look for any vulnerabilities which might be exploited in the server, whether it be in incoming data parsing or something else.

Active Testing

Most of the active testing/dynamic analysis to be performed re: gRPC is fairly self explanatory, and is essentially just fuzzing/communicating over a network protocol. If the .proto files are available (or the server or client binary is available, and thus the .proto files can be generated), they can be provided to a number of existing gRPC tooling to communicate with the server. If no server, client, or .protos are available, it is still possible to reconstruct the .proto to some extend via captured gRPC messages. Resources for various techniques and tools for actively testing a gRPC connection can be found in the Tooling section below.

Tooling

  • Protofuzz - ProtoFuzz is a generic fuzzer for Google’s Protocol Buffers format. Takes a proto specification and outputs mutations based on that specification. Does not actually connect to the gRPC server, just produces the data.

  • Protobuf Toolkit - From the pbtk README:

pbtk (Protobuf toolkit) is a full-fledged set of scripts, accessible through an unified GUI, that provides two main features:

  1. Extracting Protobuf structures from programs, converting them back into readable .protos, supporting various implementations:

    • All the main Java runtimes (base, Lite, Nano, Micro, J2ME), with full Proguard support,
    • Binaries containing embedded reflection metadata (typically C++, sometimes Java and most other bindings),
    • Web applications using the JsProtoUrl runtime.
  2. Editing, replaying and fuzzing data sent to Protobuf network endpoints, through a handy graphical interface that allows you to edit live the fields for a Protobuf message and view the result.

  • grpc-tools/grpc-dump - grpc-dump is a grpc proxy capable of deducing protobuf structure if no .protos are provided. Can be used similarly to mitmdump. grpc-tools includes other useful tools, including the grpc-proxy go library which can be used to write a custom proxy if grpc-dump does not suit the needs of a given test.

  • Online Protobuf Decoder - Will pull apart arbitrary protobuf data (without requiring a schema), displaying the hierarchical content.

  • Awesome gRPC - A curated list of useful resources for gRPC.

Resources

Tuesday, April 6, 2021

Watch Your Step: Research Into the Concrete Effects of Fault Injection on Processor State via Single-Step Debugging

by Ethan Shackelford, Associate Security Consultant at IOActive

Fault injection, also known as glitching, is a technique where some form of interference or invalid state is intentionally introduced into a system in order to alter the behavior of that system. In the context of embedded hardware and electronics generally, there are a number of forms this interference might take. Common methods for fault injection in electronics include:

  • Clock glitching (errant clock edges are forced onto the input clock line of an IC)

  • Voltage fault injection (applying voltages higher or lower than the expected voltage to IC power lines)

  • Electromagnetic glitching (Introducing EM interference)

This article will focus on voltage fault injection, specifically, the introduction of momentary voltages outside of normal operating conditions on the target device's power rails. These momentary pulses or drops in input voltage (glitches) can affect device operation, and are directed with the intention of achieving a particular effect. Commonly desired effects include "corrupting" instructions or memory in the processor and skipping instructions. Previous research has shown that these effects can be predictably achieved [1], as well has provided some explanation as to the EM effects (caused by the glitch) which might be responsible for the various behaviors [2].

However, a gap in published research exists in correlating glitches (and associated EM effects) with concrete changes in state at the processor level (i.e. what exactly occurs in the processor at the moment of a glitch that causes an instruction to be corrupted or skipped, an incorrect branch to be taken, etc.). This article seeks to quantify and qualify the state of a processor before, during, and after an injected fault, and describe discrete changes in markers such as registers including general registers as well as control registers such as $pc and $lr, memory, and others.

 

Past Research and Thanks

Special thanks to the folks at Toothless Consulting, whose excellent series of blog posts [3] were my introduction to fault injection, and the inspiration for this project. Additional thanks to Chris Gerlinsky, whose research into embedded device security and in particular his talk [4] on breaking CRP on the LPC family of chips was an invaluable resource during this project.


Test Setup

The target device chosen for testing was the NXP LPC1343, an ARM Cortex-M3 microcontroller. In order to control the input target voltage and coordinate glitches, the Digilent Arty A7 development board was used, built around the Xilinx Artix 7 FPGA. Custom gateware was developed for the Arty board, in order to facilitate control and triggering of glitches based on a variety of factors. For the purposes of this article, the two main triggers used are a GPIO line which goes high/low synchronized to certain device operations, and SWD signals corresponding to a "step" event. The source code for the FPGA gateware is available here.

In order to switch between the standard voltage level (Vdd) and the glitch voltage level (Vglitch), a Maxim MAX4617 Multiplexer IC was used. It is capable of switching between inputs in as little as 10ns, and is thus suitable for producing a glitch waveform on the LPC 1343 power rails with sufficient accuracy and timing. 




As illustrated in the image above, the Arty A7 monitors a “trigger” line, either a GPIO output from the target or the SWD lines between the target and the debugger, depending on the mode of operation. When the expected condition is met, the A7 will drive the “glitch out” according to a provided waveform specifier, triggering a switch between Vdd and Vglitch via the Power Mux Circuit and feeding that to the target Vcore voltage line. A Segger J-Link was used to provide debug access to the target, and the SWD lines are also fed to the A7 for triggering.

In order to facilitate triggering on arbitrary SWD commands, a barebones SWD receiver was implemented on the A7. The receiver parses SWD transactions sniffed from the bus, and outputs the deserialized header and transaction data, values which can then be compared with a pre-configured target value. This allows for triggering of the glitchOut line based on any SWD data – for example, the S TEP and RESUME transactions, providing a means of timing glitches for single-stepped instructions.


 

Prior to any direct testing of glitches performed while single-stepping instructions, observing glitches during normal operation and the effects they cause is helpful to provide a base understanding, as well as to provide a platform for making assumptions which can be tested later on. To provide an environment for observing the results of glitches of varied form and duration, program execution consists of a simple loop, incrementing and decrementing two variables. At each iteration, the value of each variable is checked against a known target value, and execution will break out of the loop when either one of the conditions is met. Outside of the loop, the values are checked against expected values and those values are transmitted via UART to the attacking PC if they differ.

Binary Ninja reverse engineering software was used to provide a visual representation of the compiled C. Because the assembly presented represents the machine code produced after compiling and linking, we can be sure that it matches the behavior of the processor exactly (ignoring concepts like parallel execution, pipelining etc. for now), and lean on that information when making assumptions about timing and processor behavior with regard to injecting faults. 

 


 

Though simple, this environment provides a number of interesting targets for fault injection. Contained in the loop are memory access instructions (LDR, STR), arithmetic operations (ADDS, SUBS), comparisons, and branching operations. Additionally, the pulse of PIO2_6 provides a trigger for the glitchOut signal from the FPGA – depending on the delay applied to that signal, different areas/instructions in the overall loop may be targeted. By tracing the power consumption of the ARM core with a shunt resistor and transmission line probe, execution can be visualized. 

The following waveform shows the GPIO trigger line (blue), and the power trace coming from the LPC (purple). The GPIO line goes high for one cycle then low, signaling the start of the loop. What follows is a pattern which repeats 16 times, representing the 16 iterations of the loop. This is bounded on either side by the power trace corresponding to the code responsible for writing data to the UART, and branching back to the start of the main loop, which is fairly uniform. 

 

 

 

We now have: 

  1. A reference of the actual instructions being executed by the processor (the disassembly via Binary Ninja)
  2. A visual representation of that execution, viewable in real time as the processor executes (via the power trace)
  3. A means of taking action within the system under test which can be calibrated based on the behavior of the processor (the FPGA glitcher).

Using the above information, it is possible to vary the offset of the glitch from the trigger, and (roughly) correlate that timing to a given instruction or group of instructions being executed. For example, by triggering a glitch sometime during the sixth repetition of the pattern on the power trace, we can observe that that portion of the power trace appears to be cut off early, and the values reported over UART by the target reflect some kind of misbehavior or corruption during the sixth iteration of the loop.

 

 



So far, the methodology employed has been in line with traditional fault injection parameter search techniques – optimize for visibility into a system to determine the most effective timing and glitch duration using some behavior baked into device operation (here, a GPIO line pulsing). While this provides coarse insight into the effects of a successfully injected fault (for the above example we can make the assumption that an operation at some point during the sixth iteration of the loop was altered, any more specificity is just speculation), it may have been a skipped load instruction, a corrupted store, or a flipped compare among many other possibilities.

To illustrate this point, the following is the parsed, sorted, and counted output of the UART traffic from the target device, after running the glitch for a few thousand iterations of the outer loop. The glitch delay and duration remained constant, but resulted in a fairly wide spread of discreet effects on the state of the variables at the end of the loop. Some entries are easy to reason about, such as the first and most common result: B is the expected value after six iterations (16 - 6 = 10), but A is 16, and thus a skipped LDR or STR instruction may have left the value 16 in the register placed there by previous operations. However, other results are harder to reason about, such as the entries containing ascii text, or entries where the variable with the incorrect value doesn't appear to correlate to the iteration number of the loop.




This level of vagueness is acceptable in some applications of fault injection, such as breaking out of an infinite loop as is sometimes seen in secure boot bypass techniques. However, for more complex attacks where a particular operation needs to be corrupted in just the right way greater specificity, and thus a more granular understanding, is a necessity. 

And so what follows is the novel portion of the research conducted for this article: creating a methodology for targeting fault injection attacks to single instructions, leveraging debug interfaces such as SWD/JTAG for instruction isolation and timing. In addition to the research value offered by this work, the developed methodology may also have practical applications under certain, not uncommon circumstances regarding devices in the wild as well, which will be discussed in a later section. 

 

A (Very) Quick Rundown of the SWD protocol

SWD is a debugging protocol developed by ARM and used for debugging many devices, including the Cortex-M3 core in the LPC 1343 target board. From the ARM Debug Interface Architecture Specification ADIv5.0 to ADIv5.2

The Arm SWD interface uses a single bidirectional data connection and a separate clock to transfer data synchronously. An operation on the wire consists of two or three phases: packet request, acknowledgement response, and data transfer.

Of course, there's more to it than that, but for the purposes of this article all we're really interested in is the data transfer, thanks to a quirk of Cortex-M3 debugging registers: halting, stepping, and continuing execution are all managed by writes to the Debug Halting Control and Status Register (DHCSR). Additionally, writes to this register are always prefixed with 0xA05F, and only the low 4 bits are used to control the debug state -- [MASKINTS, STEP, HALT, DEBUGEN] from high to low. So we can track STEP and RESUME actions by looking for SWD write transaction with the data 0xA05F0001 (RESUME) and 0xA05F000D (STEP).

 



Because of the aforementioned bidirectionality of the protocol, it isn't as easy as just matching a bit pattern: based on whether a read or write transaction is taking place, and which phase is currently underway, data may be valid on either clock edge. Beyond that, there are also turnaround periods that may or may not be inserted between phases, depending on the transaction. The simplest solution turned out to be just implementing half of the protocol, and discarding the irrelevant portions keeping only the data for comparison. The following is a Vivado ILA trace of the-little-SWD-implementation-that-could successfully parsing the STEP transaction sniffed from the SWD lines.

 


Isolating Instructions

So, by single stepping an instruction and sniffing the SWD lines from the A7, it is possible to trigger a glitch the instant (or very close to, within 10ns) the data is latched by the target board's debug machinery. Importantly, because the target requires a few trailing SWCLK cycles to complete whatever actions the debug probe requires of it, there is plenty of wiggle room between the data being latched and the actual execution of the instruction. And indeed, thanks to the power trace, there is a clear indication of the start of processor activity after the SWD transaction completes.





As can be seen above, there is a delay of somewhere in the neighborhood of 4us, an eternity at the 100MHz of the A7. By delaying the glitch to various offsets into the "bump" corresponding to instruction execution, we can finally do what we came here to do: glitch a single-stepping processor.

In order to produce a result more interesting than "look, it works!" a simple script was written to manage the behavior of the debugger/processor via OpenOCD. The script has two modes: a "fast" mode, which single steps as fast as the debugger can keep up with used for finding the correct timing and waveform for glitches, and a (painfully) "slow" mode, which inspects registers and the stack before and after each glitch event, highlighting any unexpected behavior for perusal. Almost immediately, we can see some interesting results glitching a load register instruction in the middle of the innermost loop -- in this case a LDR r3, [sp] which loads the previous value of the A variable into r3, to be incremented in the next instruction.

 




We can see that nothing has changed, suggesting that the operations simply didn't occur or finish -- a skipped instruction. This reliably leads to an off-by-one discrepancy in the UART output from the device: either A/B ends up 1 less/greater than it should be at the end of the loop, because one of the inc/dec operations was acting on data which is not actually associated with the state of the A variable.

Interestingly, this research shows that the effectiveness of fault injection is not limited only to instructions which access memory (LDR, STR, etc.), but can also be used to affect the execution of arithmetic operations, such as ADDS and CMP, or even branch instructions (though whether the instructions themselves are being corrupted or if the corruption is occurring on the ASPR by which branches are decided requires further study). In fact, no instruction tested for this article proved impervious to single-step-glitching, though the rate of success did vary depending on the instruction.

 

 


 

We see here the CMP instruction which determines whether or not A matches the expected 0x10 being targeted. We see that the xPSR is not updated (meaning the zero flag is not set and as far as the processor is concerned, the CMP'd values did not match, and so the values of A and B are sent via UART. However, because it was the CMP instruction itself being glitched, the reported values are the correct 0x10 and 0. Interestingly, we see that r1 has been updated to 0x10, the same immediate value used in the original CMP. Referring to the ARMv7 Architecture Reference Manual, the machine code for CMP r3, 0x10 should be 0x102b. Considering possible explanations for the observed behavior, one might consider an instruction like LDR or MOVS, which could have moved the value into the r1 register. And as it turns out, the machine code for MOVS r1, 0x10 is 0x1021, not too many bits away from the original 0x102b!

While that isn't the definitive answer as to cause for the observed behavior, its a guess well beyond the level of information available via power trace analysis and similar techniques alone. And if it is correct, we not only know what generally occurred to cause this behavior, but can even see which bits specifically in the instruction were flipped for a given glitch delay/duration.

Including all the script output for every instruction type in this article is a bit impractical, but for the curious the logs detailing registers/stack before and after each successful glitch for each instruction type will be made available in the git repo hosting the glitcher code. 


Practical Applications

I know what you're thinking. 

"If you have access to a device via JTAG/SWD debugger, why fuss with all the fault injection stuff? You can make the device do anything you want! In fact, I recently read a great blog post where I learned how to take advantage of an open JTAG interface!"

However, there is a very common configuration for embedded devices in the wild to which the research presented here could prove useful. Many devices, including the STM32 series (such as the DUT for this article), implement a sort of "high but not the highest possible" security mode, which allows for limited debugging capabilities, but prevents reads and writes to certain areas of memory, rendering the bulk of techniques for leveraging an open JTAG connection ineffective. This is chosen over the more secure option of disabling debugging entirely because the latter leaves no option for fixing or updating device firmware (without a custom bootloader), and many OEMs may choose to err towards serviceability rather than security. In most such implementations though, single stepping is still permitted!

In such a scenario, aided by a copy of device firmware, a probing setup analogous the one described here, or both, it may be possible to render an otherwise time-consuming and tedious attack nearly trivial, stripping away all the calibration and timing parameterization normally required for fault injection attacks. Need to bypass secure boot on a partially locked down device? No problem, just break on the CMP that checks the return value of is_secureboot_enabled().

 

Future Research

Further research is required to really categorize the applicability of this methodology during live testing, but the initial results do seems promising. Further testing will likely be performed on more realistic/practical device firmware, such as the previously mentioned secure boot scenario.

Additionally and more immediately, part two of this series of blog posts will continue to focus on developing a better understanding of what happens within an integrated circuit, and in particular a complex IC such as a CPU, when subjected to fault injection attacks. I have been putting together an 8-bit CPU out of 74 series discreet components in my spare time over the last few months and once complete it will make the perfect target for this research: the clock is controllable/steppable externally, and each individual module (the bus, ALU, registers, etc.) are accessible by standard oscilloscope probes and other equipment. 

This should allow for incredibly close examination of system state under a variety of conditions, and make transitory issues caused by faults which are otherwise difficult to observe (for example an injected fault interfering with the input lines of the ALU but not the actual input registers) quite clear to see.

Stay tuned!

 

Video Demonstration

 


References

[1] J. Gratchoff, "Proving the wild jungle jump," University of Amsterdam, Jul. 2015

[2] Y. Lu, "Injecting Software Vulnerabilities with Voltage Glitching," Feb. 2019

[3] D. Nedospasov, "NXP LPC1343 Bootloader Bypass,"

Breaking Code Read Protection on the NXP LPC-family Microcontrollers," Jan. 2017, https://recon.cx/2017/brussels/talks/breaking_crp_on_nxp.html

[5] A. Barenghi, G. Bertoni, E. Parrinello, G. Pelosi, "Low Voltage Fault Attacks on the RSA Cryptosystem," 2009