Motivated by this paper on Control Flow Flattening (CFF) deobfuscation via LLM, I decided to explore the topic with current frontier models. The paper does not provide code but outlines the algorithm for the chain-of-thought methodology. However, the paper feeds the LLM models with LLVM-IR or obfuscated source code. As the authors acknowledge, this approach is unrealistic, since real-world engineering tasks do not have access to this information. For this article, we will only use machine code or decompiled pseudocode generated by the Hex-Rays decompiler.

For the initial test, I used a SquidLoader sample from this previous blogpost (SHA256: 914b1b3180e7ec1980d0bafe6fa36daade752bb26aec572399d2f59436eaa635) that features CFF-obfuscated code after peeling the initial packing layer.

System setup

The model is given access to the binary via an IDA Pro MCP, which lets it request information needed to resolve opaque CFF predicates that may rely on data not directly contained in the function itself. The system is model-agnostic to allow output comparison across different frontier models.

Target CFF analysis

SquidLoader features CFF obfuscation that should not be too challenging for an LLM to analyze, since the dispatcher’s control flow is mediated through a series of signed and unsigned integer comparisons. The obfuscated functions are self-contained: the state variable is initialized at the start of the function, and each branch concludes by assigning a new integer value that selects the next branch. There is therefore no need to obtain information external to the function to undo the CFF.

The following is an example of the CFF-obfuscated start:

CFF deobfuscation

System prompt

The LLM is given a system prompt based on the previously mentioned paper.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
You are a senior reverse engineer. Perform the numbered pipeline in ONE JSON response (no markdown):

**Step 1 — CFF analysis and structural recovery**

**1a — CFF detection:** Decide if the C-like decompilation shows **control-flow flattening (CFF)** —
typically a centralized dispatcher loop, explicit state variable, switch/cascade on state encoding original basic
blocks merged into one flat structure. Fill the `detection` object. Be conservative with `confidence: "high"` —
reserve it for unmistakable flattened dispatch patterns.

**1b — Structural deobfuscation (conditional):** If and ONLY if `detection.cff_detected` is true AND `detection.confidence`
is exactly `"high"`, recover readable control flow inside `deobfuscation` using the methodology below.
If the verdict is not high-confidence CFF, set `deobfuscation` to null and do not invent `new_code`.

Structural deobfuscation methodology for **1b** — follow an explicit phased chain-of-thought (do not hallucinate plausible logic).
Phases aligned with Algorithm 1 style recovery:
1) Identify dispatcher/state variable σ and flattened cases.
2) Rebuild a directed state-transition graph from obfuscated successors and guards φ.
3) Reconstruct readable control-flow (loops, conditionals); remove bogus/unreachable tails when PROVEN.
4) Eliminate opaque predicates ONLY when invariant outcome is logically certain for all inputs reaching that code.
5) Remove dead dispatcher artifacts and simplify; produce normalized C-looking pseudocode.

I also included instructions to mitigate hallucinations and fabricated outputs. Due to the nature of LLMs, correct output cannot be guaranteed.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
**Undecodable material stays verbatim:**
If you **cannot** fully derive bytes/strings/constants from what is shown, **leave the original obfuscation in place** — the same numeric immediates,
the same operations, registers/buffers, and call shapes — instead of substituting a guess.
Never invent plausible API names, paths, URLs, or command strings from "what loaders usually do," malware tropes, ordering hints, or "usual loader" narratives.

**No context guessing (absolute):**
Do **not** invent or infer **any** fact — identifiers, strings, branches, memory effects, calling conventions beyond the code — from stereotypes,
priors, or narrative convenience. Only conclusions anchored in the decompilation **plus** explicitly permitted MCP tool output (or the authoritative pre-decoded SIMD block above) are allowed.
This binds **`new_code`, `notes`, `chain_of_thought`, and confidence**: misleading readability is worse than cluttered fidelity.

**Correctness overrides readability:**
The consumer is malware/analysis tooling — **correctness is paramount**, ahead of clean prose or plausible narratives.
Prefer cluttered-but-faithful code, refusal, lower `semantic_confidence`, or `success: false` over an elegant hallucination.
Semantic hallucinations (wrong APIs/strings, invented control flow, reordered side effects, guessed decryption output) are unacceptable.

If predicates depend on unknowable runtime data and cannot be reduced using decompilation PLUS such context,
list them under `deobfuscation.unresolved_predicates` and set `deobfuscation.success` false.

If unsure, use `deobfuscation.semantic_confidence: "medium"` or `"low"` AND `deobfuscation.success: false`.

start

Running the script against the start function produces the following output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
__int64 start()
{
  __int64 result;
  __int64 v2;
  _BYTE v3[48];
  __int64 v4;
  void (__fastcall *v5)(_QWORD, void *, void *, __int64);

  sub_7FF7C341E387(v3, 744);
  v2 = v4;

  if (v2 != 0)
  {
    // Path for valid context
    sub_7FF7C341E387(v3, 744);
    sub_7FF7C3421CE5(v3);
    sub_7FF7C3421CFC(v3);
    sub_7FF7C341DF3D(v3);

    if ((unsigned __int8)sub_7FF7C341F73B(v3) != 0)
    {
      // Execution flow if sub_7FF7C341F73B succeeded
      v5(0, &unk_7FF7C3424C55, &unk_7FF7C3424C15, 16);
      sub_7FF7C34233EE(v3);
      sub_7FF7C342383C(v3);
    }

    // Mandatory cleanup sequence observed in the dispatch transition to terminal
    sub_7FF7C3425989(v3);
    sub_7FF7C34234B3(v3);
  }

  result = 2431469957LL;
  return result;
}

The if and while statements that drove control flow have been completely removed, substantially simplifying the function. This function’s flow was not particularly complex, but it serves as a good starting point. The output was validated with a debugger and is correct.

Other functions

The following function shows a larger and more involved obfuscation. It also features XOR-encrypted stack strings:

The model was also instructed to deobfuscate the XOR-encrypted stack strings:

1
2
3
4
**Step 3 — String and data materialization** (same scope as Step 2): Hunt for **any** pattern that hides strings or payloads in stack/locals
(constant-fed XOR / SIMD such as `_mm_xor_ps` on `__m128` chunks, rolling XOR, substituted alphabets, RC4-like updates, split wide-character builders,
arithmetic-encoded bytes, other sample-specific schemes). **Where plaintext or equivalent structure follows by strict static reasoning on
the given decompilation (and optional whitelisted MCP results only)** — simplify it (e.g. clean string literal or clearer data).

All tested models failed to decode these strings, or worse, produced hallucinations based on surrounding context. This may be a prompting issue: most models were able to decode the XOR-encrypted strings correctly when given the task in isolation. After several prompt variations, the models continued to fail, so I opted to pre-process the pseudocode for _mm_xor_ps blocks, decode the data programmatically, and inject the results back into the pseudocode. With those changes, the following output was produced:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
// Generated with Gemini 3.1 Flash
__int64 __fastcall sub_7FF7C341BC96(__int64 *a1)
{
  void (__fastcall *v3)(__m128 *, __int64 *, __int64);
  unsigned int v4;
  void (__fastcall *v5)(__m128 *, char *, __int64);
  void (__fastcall *v6)(__m128 *, __int64 *, __int64);
  void (__fastcall *v7)(__m128 *, char *, __int64);
  __int64 v8;
  int v15;
  __m128 v17;
  __m128 v18;
  __m128 v19;

  a1[131] = sub_7FF7C3421CD3();
  sub_7FF7C341E387(a1 + 1, 260);
  sub_7FF7C341E387((char *)a1 + 268, 260);
  sub_7FF7C341E387(a1 + 66, 260);
  sub_7FF7C341E387((char *)a1 + 788, 260);

  /* Resolve and Expand %ProgramFiles% */
  v3 = *(void (__fastcall **)(__m128 *, __int64 *, __int64))(a1[131] + 544);
  v17 = "%ProgramFiles%"; /* materialized from SIMD-XOR */
  v3(&v17, a1 + 1, 260);

  /* Resolve and Expand %ProgramFiles(x86)% */
  v5 = *(void (__fastcall **)(__m128 *, char *, __int64))(a1[131] + 544);
  v17 = "%ProgramFiles(x8"; /* materialized from SIMD-XOR */
  v18 = "6)%"; /* materialized from SIMD-XOR */
  v5(&v17, (char *)a1 + 268, 260);

  /* Resolve and Expand %ProgramW6432% */
  v6 = *(void (__fastcall **)(__m128 *, __int64 *, __int64))(a1[131] + 544);
  v17 = "%ProgramW6432%"; /* materialized from SIMD-XOR */
  v6(&v17, a1 + 66, 260);

  /* Resolve and Get USERNAME variable */
  v7 = *(void (__fastcall **)(__m128 *, char *, __int64))(a1[131] + 552);
  v17 = "USERNAME"; /* materialized from SIMD-XOR */
  v7(&v17, (char *)a1 + 788, 260);

  v19.m128_i32[0] = 4096;
  v8 = (*(__int64 (__fastcall **)(void *, __int64, __int64))(a1[131] + 264))(
         NtCurrentTeb()->ProcessEnvironmentBlock->ProcessHeap,
         8,
         4096);
  *a1 = v8;

  /* Execution of data fetch procedure */
  v15 = sub_7FF7C34219DD(*(_QWORD *)(a1[131] + 432), 5, v8, 4096, (__int64)&v19);

  if ( v15 >= 0 )
  {
    v4 = 1;
  }
  else if ( v15 == -1073741820 ) /* STATUS_BUFFER_TOO_SMALL retry logic */
  {
    v8 = (*(__int64 (__fastcall **)(void *, __int64, __int64, _QWORD))(a1[131] + 568))(
                  NtCurrentTeb()->ProcessEnvironmentBlock->ProcessHeap,
                  8,
                  *a1,
                  v19.m128_u32[0]);
    *a1 = v8;

    if ( (int)sub_7FF7C34219DD(*(_QWORD *)(a1[131] + 432), 5, v8, v19.m128_i32[0], (__int64)&v19) >= 0 )
      v4 = 1;
    else
      v4 = 0;
  }
  else
  {
    v4 = 0;
  }

  return v4 & 1;
}

The model also added comments identifying the called functions. Debugger validation confirms both the program flow and the comments are correct.

The comments were likely inferred from context given the arguments passed to each function. The model’s reasoning output includes this:

1
2
3
4
5
6
**Validating String Expansion**

I've pinpointed the exact calls after `sub_7FF7C3421CD3` that handle environment variable expansion.
I'm confirming that the `v3`, `v5`, `v6`, and `v7` function pointers are correctly dereferenced to build the `%ProgramFiles%`, `%ProgramFiles(x86)%`, `%ProgramW6432%`,
and `USERNAME` strings using the `_mm_xor_ps` instructions and passing the resulting values.
I've confirmed that the expanded paths are passed to the resource loader.

The reasoning claims to have “confirmed” how v3v7 handle variable expansion without citing any evidence for that conclusion. The correct way to verify this would be to find where a1[131] + 544 is initialized and check what function pointer is assigned there. No additional MCP queries were issued for this request, so no such verification took place. The inference happened to be correct, but it was still a guess.

Other model outputs for comparison:

OLLVM

The SquidLoader examples are relatively straightforward, and all models were able to undo the obfuscation without major challenges. I therefore decided to also test against an open-source obfuscation framework: Obfuscator LLVM (v16).

For the test, I wrote a small C program that establishes a TCP connection to a host:

Compiling that program with OLLVM produces a 2.2 MB binary. Notably, OLLVM does not obfuscate imports or strings by default, so it is easy to identify the original main function, in this case sub_140001000. The function is a nightmare at 1,459,928 bytes with 105 basic blocks and 276 edges, it would make any reverse engineer shiver. Previous deobfuscation tests used Hex-Rays decompiler output, but Hex-Rays cannot decompile this function, as it fails with Decompilation failure: too big function. The code therefore needed to be adjusted to also handle raw assembly output. In addition, the assembly listing is too large to fit in a single request, so the model is instructed to explore the function in chunks via MCP tool calls.

Running the script with Gemini 3.1 Flash targeting that function yields the following output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
int sub_140001000()
{
    WSAData WSAData;
    int var_204;
    int var_1FC;
    int var_200;
    SOCKET s;
    struct sockaddr name;
    int var_1F4;

    var_200 = 0;
    // Initialize Winsock
    var_204 = WSAStartup(0x0202, &WSAData);
    var_1FC = var_204;

    // State check for WSAStartup (State: 0xE43BC6B6)
    if (var_1FC != 0)
    {
        // Error Path (State: 0xC666A290 -> loc_1401654EA)
        sub_1401656E0(__imp___acrt_iob_func(2), "WSAStartup failed with error: %d", var_204);
        var_200 = 1;
    }
    else
    {
        // Success Path (State: 0x4EFA3167 -> loc_14016552F)
        s = socket(2, 1, 6); /* AF_INET, SOCK_STREAM, IPPROTO_TCP */
        
        if (s == (SOCKET)-1)
        {
            // Socket error (State: 0xD449B20 -> loc_14016555D)
            sub_1401656E0(__imp___acrt_iob_func(2), "Socket creation failed with error: %d", WSAGetLastError());
            WSACleanup();
            var_200 = 1;
        }
        else
        {
            // Address Setup (State: 0x8BA349BB -> loc_1401655A7)
            name.sa_family = 2; /* AF_INET */
            *(unsigned short *)name.sa_data = htons(8080);
            
            if (inet_pton(2, "127.0.0.1", &name.sa_data[2]) <= 0)
            {
                // Invalid Address (State: 0xE5B6EB47 -> loc_1401655F2)
                sub_1401656E0(__imp___acrt_iob_func(2), "Invalid address or address family not supported.");
                closesocket(s);
                WSACleanup();
                var_200 = 1;
            }
            else
            {
                // Connect (State: 0x1E6C06F1 -> loc_14016563B)
                sub_1401B1C00("Connecting to %s:%d...", "127.0.0.1", 8080);
                
                if (connect(s, &name, 16) == -1)
                {
                    // Connection Failed (Inferred from strings)
                    sub_1401656E0(__imp___acrt_iob_func(2), "Connection failed with error: %d", WSAGetLastError());
                    closesocket(s);
                    WSACleanup();
                    var_200 = 1;
                }
                else
                {
                    // Connection Success (State: 0x4BA0933 -> loc_140165680)
                    sub_1401B1C00("Successfully connected to %s:%d!", "127.0.0.1", 8080);
                    closesocket(s);
                    WSACleanup();
                    var_200 = 0;
                }
            }
        }
    }

    var_1F4 = var_200;
    // Function Cleanup (State: 0x238D092B -> loc_1401654B7)
    return var_1F4;
}

The generated output closely matches the original source code, which is genuinely impressive. For a function of this size, the model produced a result in just 2m10s.

The model also included these notes:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
[detection confidence=high] High-density Control Flow Flattening (CFF) is present, characterized by a central state-variable dispatcher (var_214/var_218) initialized to 0xE43BC6B6h. The function uses constant-state comparisons and jumping to a collection of handlers, each separated by massive Mixed Boolean-Arithmetic (MBA) and junk-math blocks (e.g., the block starting at 0x1400013A0 contains over 700 instructions to simply resolve a state transition).

Dispatcher hints:
- State variable: [rsp+4D8h+var_214]
- Initial state: 0xE43BC6B6h
- Dispatch Loop: Check sequence starting at 0x140001063
- Opaque Predicate/MBA blocks: Heavy use of redundant arithmetic (add/sub/xor/imul) to obfuscate branch conditions.

The function implements a standard network client. It initializes Winsock, creates an AF_INET TCP socket, and attempts to connect to 127.0.0.1 on port 8080. Flattening was removed by tracing the state constants (e.g., 0xE43BC6B6, 0xCDC036E2, 0x1E6C06F1) to their respective API calls. Massive MBA/junk instructions between the actual logic blocks were eliminated as they serve no semantic purpose beyond obfuscating the state transition. Identifier 's' and 'name' are preserved as labeled in the stack disassembly.

CoT (LLM):
1. Identify the state variable var_214 and the entry state 0xE43BC6B6.
2. Trace 0xE43BC6B6 to loc_14000137D, which evaluates the result of WSAStartup (var_1FC).
3. If WSAStartup failed (!= 0), transition to 0xC666A290 (loc_1400013A0), which contains MBA junk leading to the error message 'WSAStartup failed'.
4. If WSAStartup succeeded (== 0), transition to 0x4EFA3167 (loc_1400293EE), leading to socket creation (loc_14016552F).
5. Subsequent states track socket success (loc_1401655A7), address parsing with inet_pton (loc_1401655d5), and the connection attempt (loc_14016563B).
6. Terminating states (loc_140165680, loc_14016555D) perform closesocket and WSACleanup before returning via var_1F4.

Code generated by Claude 4.5 Haiku for comparison:

Model selection

Model selection balances response correctness, speed, and cost. The following models were tested:

  • DeepSeek v4 Flash & v4 Pro
  • Gemini 3.1 Flash & 3.1 Pro
  • Claude 4.5 Haiku & 4.6 Sonnet

Results:

Deobfuscation success (True / False)

  v4-flash v4-pro gemini-flash gemini-pro 4.5-haiku 4.6-sonnet
start T T T T T T
sub_7FF7C341BC96 T T T T T T
OLLVM F* F* T - T -

Failed to produce a result due to validation errors after 2 attempts; however, the model’s reasoning appeared to track the function’s logic correctly.

Deobfuscation time

  v4-flash v4-pro gemini-flash gemini-pro 4.5-haiku 4.6-sonnet
start - - - - - -
sub_7FF7C341BC96 1m58s 6m55s 1m20s - - -
OLLVM 1m24s 3m35s 2m10s - 3m38s -

Execution cost

  v4-flash v4-pro gemini-flash gemini-pro 4.5-haiku 4.6-sonnet
start - - - - - -
sub_7FF7C341BC96 ~$0† $0.12 $0.05 - - -
OLLVM $0.01 $0.16 $0.09 - $0.39 -

Prices tracked via OpenRouter.

† Consumption was too low to be tracked.

Future improvements

Integrating emulation-mcp as a tool available to the model would help with in-stack string decoding. The static pre-processing approach works, but it requires a separate pass and only covers patterns the preprocessor knows about. Giving the model access to an emulation engine would let it resolve arbitrary in-stack encodings at query time, without custom preprocessing for each scheme.

A dedicated math MCP would also help. LLMs are notoriously unreliable at arithmetic, which matters a lot here: resolving opaque integer predicates, verifying state-transition constants, and validating branch conditions. Offloading those computations to a reliable evaluator would reduce errors in the state-transition graph reconstruction phase and make the model’s reasoning easier to audit.

Conclusions

Current frontier LLMs can perform meaningful CFF deobfuscation on realistic inputs (decompiled pseudocode rather than LLVM-IR) at a cost and speed that makes the approach practical for routine analysis work.

For simpler obfuscation like SquidLoader, every tested model succeeded. The OLLVM test was only run against Gemini 3.1 Flash and Claude 4.5 Haiku: once both cheaper models produced correct output, there was no compelling reason to run the pro variants and pay 3-4x more for the same result. That said, they should work fine given they already succeeded on the less complex cases. Gemini 3.1 Flash is the best pick for most tasks, with the best balance of correctness, speed, and cost across the board.

LLMs are a genuine productivity multiplier for this class of reversing task. The source code for the tooling used in this article is available on GitHub.