Auditing Coinswap with Loupe: What an AI Security Scanner Produced in Practice

In May 2026, I was working on quality assurance for Coinswap when Spiral announced Loupe, so I used it to run an AI-assisted security scan on the project. Coinswap was a good target: it is large enough and security-sensitive enough to be a meaningful test.

This article shares my experience with Loupe. It covers the setup, whether the findings were helpful, and how much time and model quota it took to scan Coinswap's repository, which has more than 75 Rust source files.

I am going to answer these questions, but first, a short introduction to Loupe.

What Loupe is, and why I tried it on Coinswap

Loupe is an open-source security scanning harness for source repositories. It is built around three main pieces:

loupe-server, which stores repositories, jobs, findings, secrets, and scan state
loupe-worker, which checks out the target repository and runs the configured scanners
loupectl, which is the operator CLI used to register repositories, register workers, trigger scans, and inspect findings

Loupe's architecture document explains the system in more detail. The short version is that Loupe runs a server and one or more workers. The server keeps repository state and findings. Workers check out repositories, run the configured scanners, and submit results back over mutual TLS.

Spiral introduced Loupe as an AI-powered vulnerability scanning effort for open-source Bitcoin projects. The motivation is important: maintainers of open-source Bitcoin software should have access to strong automated review loops to identify attack vectors more productively.

Setting up Loupe and running the Coinswap scan

The setup was manageable, but it was not a one-command experience.

As mentioned earlier, Loupe has several moving parts. You need the server and worker running. You also need certificates, a database key, worker registration, model CLI authentication, repository registration, scan configuration, and a way to export findings. Each piece has a clear purpose, but the first run still requires care.

The local setup roughly followed this flow:

loupe-server init --data-dir /path/to/loupe-data --hostname localhost

loupe-server serve \
  --db /path/to/loupe-data/loupe.sqlite \
  --server-cert /path/to/loupe-data/server.pem \
  --server-key /path/to/loupe-data/server.key \
  --ca-cert /path/to/loupe-data/ca.pem

loupectl worker register --name local-worker --out worker.json

loupe-worker --config worker.config.toml

loupectl repo add \
  --clone-url https://github.com/citadel-tech/coinswap.git \
  --no-reporting \
  --verification-enabled

loupectl repo scan <repo-id>

I used scan-only mode first. I think that was the right decision. Opening public issues before triage would have moved unreviewed scanner output directly into the project tracker. Keeping the first pass local made it possible to deduplicate reports, check reachability, and only promote the useful findings.

Once the server and worker were healthy, starting another scan was straightforward. Most of the friction came from bootstrapping. I had to keep paths, certificates, credentials, and long-running processes aligned.

My take: the setup is reasonable for people comfortable with local services and command-line tooling. For first-time users, a smoother single interface would make the experience much better.

The Coinswap run produced 72 unique candidate findings. At first, that number looked exciting. After triage, the number became less important. Several reports were duplicates. Some described a real missing check but not a reachable issue. A few did not hold up once I traced the surrounding code. The useful reports were the ones that gave me a failing test, concrete evidence of an invariant failure, or a realistic attack path.
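The deduplication step described above can be sketched as a simple grouping pass. This is a minimal illustration, not Loupe's export format: the `Finding` fields, the file paths, and the titles are all hypothetical, and grouping by file plus normalized title is only one possible heuristic for spotting overlapping reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; a real export may be shaped differently.
    finding_id: str
    file: str
    title: str
    severity: str


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())


def group_duplicates(findings):
    """Group findings that share a file and a normalized title."""
    groups = {}
    for finding in findings:
        key = (finding.file, normalize(finding.title))
        groups.setdefault(key, []).append(finding)
    return groups


# Made-up sample data, for illustration only.
findings = [
    Finding("F-1", "maker/example.rs", "Missing contract validation", "high"),
    Finding("F-2", "maker/example.rs", "missing  contract validation", "medium"),
    Finding("F-3", "taker/example.rs", "Missing contract validation", "high"),
]

groups = group_duplicates(findings)
for (file, title), members in groups.items():
    print(file, "|", title, "|", [m.finding_id for m in members])
```

A grouping like this only narrows the list. Reports that describe the same root cause in different files still need a manual read to merge.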

The exported artifacts are available here:

Loupe findings for Coinswap

For model usage, I do not have a provider token export. My operational estimate is roughly five to six model allowance windows of five hours each. That is not an exact token count. It means the scan consumed meaningful quota, paused, and resumed across multiple allowance windows.

The target snapshot had more than 75 Rust source files, so this was not surprising to me. Loupe was running model-backed review across a full repository, not asking a model to inspect a single file.

Reading Loupe's output: structure, triage, and validation

What mattered was that every report gave me a concrete place to start.

Each report usually included:

severity
affected file and lines
explanation of the issue
proof-of-concept or test idea

These details made it easier for me to create regression tests and reproduce the behavior described by the finding. Loupe was also useful for repetitive invariant hunting. Once it identified a missing validation pattern in one place, it often looked for similar gaps elsewhere. For example, it found some missing contract validation checks in the maker module, and it performed a similar scan in the taker module as well.

The reports were not confirmed vulnerabilities by default. Triage was still needed to confirm which ones were valid findings. The severity labels were often less useful than the actual attack path. A high or medium label helped with sorting. It did not answer the real question: can this be reached, and does it matter in the protocol flow?

For this audit on Coinswap, I prioritized:

whether an untrusted party could actually use this attack vector and gain an advantage
whether the report described a missing invariant or only a suspicious pattern
whether the proposed proof made sense and seemed realistic
whether a regression test could demonstrate the behavior
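The four criteria above can be written down as a small triage record with a sort key. This is a sketch under stated assumptions: the `TriageNote` fields and the weights are mine, chosen only to show reachability outranking the scanner's severity label, and they are not part of Loupe.

```python
from dataclasses import dataclass


@dataclass
class TriageNote:
    finding_id: str
    scanner_severity: str          # label from the scanner, kept for reference
    reachable_by_untrusted: bool   # can an untrusted party use it and gain an advantage?
    missing_invariant: bool        # a real missing invariant, not just a suspicious pattern
    proof_is_realistic: bool       # does the proposed proof hold up?
    has_regression_test: bool      # can a test demonstrate the behavior?


def triage_score(note: TriageNote) -> int:
    """Weights are arbitrary assumptions; severity is deliberately not scored."""
    return (
        3 * note.reachable_by_untrusted
        + 2 * note.has_regression_test
        + 1 * note.missing_invariant
        + 1 * note.proof_is_realistic
    )


# Made-up notes, for illustration only.
notes = [
    TriageNote("A", "high", False, True, False, False),
    TriageNote("B", "medium", True, True, True, True),
    TriageNote("C", "low", True, False, True, False),
]

ranked = sorted(notes, key=triage_score, reverse=True)
for note in ranked:
    print(note.finding_id, note.scanner_severity, triage_score(note))
```

In this made-up sample, the report labeled high ranks last because it is not reachable, which mirrors how the severity label helped with sorting but did not settle priority.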

The generated patches were starting points, not ready-made fixes. Some of those patches were useful to me, while others were not suitable for direct use.

What the Loupe audit produced in practice

Most of the work began after the scan finished. The real outcome was converting selected findings into actual work.

Several Loupe-assisted fixes were merged:

coinswap#901, verifying the legacy maker funding output before the taker signs
coinswap#881, hardening maker-side Taproot contract validation
coinswap#886, hardening taker-side Taproot contract validation
coinswap#879, preventing maker reboot recovery from discarding funded swapcoins
coinswap#884, validating the fidelity bond amount against the actual chain output

Loupe also helped create follow-up issues:

coinswap#882, around replaceable or mempool-only incoming contracts
coinswap#906, around trusting a peer-supplied confirmation height
coinswap#877, a backlog item for cookie-based authentication on the maker RPC control plane

Pain points, and what I would change next time

Candidate findings require real triage. Some reports overlapped. Some were variations of the same missing validation. Some had lower impact after reading the surrounding code. A few needed stronger proof before they could become issues.

Operational overhead was another pain point. Long scans, quota waits, report export, and local server/worker state all added friction. I preserved the practical quota estimate, but not exact token accounting. For comparisons across future runs, it would be useful to track scan time, model invocations, accepted findings, rejected findings, duplicates, and provider-reported token usage.
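One way to track those numbers is a small per-run record that can be appended to a log file. This is a minimal sketch: the `ScanRunMetrics` name and its fields are hypothetical, and the sample values are placeholders, not measurements from the Coinswap run.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ScanRunMetrics:
    label: str
    scan_hours: Optional[float]       # None when not measured
    model_invocations: Optional[int]  # None when not measured
    candidate_findings: int
    accepted: int
    rejected: int
    duplicates: int
    provider_tokens: Optional[int]    # None without a provider token export

    def acceptance_rate(self) -> float:
        """Share of candidate findings that survived triage."""
        if self.candidate_findings == 0:
            return 0.0
        return self.accepted / self.candidate_findings


# Placeholder values, for illustration only.
run = ScanRunMetrics(
    label="example-run",
    scan_hours=None,
    model_invocations=None,
    candidate_findings=10,
    accepted=3,
    rejected=5,
    duplicates=2,
    provider_tokens=None,
)

record = json.dumps(asdict(run))
print(record)
print(f"acceptance rate: {run.acceptance_rate():.0%}")
```

Leaving unmeasured fields as `None` keeps an estimate from being mistaken for a measurement when runs are compared later.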

I would use Loupe again, but with a narrower process:

scan one subsystem first instead of the whole repository
deduplicate findings before opening issues
rank by reachability and impact, not scanner-reported severity alone
require either a failing test or a clear attack path before promoting a finding

That process would make the next audit less noisy and more credible.
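The last rule in that list is easy to state as a gate. The sketch below assumes hypothetical inputs: a flag for whether a failing test exists and a free-text attack path written during triage.

```python
def can_promote(has_failing_test: bool, attack_path: str) -> bool:
    """Promote only with a failing test or a non-empty attack path description."""
    return has_failing_test or bool(attack_path.strip())


# Made-up candidates, for illustration only.
candidates = {
    "with-test": (True, ""),
    "with-path": (False, "peer sends an unvalidated value during the swap"),
    "neither": (False, "   "),
}

promoted = [name for name, (test, path) in candidates.items() if can_promote(test, path)]
print(promoted)
```

A check like this does not judge whether the attack path is convincing. It only stops a finding from becoming an issue with no evidence attached.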

Verdict

For someone like me working on testing and quality assurance, tools like Loupe accelerate the early stage of the work: finding places worth reviewing and giving each one enough structure to act on.

The useful result was not a clean list of bugs. It was a better backlog: some fixes got merged, some concerns became issues, and some reports were discarded after reading the surrounding code. The next run should be narrower and measured better.