SemLoc

Structured Grounding of Free-Form LLM Reasoning
for Fault Localization

¹ University of California, Riverside   ² Cisco Research   ³ Meta
SemLoc workflow diagram
SemLoc infers semantic constraints anchored to program structures, builds a constraint-by-test violation matrix (semantic spectrum), and applies counterfactual verification to localize faults causally.

Abstract

Fault localization identifies program locations responsible for observed failures. Existing techniques rank suspicious code using syntactic spectra: signals derived from execution structure such as statement coverage, control-flow divergence, or dependency reachability. These signals collapse for semantic bugs, where failing and passing executions traverse identical code paths and differ only in whether semantic intent is honored. Recent LLM-based approaches extend localization with semantic reasoning, but they produce stochastic, unverifiable outputs: responses vary across runs, cannot be systematically cross-referenced against runtime evidence from multiple tests, and provide no mechanism for distinguishing cascading downstream symptoms from root-cause violations.

We present SemLoc, a fault localization framework built on structured semantic grounding. Our key novelty is that SemLoc converts free-form LLM semantic reasoning into a closed intermediate representation that binds each inferred property to a specific, typed program anchor, making it checkable at runtime and attributable to a concrete program structure. SemLoc executes instrumented programs and constructs a semantic violation spectrum, which is a constraint-by-test violation matrix, from which suspiciousness scores are derived analogously to coverage-based fault localization. A counterfactual verification step further prunes over-approximate constraints, distinguishing primary causal violations from cascading effects and providing a structured, verifiable semantic explanation beyond line-level rankings.

Results

Acc@1: 42.8% (top-1 fault localization accuracy)
Acc@5: 77% (top-5 fault localization accuracy)
Suspicious lines: 7.3% of lines flagged, vs. 44% for SBFL (roughly six times fewer suspicious lines)
Figure (RQ1): Acc@1 / Acc@3 / Acc@5 comparison across SemLoc, SBFL baselines (Ochiai, Tarantula), Delta Debugging, and ablation configurations.

How It Works

1. Semantic Constraint Inference

An LLM analyzes the buggy program and failing tests to infer typed behavioral constraints — bound to specific program anchors (entries, def-use sites, loop boundaries, return points) in a closed, checkable schema.
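The closed schema could be sketched as a small typed record; the field names below are illustrative, not SemLoc's actual intermediate representation:

```python
# Hypothetical sketch of a typed, anchored semantic constraint.
# Field names are illustrative assumptions, not SemLoc's real schema.
from dataclasses import dataclass
from typing import Literal

AnchorKind = Literal["entry", "def_use", "loop_boundary", "return"]

@dataclass(frozen=True)
class SemanticConstraint:
    cid: str                 # stable constraint identifier
    anchor_kind: AnchorKind  # typed program anchor (closed set)
    anchor_line: int         # source line the anchor binds to
    predicate: str           # checkable boolean expression over live variables
    rationale: str           # LLM's natural-language justification

# Example: a non-negativity property bound to a return point.
c = SemanticConstraint("C7", "return", 42,
                       "result >= 0",
                       "account balance must remain non-negative")
```

Because every field ranges over a closed, typed domain, each inferred property can be mechanically validated and tied back to a concrete program structure rather than left as free-form text.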

2. Program Instrumentation

SemLoc uses a tree-sitter AST to inject lightweight runtime checks at each constraint anchor. Instrumented programs behave identically to originals but emit per-test violation events.
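The runtime side of such instrumentation might look like the minimal sketch below: a check hook evaluates the constraint predicate over live variables and logs a violation event without ever changing program behavior. The helper and event-log names are assumptions for illustration, not SemLoc's API:

```python
# Minimal sketch of an injected runtime check. Names (_semloc_check,
# VIOLATIONS) are illustrative assumptions, not SemLoc's actual API.
VIOLATIONS = []  # (constraint_id, test_id) violation events per run

def _semloc_check(cid, test_id, predicate, env):
    # Evaluate the predicate over the live variables; on violation,
    # record an event but never raise, so the instrumented program
    # behaves identically to the original.
    try:
        if not eval(predicate, {}, env):
            VIOLATIONS.append((cid, test_id))
    except Exception:
        VIOLATIONS.append((cid, test_id))  # count eval errors as violations

# Example of a check injected at a return anchor:
def withdraw(balance, amount):
    result = balance - amount
    _semloc_check("C7", "t_fail", "result >= 0", {"result": result})
    return result
```

The original return value is untouched; only the side-channel event log differs between the instrumented and original programs.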

3. Semantic Spectrum Analysis

Running the full test suite produces a constraint-by-test violation matrix. Suspiciousness scores are computed via the Ochiai coefficient — analogous to coverage-based SBFL, but operating on semantic properties.
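The Ochiai scoring step can be sketched directly over the violation matrix; the function below assumes a simple dict-of-dicts representation, which is an illustrative choice rather than SemLoc's internal format:

```python
# Sketch: Ochiai suspiciousness over a semantic violation spectrum.
# Data layout (dict-of-dicts) is an illustrative assumption.
from math import sqrt

def ochiai(violations, outcomes):
    """violations[c][t]: True if constraint c was violated in test t;
    outcomes[t]: True if test t failed."""
    scores = {}
    total_failed = sum(outcomes.values())
    for c, row in violations.items():
        fail_viol = sum(1 for t, v in row.items() if v and outcomes[t])
        pass_viol = sum(1 for t, v in row.items() if v and not outcomes[t])
        denom = sqrt(total_failed * (fail_viol + pass_viol))
        scores[c] = fail_viol / denom if denom else 0.0
    return scores

# A constraint violated only in failing tests scores 1.0; one violated
# in passing tests as well is discounted, exactly as in coverage-based SBFL.
```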

4. Counterfactual Verification

For each violated constraint, SemLoc applies a minimal hypothetical patch that toggles the constraint and re-executes the test suite. This distinguishes primary causal violations from cascading downstream effects, yielding a causally grounded final ranking.
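The counterfactual step can be sketched as a toy loop: a constraint is kept as a primary cause only if patching it reduces the set of failing tests. `run_tests` below is a stand-in for re-executing the real suite, and the decision rule is an illustrative simplification:

```python
# Toy sketch of counterfactual pruning. run_tests is a stand-in for
# re-running the suite with one constraint's hypothetical patch applied.

def counterfactual_rank(candidates, run_tests):
    """candidates: constraint ids ordered by suspiciousness;
    run_tests(patched): set of failing test ids under that patch."""
    baseline_failures = run_tests(patched=None)
    primary, cascading = [], []
    for cid in candidates:
        if len(run_tests(patched=cid)) < len(baseline_failures):
            primary.append(cid)      # patch flips failures: causal
        else:
            cascading.append(cid)    # patch changes nothing: symptom
    return primary + cascading       # primary causes ranked first

# Toy suite: patching C1 fixes the failure; C2 is a downstream symptom.
def run_tests(patched):
    return set() if patched == "C1" else {"t_fail"}

assert counterfactual_rank(["C2", "C1"], run_tests) == ["C1", "C2"]
```

Even though C2 was ranked above C1 by suspiciousness alone, the counterfactual check demotes it because patching C2 leaves the failure in place.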

Quick Start

Install
pip install git+https://github.com/jerry729/semloc
Try the demo (no API key needed)
semloc demo --skip-llm

Runs the full pipeline on a bundled benchmark example using pre-computed constraints and prints ranked suspicious lines with the ground truth fault highlighted.

Localize a bug in your own program
export OPENAI_API_KEY="..."   # or GEMINI_API_KEY / Vertex AI credentials

semloc locate \
  --program  my_module.py \
  --tests    test_my_module.py \
  --model    gpt-4o \
  --out-dir  run1
Run individual steps
# Re-run only scoring (reuse prior violations)
semloc locate --program my_module.py --tests test_my_module.py \
              --out-dir run1 --steps score,locate

# Use pre-computed constraints, skip LLM
semloc locate --program my_module.py --tests test_my_module.py \
              --constraints constraints.json --out-dir run1

SemFault-250 Benchmark

250 Python programs, each containing a single semantic fault, spanning five application domains:

💰 Financial Systems
⏱️ Rate Limiting
🗄️ Cache Management
🌊 Stream Processing
🤖 ML Operations

Each program comes with a pytest test suite (passing + failing tests) and a ground truth fault line. The benchmark is included in the repository under benchmark/.

Citation

@misc{yang2026semlocstructuredgroundingfreeform,
  title={SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization},
  author={Zhaorui Yang and Haichao Zhu and Qian Zhang and Rajiv Gupta and Ashish Kundu},
  year={2026},
  eprint={2603.29109},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2603.29109},
}