
    The 33 Checks Every CLAUDE.md Should Pass

    When we cataloged the failure modes of CLAUDE.md files at scale, we expected to find ten or fifteen. We ended up with thirty-three. They all share one property: each is a specific, falsifiable problem that the author probably didn't notice and that the agent silently routed around. This post walks through the thirty-three checks AgentLint runs, grouped by the kind of damage they cause, and explains why each one earned its place.

    Why thirty-three and not five

    The first version of the linter had nine checks. We thought that covered it. Then we ran it against the next thirty CLAUDE.md files we could find — open-source repos, our own internal projects, files contributors had emailed us — and watched the same patterns appear in shapes the original nine couldn't catch. Vague rules wearing the costume of specific rules. Specific rules contradicted by their own examples. Section headings that promised one thing and delivered another. By the time the catalog stabilized, it had thirty-three checks across five dimensions. Not because thirty-three is a magic number, but because that's where new checks stopped catching anything the old ones missed.

    The five dimensions, in order of how much pain they cause when violated: correctness, specificity, structure, enforcement, and hygiene. Most CLAUDE.md files fail at specificity first. A few fail at correctness so badly that the agent ignores the file entirely.

    Correctness checks (8 of 33)

    These are the checks that ask: does the file say true things about the project as it exists right now?

    The first is stale path references: every file path mentioned in CLAUDE.md must resolve. If the file says "see scripts/deploy.sh," scripts/deploy.sh has to exist. The check is a one-liner that greps every quoted path and shells out to test -e. It catches the most embarrassing class of drift: rules that reference files deleted six commits ago.
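
    A minimal sketch of that check, assuming paths show up as slash-containing tokens in the text. The regex and the URL-skipping heuristic here are illustrative, not AgentLint's actual extraction:

    ```python
    import re
    from pathlib import Path

    # Heuristic: a token containing a slash is treated as a candidate path.
    PATH_RE = re.compile(r"\b[\w.-]+/[\w./-]+\b")

    def stale_paths(claude_md: Path, repo_root: Path) -> list[str]:
        findings = []
        for token in PATH_RE.findall(claude_md.read_text()):
            if token.startswith(("http", "www.")):   # skip URLs
                continue
            if not (repo_root / token).exists():     # the test -e equivalent
                findings.append(f"stale path: {token}")
        return findings

    print(stale_paths(Path("CLAUDE.md"), Path(".")))
    ```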

    The second is stale command references: every shell command mentioned must be runnable. The check verifies syntax (will shellcheck parse it?) and, where possible, runs the binary with --help to confirm it exists. Rules like "run bun test" silently break when the project migrates to npm.
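
    A sketch of the same idea, under the assumption that commands appear in backtick spans and the first word is the binary; it substitutes a PATH lookup for the --help run, and only invokes shellcheck if it is installed:

    ```python
    import re
    import shutil
    import subprocess
    from pathlib import Path

    CMD_RE = re.compile(r"`([^`\n]+)`")

    def stale_commands(claude_md: Path) -> list[str]:
        findings = []
        for span in CMD_RE.findall(claude_md.read_text()):
            words = span.split()
            if not words:
                continue
            if shutil.which(words[0]) is None:   # does the binary exist at all?
                findings.append(f"unknown binary: {words[0]}")
            elif shutil.which("shellcheck"):
                # shellcheck reads a script from stdin when given "-".
                result = subprocess.run(
                    ["shellcheck", "--shell=bash", "-"],
                    input=span.encode(), capture_output=True,
                )
                if result.returncode != 0:
                    findings.append(f"shellcheck rejects: {span}")
        return findings
    ```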

    The third is contradicting rules within the file: pairs of rules where one explicitly forbids what another explicitly requires. The checker uses a small NLI model on rule-shaped sentences and surfaces high-confidence contradictions. The most common pattern: an early rule says "all changes go through PR" and a later rule says "small fixes can go directly to main."
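
    The pairwise pass is easy to sketch if you treat the model as a black box. Here nli_contradiction is a hypothetical scorer returning P(contradiction) for an ordered sentence pair, and the 0.9 cutoff is an assumed stand-in for "high confidence":

    ```python
    from itertools import combinations
    from typing import Callable

    THRESHOLD = 0.9  # assumed "high confidence" cutoff

    def find_contradictions(
        rules: list[str],
        nli_contradiction: Callable[[str, str], float],  # hypothetical NLI scorer
    ) -> list[tuple[str, str]]:
        flagged = []
        for a, b in combinations(rules, 2):
            # NLI is directional, so score both orderings.
            if max(nli_contradiction(a, b), nli_contradiction(b, a)) > THRESHOLD:
                flagged.append((a, b))
        return flagged
    ```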

    The fourth is CLAUDE.md vs README contradiction: the same project documented two ways in two files. CLAUDE.md says the build command is X, README says it's Y. One of them is wrong. The check picks both files apart, normalizes the claims, and flags conflicts.

    The fifth is CLAUDE.md vs CI contradiction: rules whose enforcement is supposed to live in CI, but the CI file does not actually enforce them. "Tests must pass before merge" — but the GitHub Actions workflow has no test step.
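
    A sketch of the CI side, assuming GitHub Actions and PyYAML; the keyword list for "this step runs tests" is an assumption:

    ```python
    from pathlib import Path
    import yaml  # PyYAML

    TEST_KEYWORDS = ("test", "pytest", "jest", "vitest")

    def ci_runs_tests(workflow_dir: Path = Path(".github/workflows")) -> bool:
        for wf in workflow_dir.glob("*.y*ml"):
            doc = yaml.safe_load(wf.read_text()) or {}
            for job in doc.get("jobs", {}).values():
                for step in job.get("steps", []) or []:
                    run_cmd = step.get("run", "") or ""
                    if any(kw in run_cmd for kw in TEST_KEYWORDS):
                        return True
        return False

    # A "tests must pass before merge" rule plus ci_runs_tests() == False is a finding.
    ```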

    The sixth is language pinning: if the project uses TypeScript, the file should say so. If it uses Python with uv, the file should say uv not pip. The check parses package manager files (package.json, pyproject.toml, Cargo.toml) and cross-references against the rules.
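
    A sketch of the cross-reference for two common cases (uv and TypeScript); the manifest-to-claim mapping here is illustrative:

    ```python
    import json
    from pathlib import Path

    def pinning_findings(repo: Path, claude_text: str) -> list[str]:
        findings = []
        # Python: a uv.lock means the installer should be uv, not pip.
        if (repo / "uv.lock").exists() and "uv" not in claude_text:
            findings.append("project uses uv but CLAUDE.md never mentions it")
        # TypeScript: a typescript dependency should be named in the file.
        pkg = repo / "package.json"
        if pkg.exists():
            manifest = json.loads(pkg.read_text())
            deps = {**manifest.get("dependencies", {}),
                    **manifest.get("devDependencies", {})}
            if "typescript" in deps and "TypeScript" not in claude_text:
                findings.append("project uses TypeScript but CLAUDE.md never mentions it")
        return findings
    ```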

    The seventh is deprecated tooling: rules that mention tools the project has migrated away from. "Use lerna run" in a project that switched to bun workspaces. The check has a small registry of common migrations and surfaces likely cases.
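
    The registry can be as simple as a mapping from an old-tool marker to the files that evidence the migration. These entries are illustrative, not the shipped registry:

    ```python
    from pathlib import Path

    MIGRATIONS = {
        "lerna": ["pnpm-workspace.yaml", "bun.lockb"],     # monorepo runner moved on
        "pip install": ["uv.lock"],                        # installer moved to uv
        "tslint": [".eslintrc.json", "eslint.config.js"],  # linter moved to eslint
    }

    def deprecated_tooling(repo: Path, claude_text: str) -> list[str]:
        return [
            f"CLAUDE.md mentions {old!r} but the repo shows a migration"
            for old, evidence in MIGRATIONS.items()
            if old in claude_text and any((repo / f).exists() for f in evidence)
        ]
    ```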

    The eighth is wrong default branch: rules referencing "master" when the repo's default is "main," or vice versa.
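
    This one reduces to asking git, assuming the clone has origin/HEAD set (a normal git clone sets it):

    ```python
    import subprocess

    def default_branch() -> str:
        # "origin/main" -> "main"; raises if origin/HEAD is not set.
        ref = subprocess.run(
            ["git", "symbolic-ref", "--short", "refs/remotes/origin/HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return ref.split("/", 1)[1]

    def wrong_branch_findings(claude_text: str) -> list[str]:
        actual = default_branch()
        stale = "master" if actual == "main" else "main"
        if stale in claude_text:
            return [f"file references {stale!r}; default branch is {actual!r}"]
        return []
    ```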

    Specificity checks (10 of 33)

    These check whether the rules are concrete enough that the agent can act on them.

    Hedge-word density: the count of "should," "may," "consider," "if appropriate" per hundred words. Above a threshold, the file is advisory rather than directive. The fix is to rewrite hedges into specific rules.
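
    A minimal version of the metric; the hedge list and the threshold of three per hundred words are illustrative defaults, not AgentLint's shipped values:

    ```python
    import re

    HEDGES = ["should", "may", "consider", "if appropriate", "try to", "ideally"]
    THRESHOLD = 3.0  # assumed: hedges per hundred words

    def hedge_density(text: str) -> float:
        words = max(len(text.split()), 1)
        hits = sum(
            len(re.findall(rf"\b{re.escape(h)}\b", text, re.IGNORECASE))
            for h in HEDGES
        )
        return 100 * hits / words

    text = "You should consider adding tests if appropriate."
    print(f"{hedge_density(text):.1f} hedges per 100 words")  # 42.9
    ```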

    Unfalsifiable claims: rules of the form "write good code" or "be careful." The check flags any imperative whose verb is too abstract to verify (good, careful, clean, maintainable, robust, proper).

    Missing thresholds: rules that name a quality without quantifying it. "Tests should be fast" without a time budget. "Files shouldn't be too long" without a line limit.

    Unverifiable claims: rules that refer to the agent's intentions ("understand the code") or feelings ("feel confident"). The agent cannot verify either.

    Stack of synonyms: three or more near-synonyms in a single rule ("clear, concise, readable") which adds words but no specificity.

    Unscoped "always" / "never": absolute rules without an exception clause when the codebase obviously has exceptions.

    Missing examples: rules where a concrete example would compress the rule by half and double its compliance rate.

    Vague references to "best practices": pointer-without-target rules. "Follow JavaScript best practices" — whose best practices, which year, which document.

    Aspirational rules: rules that describe what the team wishes were true rather than what is true. "We always write tests first." If you don't, the rule is a lie.

    Blanket prohibition without rationale: rules forbidding something without explaining why. The agent will follow the rule until it discovers an edge case that seems to violate the intent — at which point, with no rationale, the agent has to guess.

    Structure checks (6 of 33)

    These ask whether the file is shaped well enough that the agent can navigate it cheaply.

    Section length cap: any single section over a configurable budget (default 80 lines). Long sections get skimmed rather than read, by humans and agents alike.

    Total length cap: the file as a whole. Default 300 lines. AgentLint warns at the cap and errors well above it.
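
    Both caps fall out of one pass over the lines, as in this sketch (the heading detection is simplified; the 80- and 300-line defaults come from the post):

    ```python
    def length_findings(lines: list[str],
                        section_cap: int = 80,
                        total_cap: int = 300) -> list[str]:
        findings = []
        if len(lines) > total_cap:
            findings.append(f"file is {len(lines)} lines (cap {total_cap})")
        heads = [i for i, line in enumerate(lines) if line.lstrip().startswith("#")]
        for begin, end in zip(heads, heads[1:] + [len(lines)]):
            if end - begin > section_cap:
                findings.append(
                    f"section at line {begin + 1} runs {end - begin} lines "
                    f"(cap {section_cap})"
                )
        return findings
    ```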

    Required sections: configurable list of section headings the file must include. Defaults to the five sections from blog 01: project intro, how-we-work, language/style, operational notes, principles.

    Heading hierarchy: H1 → H2 → H3 without skipping levels. Common drift: H1 then H4 because someone copy-pasted a snippet.
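
    The check itself is a short walk over heading levels:

    ```python
    import re

    HEADING_RE = re.compile(r"^(#{1,6})\s", re.MULTILINE)

    def hierarchy_findings(text: str) -> list[str]:
        findings, prev = [], 0
        for match in HEADING_RE.finditer(text):
            level = len(match.group(1))
            if prev and level > prev + 1:  # e.g. H1 followed directly by H4
                findings.append(f"heading level jumps H{prev} -> H{level}")
            prev = level
        return findings
    ```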

    Duplicate headings: the same heading appearing twice usually means two people added the same section without seeing each other's edit.

    Orphan paragraphs: text outside any section. The check flags paragraphs that sit before the first heading, or that otherwise attach to no section.

    Enforcement checks (5 of 33)

    These ask: for the rules that are supposed to be machine-enforced, does the enforcement actually exist?

    Pre-commit hook coverage: rules that say "run lint before commit" need a .husky/pre-commit or equivalent that actually runs lint.
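
    A sketch of the coverage test; the hook locations and lint keywords are assumptions meant to span husky and plain git hooks:

    ```python
    from pathlib import Path

    HOOK_PATHS = (".husky/pre-commit", ".git/hooks/pre-commit")
    LINT_KEYWORDS = ("lint", "eslint", "ruff", "clippy")

    def precommit_covers_lint(repo: Path) -> bool:
        for rel in HOOK_PATHS:
            hook = repo / rel
            if hook.is_file() and any(kw in hook.read_text() for kw in LINT_KEYWORDS):
                return True
        return False

    # A "run lint before commit" rule plus precommit_covers_lint() == False is a finding.
    ```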

    CI step coverage: rules that say "tests must pass" need a CI step that runs them, with required-status protection.

    Lockfile rule coverage: if the file says "do not edit lockfiles by hand," there should be a CI guard or pre-commit that blocks lockfile edits outside npm install.

    Secret-scanning rule coverage: if the file says "no secrets in commits," there should be a .gitignore covering .env* and a secret scanner in CI.

    License-header rule coverage: if the file mandates license headers in source files, the check verifies that a percentage of source files actually have them.

    Hygiene checks (4 of 33)

    The smallest dimension and the one with the lowest false-positive rate.

    TODO without owner or date: TODO comments inside CLAUDE.md without a name or date eventually become permanent.

    Trailing whitespace and mixed line endings: small but corrosive.

    Inconsistent fence style: triple-backtick code fences with inconsistent language tags or fence widths.

    Stale "as of" dates: rules with a date stamp ("as of 2024-12") that haven't been touched in over six months.

    A worked example: one file, eleven failures

    We ran the linter against an open-source repo with a 240-line CLAUDE.md. The report came back with eleven distinct findings. Three were correctness (one stale path, two stale commands), four were specificity (three hedge clusters and one unfalsifiable claim), two were structure (one section over the length cap, one heading hierarchy skip), one was enforcement (a "tests must pass" rule with no CI step), and one was hygiene (a stale "as of 2024-09" stamp). Fixing all eleven took the maintainer about forty minutes and removed about sixty lines from the file. The agent's behavior on the project, measured by a small benchmark of agent-completed PRs, became visibly more consistent the next week.

    The point of the linter is not to score CLAUDE.md files. It is to remove the most common failure modes before the agent has to navigate them. A file that passes all thirty-three checks is not necessarily a great CLAUDE.md — but it is unlikely to silently mislead an agent. That is the bar.

    Where the catalog goes next

    Thirty-three is the current number, but we add checks as new patterns emerge. The next two we are testing are around AGENTS.md (which is becoming the multi-tool successor to CLAUDE.md) and around .cursor/rules files (which have their own failure modes specific to Cursor's rule-loading semantics). If you find a pattern that AgentLint doesn't catch and you think it should, the contribution path is one issue, one rule definition, one test fixture. The catalog grows by encountering files in the wild — and the more files we run it against, the better it gets.
