Filtering events

Compose declarative event filters from small builders over named junk categories.

The parser is deliberately non-lossy: it hands you every event, including the structural noise, synthetic turns, and sidechains that most consumers want gone. Filtering is your job, and cc_transcript gives you a model for expressing it as data rather than as a hand-rolled loop.

The central type is FilterSpec, which expresses a filter as data: a sequence of Clause rules. Each clause pairs a predicate (a condition on one event) with an action (DROP or TAG). You rarely write clauses by hand; instead you compose a spec from small builders (keep_only, drop_junk, drop_short, …) that each return a frozen clause (or a tuple of them). Because the spec is plain data, the same spec works at two moments. apply_spec filters events you already hold, as this guide does. A spec passed to the parser is instead serialized to JSON and executed inside Rust, dropping events before a Python object is ever built. This guide covers the builder vocabulary and the spec model; for the engine itself and how its correctness is pinned, see The Rust engine.
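To make "filters as data" concrete, here is a minimal stand-alone sketch of why a data-only spec can cross a process boundary. It does not use cc_transcript: ToyClause, ToySpec, and the predicate dictionaries are hypothetical stand-ins, and the JSON layout is an assumption made for illustration, not the library's actual wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyClause:
    """Hypothetical stand-in for a clause: a predicate description plus an action."""
    predicate: dict               # plain data describing the condition, not a callable
    action: str                   # "DROP" or "TAG"
    label: Optional[str] = None   # only meaningful for TAG clauses


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a spec: nothing but a tuple of clauses."""
    clauses: tuple


toy_spec = ToySpec(clauses=(
    ToyClause({"op": "kind_not_in", "kinds": ["user", "assistant"]}, "DROP"),
    ToyClause({"op": "max_words", "n": 2, "only_from": ["user"]}, "DROP"),
))

# Every field is plain data, so the whole spec serializes with no custom encoder.
wire = json.dumps(asdict(toy_spec))

# Any consumer that understands the layout can rebuild the rules from the string.
restored = json.loads(wire)
print(restored["clauses"][0]["action"])
```

A spec built from lambdas or closures could not be shipped this way, which is the practical reason for describing predicates as data.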

Every cell below runs against the same parsed fixture, so parse it first:

from cc_transcript import parse_events_from_bytes

TRANSCRIPT = b"""\
{"type":"user","uuid":"u1","sessionId":"sess-1","timestamp":"2026-01-02T03:04:05.000Z","message":{"role":"user","content":"fix the failing test"}}
{"type":"assistant","uuid":"a1","sessionId":"sess-1","timestamp":"2026-01-02T03:04:09.000Z","message":{"role":"assistant","model":"claude-opus-4-7","stop_reason":"end_turn","content":[{"type":"text","text":"Fixed it - the off-by-one in the loop bound is gone."}]}}
{"type":"user","uuid":"u2","sessionId":"sess-1","timestamp":"2026-01-02T03:04:20.000Z","message":{"role":"user","content":"<system-reminder>Background context, not written by the user.</system-reminder>"}}
{"type":"user","uuid":"u3","sessionId":"sess-1","timestamp":"2026-01-02T03:04:30.000Z","message":{"role":"user","content":"thanks"}}
{"type":"system","uuid":"s1","sessionId":"sess-1","timestamp":"2026-01-02T03:04:31.000Z","subtype":"stop_hook_summary","content":"hook ran"}
"""

events = parse_events_from_bytes(TRANSCRIPT)
[type(e).__name__ for e in events]
# -> ['UserEvent', 'AssistantEvent', 'UserEvent', 'UserEvent', 'SystemEvent']

['UserEvent', 'AssistantEvent', 'UserEvent', 'UserEvent', 'SystemEvent']

Five events: a real user request, an assistant reply, a <system-reminder> injection, a one-word “thanks” ack, and a system stop-hook entry. We will keep the two that carry signal and drop the other three.

Build a spec from the core drops

build_spec flattens builder fragments into a FilterSpec; apply_spec runs it over an event stream and yields survivors. Three builders cover most of what you need:

from cc_transcript import apply_spec, build_spec, drop_junk, drop_short, keep_only

spec = build_spec(keep_only("user", "assistant"), drop_junk("structural"), drop_short(2))
kept = list(apply_spec(events, spec))
[(type(e).__name__, getattr(e, "text", "")) for e in kept]
# -> [('UserEvent', 'fix the failing test'),
#     ('AssistantEvent', 'Fixed it - the off-by-one in the loop bound is gone.')]

[('UserEvent', 'fix the failing test'),
 ('AssistantEvent', 'Fixed it - the off-by-one in the loop bound is gone.')]

Each clause removes exactly one line of the fixture:

keep_only("user", "assistant") drops the system stop-hook entry s1 — its kind is not in the allow-set.
drop_junk("structural") drops the <system-reminder> user line u2 — its text matches the structural junk category.
drop_short(2) drops the one-word “thanks” user line u3 — at most two words. The four-word u1 (“fix the failing test”) survives, and assistant turns are untouched (drop_short defaults to users only).

What is left is the request and the reply: the semantic content. Note that clause order never changes the keep/drop result. An event is dropped if any DROP clause matches it and kept otherwise, so for filtering purposes a spec behaves as a set of rules, not a sequence of steps.
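The order-independence claim can be checked with a small stand-alone model. This is a sketch under stated assumptions, not the library's implementation: events are plain dictionaries, the three predicates are hypothetical approximations of the builders used above, and toy_keep is simply "no DROP predicate matches".

```python
from itertools import permutations

# Plain-dict stand-ins for the five fixture events.
toy_events = [
    {"kind": "user", "text": "fix the failing test"},
    {"kind": "assistant", "text": "Fixed it - the off-by-one in the loop bound is gone."},
    {"kind": "user", "text": "<system-reminder>Background context, not written by the user.</system-reminder>"},
    {"kind": "user", "text": "thanks"},
    {"kind": "system", "text": "hook ran"},
]


def kind_not_allowed(event):
    # Approximates keep_only("user", "assistant").
    return event["kind"] not in {"user", "assistant"}


def looks_structural(event):
    # Hypothetical approximation of the structural junk category.
    return event["text"].startswith("<system-reminder>")


def short_user_turn(event):
    # Approximates drop_short(2): user turns with at most two whitespace-split words.
    return event["kind"] == "user" and len(event["text"].split()) <= 2


def toy_keep(event, drop_predicates):
    # An event survives only if no DROP predicate matches it.
    return not any(predicate(event) for predicate in drop_predicates)


drops = (kind_not_allowed, looks_structural, short_user_turn)

# Apply every ordering of the three rules and collect the distinct outcomes.
outcomes = {
    tuple(e["text"] for e in toy_events if toy_keep(e, ordering))
    for ordering in permutations(drops)
}
print(len(outcomes))
```

All six orderings produce the same survivors, because any() returns the same truth value regardless of the order in which the predicates are tried.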

The builder vocabulary

Every builder returns a frozen Clause (or a tuple of them) and lives at the top level of cc_transcript. The full set, followed by build_spec, which combines their results into a spec:

from cc_transcript import (
    keep_only, drop_synthetic, drop_empty, drop_sidechain, drop_meta_flag,
    drop_compacted, drop_entrypoints, drop_junk, drop_phrases, drop_short, build_spec,
)

keep_only("user", "assistant")          # drop every event whose kind is not listed
drop_synthetic()                        # drop assistant turns with the <synthetic> model
drop_empty(only_from=USERS)             # drop blank-text events of one kind (only_from required)
drop_sidechain(except_assistants=False) # drop sidechain events; True keeps assistant sidechains
drop_meta_flag("is_meta", only_from=…)  # drop events whose EntryMeta boolean flag is set
drop_compacted()                        # drop compaction-summary + transcript-only entries (a tuple)
drop_entrypoints({"resume", "vscode"})  # drop events whose meta.entrypoint is in the set
drop_junk("structural", "interrupt")    # drop text matching any named JUNK_CATEGORIES group
drop_phrases(TRIVIAL_ACK_SET)           # drop events whose normalized text is one of the phrases
drop_short(3)                           # drop events with at most N whitespace-split words
build_spec(*fragments)                  # flatten Clause / tuple[Clause, ...] fragments into a spec

Two shapes to keep in mind. First, the text-content drops — drop_junk, drop_phrases, and drop_short — default to only_from={"user"}, because trimming assistant prose by length or phrase is rarely what you want; drop_meta_flag and drop_entrypoints default to all kinds, and drop_empty makes only_from required (it keys “consider tool-use blocks as non-empty” off the kind). Second, drop_compacted returns a tuple of clauses rather than a single one; build_spec flattens tuples transparently, so you drop it into a composition exactly like any single-clause builder.
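The flattening behaviour described above can be pictured with a short stand-alone sketch. It assumes nothing about how build_spec is actually written: ToyClause and toy_build_spec are hypothetical names, and the only point is that a fragment may be either one clause or a tuple of clauses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyClause:
    """Hypothetical stand-in for a frozen clause; only its identity matters here."""
    name: str


def toy_build_spec(*fragments):
    """Flatten single clauses and tuples of clauses into one flat tuple."""
    flat = []
    for fragment in fragments:
        if isinstance(fragment, tuple):
            flat.extend(fragment)   # multi-clause builder result
        else:
            flat.append(fragment)   # single-clause builder result
    return tuple(flat)


single = ToyClause("drop_short")
pair = (ToyClause("compaction_summary"), ToyClause("transcript_only"))

flat = toy_build_spec(single, pair)
print([clause.name for clause in flat])
```

The caller never has to know which builders return one clause and which return several.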

Named junk categories

The historical junk regex was monolithic. It is now split into named categories so a consumer can be surgical about what counts as noise:

from cc_transcript.filterspec import JUNK_CATEGORIES

sorted(JUNK_CATEGORIES)
# -> ['agent_injection', 'command_echo', 'continuation', 'interrupt', 'stop_hook', 'structural']

['agent_injection',
 'command_echo',
 'continuation',
 'interrupt',
 'stop_hook',
 'structural']

structural covers framework scaffolding (<system-reminder>, command-output tags, local-command caveats); agent_injection covers teammate-message and foreign-agent banners. Those two are pure noise. interrupt (a user hitting stop mid-turn) and stop_hook (stop-hook feedback) are different: both carry user pushback, a signal you usually want to keep.

That is the point of the split. drop_junk("structural", "agent_injection") strips the noise families and leaves interrupt and stop-hook messages intact unless you name them explicitly. Note that JUNK_CATEGORIES is imported from cc_transcript.filterspec, not from the top-level package.
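As a rough picture of what splitting one regex into named groups buys you, consider the stand-alone sketch below. The two patterns and the sample strings are hypothetical and far simpler than anything the library would use; TOY_CATEGORIES and toy_is_junk are illustrative names, not part of cc_transcript.

```python
import re

# Hypothetical, deliberately simplified patterns keyed by category name.
TOY_CATEGORIES = {
    "structural": re.compile(r"^<system-reminder>"),
    "interrupt": re.compile(r"^\[Request interrupted"),
}


def toy_is_junk(text, *names):
    """Return True if the text matches any of the named categories."""
    return any(TOY_CATEGORIES[name].search(text) for name in names)


reminder = "<system-reminder>Background context.</system-reminder>"
interrupt = "[Request interrupted by user]"

# Naming only "structural" leaves the interrupt message alone.
print(toy_is_junk(reminder, "structural"))
print(toy_is_junk(interrupt, "structural"))

# The interrupt message counts as junk only when its category is named explicitly.
print(toy_is_junk(interrupt, "structural", "interrupt"))
```

With a single monolithic pattern the second call could not be distinguished from the third; naming the categories is what makes the choice available to the caller.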

A ready-made structural-noise spec

For the common “just remove the framework chatter” case, the package ships NOISE_SPEC — equal to build_spec(drop_junk("structural")):

from cc_transcript import NOISE_SPEC, apply_spec

[type(e).__name__ for e in apply_spec(events, NOISE_SPEC)]
# -> ['UserEvent', 'AssistantEvent', 'UserEvent', 'SystemEvent']

['UserEvent', 'AssistantEvent', 'UserEvent', 'SystemEvent']

It drops only the universal structural noise — the <system-reminder> line u2 — and keeps everything else: the request, the reply, the short “thanks” ack u3, and the system event s1. Because it never touches length, phrasing, or kind, NOISE_SPEC is a safe default that won’t accidentally discard semantic content. Layer your own policy on top when you need more.

DROP vs TAG: annotating survivors

A clause’s action is either DROP or TAG. DROP removes the event: a single matching DROP clause is enough, so evaluation stops at the first one that matches. TAG instead records a label on a surviving event and keeps going, so you can annotate without filtering. Build a spec that reuses the drops from earlier and tags the assistant turn:

from cc_transcript import FilterSpec, annotate_spec
from cc_transcript.filterspec import Action, Clause, KindIs

tag_spec = FilterSpec(clauses=(
    *spec.clauses,
    Clause(predicate=KindIs(frozenset({"assistant"})), action=Action.TAG, label="assistant-turn"),
))
[(type(e).__name__, labels) for e, labels in annotate_spec(events, tag_spec)]
# -> [('UserEvent', ()), ('AssistantEvent', ('assistant-turn',))]

[('UserEvent', ()), ('AssistantEvent', ('assistant-turn',))]

annotate_spec yields (event, labels) for every survivor, where labels are the labels of all TAG clauses that matched. The user turn collects none; the assistant turn collects assistant-turn. The per-event primitives behind these helpers — keep(event, spec) and labels_for(event, spec) — are exported too when you need to test one event at a time.
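The relationship between the two actions can be modelled in a few lines. This stand-alone sketch is an assumption about the semantics described above, not the library's code: clauses are (predicate, action, label) triples, and toy_keep, toy_labels_for, and toy_annotate are hypothetical analogues of keep, labels_for, and annotate_spec.

```python
def toy_keep(event, clauses):
    # An event survives unless some DROP clause matches it.
    return not any(action == "DROP" and predicate(event)
                   for predicate, action, _ in clauses)


def toy_labels_for(event, clauses):
    # Collect the label of every TAG clause that matches, in clause order.
    return tuple(label for predicate, action, label in clauses
                 if action == "TAG" and predicate(event))


def toy_annotate(events, clauses):
    # Yield (event, labels) for survivors only; dropped events yield nothing.
    for event in events:
        if toy_keep(event, clauses):
            yield event, toy_labels_for(event, clauses)


toy_events = [
    {"kind": "user", "text": "fix the failing test"},
    {"kind": "assistant", "text": "Fixed it."},
    {"kind": "system", "text": "hook ran"},
]

toy_clauses = [
    (lambda e: e["kind"] not in {"user", "assistant"}, "DROP", None),
    (lambda e: e["kind"] == "assistant", "TAG", "assistant-turn"),
]

annotated = [(e["kind"], labels) for e, labels in toy_annotate(toy_events, toy_clauses)]
print(annotated)
```

TAG clauses never affect survival in this model; they are consulted only for events that every DROP clause has already let through.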

Composing a consumer policy

Builders compose into whatever policy your consumer needs. A realistic “keep the pushback, drop the noise” spec strips structural and agent-injection noise and trivial acks, trims very short turns, and keeps interrupt and stop-hook messages by never naming those categories:

from cc_transcript import build_spec, keep_only, drop_junk, drop_phrases, drop_short
from cc_transcript.filterspec import TRIVIAL_ACK_SET

pushback_spec = build_spec(
    keep_only("user", "assistant"),
    drop_junk("structural", "agent_injection"),
    drop_phrases(TRIVIAL_ACK_SET),
    drop_short(3),
)

This is the shape cc-steer composes: the library ships the primitives — the predicates, builders, and named categories — and the consumer owns the policy that combines them. For a walkthrough of designing a spec for your own use case, see Compose your own policy.