
From 0 to 1: Building an Ops Agent that Updates Tickets, Slack, Notion, and GitLab

A useful “ops” agent isn’t a chatbot that just “answers.” It’s a system that executes actions cleanly: creating and updating tickets, notifying Slack, keeping a Notion page up to date, opening or commenting on a GitLab issue/MR, and, most importantly, keeping everything consistent.

The real challenge isn’t AI. It’s orchestration, reliability, and managing side effects. And to set this up properly, having a clear methodology changes everything.

Defining the Ops Agent’s Scope

Before you code, clearly list what the agent is allowed to do.

Typical use cases

  • Incident: a Slack message like “P1 API down” creates a ticket, an “Incident” Notion page, a GitLab issue, and feeds updates into a Slack thread.
  • Support: a “bug” ticket moves to “In progress” when a GitLab issue is linked, then to “Done” when the linked MR is merged.
  • Runbook: the agent suggests steps and automatically checks off what was done in Notion.

Golden rule: one “source of truth”

Pick one central truth per object type:

  • Work status: often the ticket (Jira/Linear, etc.).
  • Documentation / post-mortem: Notion.
  • Dev work: GitLab issues/MRs.
  • Communication: Slack.

You can sync, but avoid having four places that “decide” the status.

Recommended Architecture for an Ops Agent

To make it last, think “product”: observability, auditability, security, and retries.

An event engine, not a script

The goal is to avoid one long script that breaks regularly. Tools like Make or n8n can work very well for this. A simple and robust architecture looks like this (a minimal sketch follows the list):

  1. Ingestion: webhooks (GitLab), Slack events, ticketing APIs, Notion.
  2. Normalization: convert everything into internal events (“TicketUpdated”, “MRMerged”, “SlackMessageTaggedP1”).
  3. Decision: rules + optionally an LLM (to classify, extract, summarize).
  4. Execution: idempotent actions to Slack/Notion/GitLab/ticketing.
  5. Traceability: execution logs, state, correlation.
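
Here is a minimal sketch of the normalization and decision layers in Python. The event kinds, field names, and the “MRMerged” rule are illustrative assumptions, not a fixed standard:

```python
# Minimal sketch of the "normalize then decide" layers.
# Event kinds, fields, and rules are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class InternalEvent:
    event_id: str              # unique id from the source (deduplicated later)
    source: str                # "slack" | "gitlab" | "notion" | "ticketing"
    kind: str                  # "TicketUpdated", "MRMerged", "SlackMessageTaggedP1", ...
    correlation_id: str | None
    payload: dict
    received_at: datetime


def normalize_gitlab(body: dict) -> InternalEvent:
    """Turn a raw GitLab merge request webhook into one internal event (simplified)."""
    attrs = body.get("object_attributes", {})
    kind = "MRMerged" if attrs.get("action") == "merge" else "GitLabEvent"
    return InternalEvent(
        event_id=str(attrs.get("id")),
        source="gitlab",
        kind=kind,
        correlation_id=None,   # resolved later via labels / entity_map
        payload=body,
        received_at=datetime.now(timezone.utc),
    )


def decide(event: InternalEvent) -> list[dict]:
    """Pure rules first; an LLM can plug in here for classification/extraction."""
    if event.kind == "MRMerged":
        return [{"action": "update_ticket_status", "status": "Ready for QA"}]
    return []
```

Keeping decide() a pure function of the event makes the rules easy to test and to replay.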

Minimal data schema

  • event_id (unique) + source (slack/gitlab/notion/ticketing)
  • correlation_id (e.g., incident-2026-001)
  • entity_map (mapping table):
    • incident-2026-001 ↔ ticket #123 ↔ notion page id ↔ slack thread ts ↔ gitlab issue #456
  • execution_log: action, timestamp, result, retry_count

This entity_map is often the heart of the system. You can store it in a database like Postgres or MongoDB; a minimal sketch of the schema follows.
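
As a sketch, here is the same schema expressed as Python dataclasses (the field names mirror the list above; a Postgres table or a Mongo collection with these columns works just as well):

```python
# Sketch of the mapping and logging records; persist them in Postgres or MongoDB.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EntityMap:
    correlation_id: str                  # e.g. "incident-2026-001"
    ticket_id: str | None = None         # e.g. "#123"
    notion_page_id: str | None = None
    slack_channel: str | None = None
    slack_thread_ts: str | None = None
    gitlab_project_id: int | None = None
    gitlab_issue_iid: int | None = None


@dataclass
class ExecutionLog:
    correlation_id: str
    action: str                          # e.g. "update_ticket_status"
    timestamp: datetime
    result: str                          # "ok" | "failed" | "skipped_duplicate"
    retry_count: int = 0
```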

Integrations: How to Connect Slack, Notion, GitLab, and Your Tickets

Slack: the human entry point

  • Use slash commands (e.g., /incident P1 API down) or reactions (emoji) as triggers.
  • Reply in a single thread to keep noise down.
  • Store the thread_ts as the discussion identifier (see the sketch below).
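
A minimal sketch using the official slack_sdk client; the channel, token variable, and message wording are assumptions:

```python
# Open an incident thread and keep every update in it, using slack_sdk.
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def open_incident_thread(channel: str, title: str) -> str:
    """Post the root message and return its ts; that ts identifies the thread."""
    resp = client.chat_postMessage(channel=channel, text=f":rotating_light: {title}")
    return resp["ts"]  # store as slack_thread_ts in the entity_map


def post_update(channel: str, thread_ts: str, text: str) -> None:
    """All later updates go into the same thread to keep the noise down."""
    client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=text)
```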

GitLab: dev-side automation

Saving a few minutes every time a bug happens adds up; a sketch of the label-and-comment step follows the list.

  • Webhooks: issue created/updated, MR opened/merged, pipeline status.
  • Best practices:
    • Add a label or field like correlation_id.
    • Auto-comment with a short summary + link to Notion + link to the ticket.
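
Here is a hedged sketch of that step using the python-gitlab library; the project/issue identifiers, token variable, and comment wording are assumptions:

```python
# Label a GitLab issue with the correlation_id and auto-comment with cross-links,
# using python-gitlab. Identifiers, token variable, and URLs are illustrative.
import os

import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token=os.environ["GITLAB_TOKEN"])


def tag_and_comment(project_id: int, issue_iid: int, correlation_id: str,
                    notion_url: str, ticket_url: str) -> None:
    project = gl.projects.get(project_id)
    issue = project.issues.get(issue_iid)

    # The label carries the correlation_id so later webhooks can be matched back.
    if correlation_id not in issue.labels:
        issue.labels = issue.labels + [correlation_id]
        issue.save()

    # Short summary comment with links to the Notion page and the ticket.
    issue.notes.create({
        "body": f"Tracked as `{correlation_id}`.\n"
                f"- Notion: {notion_url}\n"
                f"- Ticket: {ticket_url}"
    })
```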

Notion: memory and follow-up

  • Create an “Incident” or “Project” page using a template.
  • Update structured properties (status, owner, severity) instead of writing text everywhere.
  • Add a “Timeline” block fed by the agent (timestamped events); see the sketch below.
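
A sketch with the official notion-client SDK; the property names (“Status”, “Severity”) and their types are assumptions that must match your own database schema:

```python
# Update structured properties and append a timestamped timeline entry
# on a Notion page, using the official notion-client SDK.
# "Status" and "Severity" are assumed property names; adapt to your schema.
import os
from datetime import datetime, timezone

from notion_client import Client

notion = Client(auth=os.environ["NOTION_TOKEN"])


def update_incident_page(page_id: str, status: str, severity: str, entry: str) -> None:
    notion.pages.update(
        page_id=page_id,
        properties={
            "Status": {"select": {"name": status}},
            "Severity": {"select": {"name": severity}},
        },
    )
    # Timeline: append a timestamped paragraph block instead of free-form edits.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    notion.blocks.children.append(
        block_id=page_id,
        children=[{
            "object": "block",
            "type": "paragraph",
            "paragraph": {
                "rich_text": [{"type": "text", "text": {"content": f"{stamp} - {entry}"}}]
            },
        }],
    )
```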

Tickets (Product-side)

  • The ticket should contain links to Slack/Notion/GitLab.
  • Status transitions must be deterministic (e.g., “MR merged” → “Ready for QA”), as in the sketch below.
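
A deterministic transition table can be as small as a dictionary; the statuses and event names below are illustrative:

```python
# Deterministic status transitions kept in plain code, not in a prompt.
# Statuses and event names are illustrative.
TRANSITIONS: dict[tuple[str, str], str] = {
    ("Open", "GitLabIssueLinked"): "In progress",
    ("In progress", "MRMerged"): "Ready for QA",
    ("Ready for QA", "QAPassed"): "Done",
}


def next_status(current: str, event_kind: str) -> str:
    """Return the new status, or keep the current one if no rule matches."""
    return TRANSITIONS.get((current, event_kind), current)
```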

The Role of the LLM

AI is great for:

  • classification (incident vs question vs bug)
  • extracting fields (severity, impacted service, customer)
  • summarizing a Slack thread in 5 lines
  • proposing a runbook checklist

But don’t delegate too many decisions to it. State logic (statuses, transitions, permissions) should remain in your code.
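
One way to wire this up, sketched with the OpenAI Python client (the model name, prompt, and output fields are assumptions): the LLM returns structured fields, and deterministic code such as the next_status table above decides what actually changes.

```python
# The LLM only classifies and extracts fields; what actually changes is decided
# by deterministic code. Model name, prompt, and output fields are assumptions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_message(text: str) -> dict:
    """Return e.g. {"kind": "incident", "severity": "P1", "service": "api"}."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Classify the message and extract fields. Reply with JSON only: "
                '{"kind": "incident|question|bug", "severity": "P1|P2|P3|null", '
                '"service": "string|null"}'
            )},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```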

Classic Pitfalls to Avoid

This is where most agents “work in a demo” but break in production.

Pitfall 1: synchronization loops

Example:

  • GitLab sends “issue updated” → the agent updates the ticket → the ticket sends “ticket updated” → the agent updates GitLab… and it loops.

Solutions

  • Add a field like updated_by=agent or a tag like automation=true.
  • Filter incoming events that carry your signature.
  • Use rules like “Slack only updates the ticket,” “GitLab only updates the dev field,” etc. (see the filter sketch below).
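
A minimal sketch of the signature filter; the field names and bot username are assumptions:

```python
# Break sync loops: tag the agent's own writes, then filter them on the way in.
# Field names and the bot username are illustrative.
AGENT_LABEL = "automation=true"
AGENT_USERNAME = "ops-agent-bot"


def is_own_echo(payload: dict) -> bool:
    """True if this incoming webhook was caused by one of the agent's own writes."""
    labels = payload.get("labels", [])
    author = payload.get("user", {}).get("username", "")
    return AGENT_LABEL in labels or author == AGENT_USERNAME


def should_process(payload: dict) -> bool:
    # Ignore our own echoes instead of syncing them back to the other tools.
    return not is_own_echo(payload)
```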

Pitfall 2: duplicates (repeated events, retries, webhooks)

Webhooks can be delivered multiple times. Workers retry. Slack can resend events.

Solutions

  • Deduplicate by event_id (store them for 7 to 30 days), as in the Redis sketch below.
  • Business-level dedupe: “if the GitLab comment already exists, don’t repost it.”
  • entity_map is mandatory: if the incident exists, update it, don’t recreate it.
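
A sketch of event-id deduplication with Redis; the key prefix and the 30-day TTL are assumptions:

```python
# Event-id deduplication with a TTL, using Redis SET NX EX.
# The key prefix and the 30-day window are assumptions.
import redis

r = redis.Redis()


def seen_before(event_id: str, ttl_days: int = 30) -> bool:
    """Atomically record the event_id; True means it was already processed."""
    was_set = r.set(f"event:{event_id}", 1, nx=True, ex=ttl_days * 86400)
    return not was_set  # set() returns None when the key already exists
```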

Pitfall 3: idempotency (what separates a POC from a product)

An idempotent action can be replayed N times without unwanted side effects.

Concrete examples

  • “Create a Notion page” isn’t idempotent by default (replaying it creates a duplicate page).
  • “Update a Notion page property” can be idempotent if you target the ID.
  • “Post a Slack message” isn’t idempotent unless you use an “upsert” approach (store the ts and edit).

Idempotency patterns

  • Upsert: “if it exists, update; otherwise create” (see the Slack sketch below).
  • Idempotency key: action_key = correlation_id + action_type + target.
  • Compare-and-set: only modify if the target value is different.
  • Outbox pattern: store the intent before executing, then mark it as “done.”
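
Here is a sketch combining an idempotency key with the Slack “upsert” approach from the examples above; the in-memory dict stands in for a real store kept next to the entity_map:

```python
# Idempotency key + Slack "upsert": store the message ts under an action_key,
# edit on replay instead of reposting. The dict stands in for a real store
# (keep it next to the entity_map in Postgres/Redis).
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
_posted: dict[str, str] = {}  # action_key -> Slack message ts


def upsert_status_message(correlation_id: str, channel: str, text: str) -> None:
    action_key = f"{correlation_id}:status_message:{channel}"
    ts = _posted.get(action_key)
    if ts:
        client.chat_update(channel=channel, ts=ts, text=text)       # replayed: edit in place
    else:
        resp = client.chat_postMessage(channel=channel, text=text)  # first run: create
        _posted[action_key] = resp["ts"]
```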

Pitfall 4: event ordering

Depending on latency and retries, GitLab can deliver “pipeline failed” after “MR merged,” even if that’s not the order in which the events happened.

Solutions

  • Process events asynchronously and compute the final state.
  • Store a last_seen_at per source.
  • Recompute status from the global state, not only from the latest event (illustrated below).
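
A sketch of “recompute from global state”; the flags and statuses are illustrative:

```python
# Recompute the status from the full known state instead of reacting to the
# order in which events happened to arrive. Flags and statuses are illustrative.
from dataclasses import dataclass


@dataclass
class IncidentState:
    mr_merged: bool = False
    pipeline_green: bool = False
    qa_done: bool = False


def derive_status(state: IncidentState) -> str:
    if state.qa_done:
        return "Done"
    if state.mr_merged and state.pipeline_green:
        return "Ready for QA"
    if state.mr_merged:
        return "Waiting for pipeline"
    return "In progress"
```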

“From 0 to 1” Checklist to Ship a Reliable Agent

Solid MVP (aim for this)

  • Slack trigger (command or reaction)
  • Create ticket + Notion page + GitLab issue
  • Cross-links everywhere
  • Single Slack thread for updates
  • Mapping table (correlation_id) + execution logs
  • Deduplication + idempotency keys

V2 (the game changer)

  • Automatic summaries (Slack thread → Notion)
  • Controlled two-way sync (no loops)
  • Playbooks/runbooks by incident type
  • Observability: metrics, alerts, dead-letter queue

Conclusion

Building an Ops agent is mostly about building a reliable synchronization system across tools. If you take loops, duplicates, and idempotency seriously from day one, you’ll avoid 80% of “phantom updates” and prematurely closed tickets.

At Fenxi Technologies, we often see the same path: a POC in 2 days, then 2 months to make the system safe and maintainable. The good news is that with a simple event-driven architecture, a mapping table, and idempotent actions, you can go from 0 to 1 without shooting yourself in the foot.
