
This post shows a practical, easy-to-understand architecture for a chat-based food ordering system. It uses an API gateway, microservices, and Kafka as the event backbone. The design focuses on a smooth chat experience and strong correctness for payments and rider allocation. I explain the flow, the hard engineering decisions, and the specific measures we used to avoid double bookings and data loss.
Why chat-based orders are hard
Ordering by chat looks simple to the user. Behind the scenes, the system must create an order, take payment, find a rider, track the delivery and keep the restaurant in the loop. All these pieces must work together even when parts fail. If you do not get the data flows and concurrency right, you end up with double charges, lost events and angry customers.
The architecture in one picture
- Clients: Mobile chat, web app and rider app. All requests go through the API gateway.
- API gateway: Handles authentication, TLS and basic rate control. It is the single entry point for clients.
- Core services: User and auth, Restaurant and menu, Order, Payment, Delivery and Notification. Each service owns its data.
- Kafka event bus: All important events travel through Kafka. OrderPlaced, PaymentConfirmed, and DriverAssigned are events that other services can consume.
- Data and cache: Postgres for transactions, Redis for sessions and hot cache, Elasticsearch for search, and a geo index for location. Static files live on object storage and are served via a CDN.
- AWS as glue: AWS services are used for notifications and serverless tasks where they fit. Kafka remains the reliable event backbone.
A short story of one order
- The user types an order in chat. The app calls the API gateway, which forwards the request to the Order service.
- Order service writes the order in Postgres and writes an outbox row in the same database transaction. This keeps the event safe.
- An outbox publisher picks up the row and publishes OrderPlaced to Kafka. It then marks the outbox row as published.
- Payment service consumes the event and calls the payment gateway. When the gateway's webhook arrives, it writes the payment result and publishes PaymentConfirmed.
- Delivery service consumes PaymentConfirmed. It finds nearby riders and performs an atomic claim on the chosen rider. On success, it publishes DriverAssigned.
- Rider app streams GPS to Kafka, and Notification service picks up events and sends messages to users and restaurants.
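The outbox step in this flow can be sketched in a few lines. This is a minimal illustration, not our production code: it uses Python's built-in sqlite3 in place of Postgres, and the table layout, the function names (init_outbox_db, place_order, publish_pending) and the send callback that stands in for a Kafka producer are all assumptions made for the example.

```python
import json
import sqlite3


def init_outbox_db(conn):
    # Hypothetical schema: one orders table and one outbox table in the same database.
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT, items TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT, msg_key TEXT, payload TEXT, published INTEGER DEFAULT 0)"
    )


def place_order(conn, order_id, user_id, items):
    # 'with conn' wraps both inserts in one transaction:
    # either the order and its event are both stored, or neither is.
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (order_id, user_id, json.dumps(items)),
        )
        event = {"type": "OrderPlaced", "order_id": order_id}
        conn.execute(
            "INSERT INTO outbox (topic, msg_key, payload) VALUES (?, ?, ?)",
            ("orders", order_id, json.dumps(event)),
        )


def publish_pending(conn, send):
    # 'send' stands in for a Kafka producer call (an assumption for this sketch).
    rows = conn.execute(
        "SELECT id, topic, msg_key, payload FROM outbox "
        "WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, key, payload in rows:
        send(topic, key, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the publisher crashes after sending but before marking the row, the event is sent again on the next run. The outbox therefore gives at-least-once delivery, which is why consumers also need to be idempotent.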
How we keep concurrency under control
- Idempotency keys: Every request and every external webhook carries a unique ID. Services store processed IDs to avoid doing the same work twice. This prevents duplicate charges and duplicate orders.
- Kafka partitioning and sequence numbers: Events for the same order use the same partition key. Each event carries a sequence number so consumers can ignore older events that arrive later.
- Outbox pattern: Write the order and the outbox record inside a single database transaction. A separate publisher sends the outbox to Kafka. This avoids losing events if the process dies after the database write.
- Optimistic concurrency: For high-throughput operations where locks hurt performance, we use version checks. A write succeeds only if the version matches; if it does not, we reload and retry. This is much cheaper than holding locks for a long time.
- Atomic driver claim: Assignment updates the driver row only if the status is available. If the update affects zero rows, we move on to the next candidate. This is simple and reliable.
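The atomic driver claim and the optimistic version check are both conditional updates, so one short sketch can show them together. It again uses sqlite3 as a stand-in for Postgres; the schema, the status values and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical tables for the sketch.
    conn.execute("CREATE TABLE drivers (id TEXT PRIMARY KEY, status TEXT, order_id TEXT)")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, version INTEGER)")


def claim_driver(conn, driver_id, order_id):
    # The WHERE clause makes the claim atomic: only one caller can flip
    # the row from 'available' to 'assigned'.
    with conn:
        cur = conn.execute(
            "UPDATE drivers SET status = 'assigned', order_id = ? "
            "WHERE id = ? AND status = 'available'",
            (order_id, driver_id),
        )
    return cur.rowcount == 1


def assign_first_available(conn, candidates, order_id):
    # Zero rows affected means someone else got the driver; try the next one.
    for driver_id in candidates:
        if claim_driver(conn, driver_id, order_id):
            return driver_id
    return None


def update_order_status(conn, order_id, new_status, expected_version):
    # Optimistic concurrency: the write applies only if nobody changed
    # the row since we read 'expected_version'.
    with conn:
        cur = conn.execute(
            "UPDATE orders SET status = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_status, order_id, expected_version),
        )
    return cur.rowcount == 1
```

When update_order_status returns False, the caller would reload the row, read the new version and decide whether to retry.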
How we recover when things go wrong
- Durable messaging and DLQ: Kafka stores events durably. For AWS queues, we use dead-letter queues (DLQs). When a message fails repeatedly, it lands in the DLQ for inspection.
- Idempotent consumers: Consumers write processed IDs. If a message is replayed, the consumer detects prior processing and skips side effects.
- Saga orchestration: Checkout is split across payment and delivery. We use a saga to track progress and run compensations when needed, for example refund or release driver.
- Reconciliation jobs: Periodic checks compare payments to the ledger, Kafka to the database and driver assignments to active deliveries. Issues are surfaced in an ops UI.
- Manual safe replay: The ops UI lets support inspect a DLQ message, fix the cause, and replay the message using the same idempotency key. Because the key is unchanged, a replay cannot create a duplicate.
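An idempotent consumer, which is what makes replay safe, can be sketched as follows. The example assumes the side effect is a write to the consumer's own database, so the processed ID and the result can share one transaction. The table names and the handle_once function are hypothetical, and sqlite3 stands in for Postgres.

```python
import sqlite3


def init_consumer_db(conn):
    # Hypothetical schema: the primary key on message_id is the duplicate detector.
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE payments (order_id TEXT PRIMARY KEY, status TEXT)")


def handle_once(conn, message_id, order_id, status):
    """Apply the message once; return False if it was already processed."""
    try:
        # Recording the ID and writing the result commit together,
        # so a crash cannot leave one without the other.
        with conn:
            conn.execute("INSERT INTO processed VALUES (?)", (message_id,))
            conn.execute(
                "INSERT OR REPLACE INTO payments VALUES (?, ?)",
                (order_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        # The ID is already stored: this is a replay, skip side effects.
        return False
```

Side effects outside the database, such as a call to the payment gateway, cannot join this transaction. For those, the usual approach is to pass the same idempotency key to the external system so that it can reject the repeat itself.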
Practical tips for engineers
- Keep the idempotency store small and fast. Use DynamoDB or a well-indexed Postgres table.
- Use short TTLs for rider reservations and require a heartbeat from the rider app. If the heartbeat stops, release the rider.
- Avoid cross-service transactions. Use local transactions and then events for eventual consistency.
- Add a trace ID to every request and carry it across events for easy debugging.
- Build a small ops UI for DLQ inspection and safe message replay.
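The reservation tip is easier to reason about with a small model. The class below is an in-memory sketch of a TTL reservation that a heartbeat keeps alive. The class name, the 30-second default and the injectable clock are assumptions for the example. In a real deployment this state would more likely live in a shared store with key expiry, such as Redis, rather than in one process.

```python
import time


class RiderReservations:
    """In-memory model of short-lived rider reservations (illustrative only)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._held = {}      # rider_id -> (order_id, expiry time)

    def reserve(self, rider_id, order_id):
        # A rider can be reserved only if no live reservation exists.
        self.release_expired()
        if rider_id in self._held:
            return False
        self._held[rider_id] = (order_id, self._clock() + self._ttl)
        return True

    def heartbeat(self, rider_id):
        # Each heartbeat pushes the expiry forward by one TTL.
        self.release_expired()
        held = self._held.get(rider_id)
        if held is None:
            return False
        self._held[rider_id] = (held[0], self._clock() + self._ttl)
        return True

    def release_expired(self):
        # Riders whose heartbeat stopped become available again.
        now = self._clock()
        expired = [r for r, (_, exp) in self._held.items() if exp <= now]
        for rider_id in expired:
            del self._held[rider_id]
        return expired
```

The TTL should be a small multiple of the heartbeat interval, so that one lost heartbeat does not release a rider who is still online.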
Alerts and checks you must add now
- Track Kafka consumer lag per partition and alert on spikes.
- Track DLQ counts and alert on new entries.
- Find sagas stuck in progress and alert once they pass a time threshold.
- Find orders with payment confirmed but no driver assigned and alert once they pass a time threshold.
- Track the outbox publisher error rate and alert when it rises.
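Most of these checks reduce to "find records that have sat in one state for too long". Here is a minimal sketch of the payment-confirmed-but-no-driver check. The OrderState fields and the function name are hypothetical, and the threshold is whatever your delivery promise allows.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class OrderState:
    # Hypothetical view of an order, as a reconciliation job might load it.
    order_id: str
    payment_confirmed_at: Optional[datetime]
    driver_assigned_at: Optional[datetime]


def orders_missing_driver(
    orders: List[OrderState], now: datetime, threshold: timedelta
) -> List[str]:
    """Return IDs of paid orders that have waited longer than 'threshold' for a driver."""
    return [
        o.order_id
        for o in orders
        if o.payment_confirmed_at is not None
        and o.driver_assigned_at is None
        and now - o.payment_confirmed_at > threshold
    ]
```

A scheduled job could run this against recent orders and raise an alert whenever the returned list is not empty.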
Final thought
Build systems that behave predictably and remain easy to operate. Use simple patterns that scale. Events decouple services. Atomic local transactions protect critical steps. Idempotency and the outbox pattern take care of duplicate and lost events, two of the hardest problems in distributed systems. Add tracing and reconciliation, and you will sleep more easily.




