Aila is a fully AI-driven content engine — from caption to reel to publish. Here's a look at the architecture behind it.
Aila isn't a thin wrapper around a chatbot API. It's a multi-layered AI system — purpose-built, client-specific, and continuously evolving — architected and developed using agentic coding tools including Claude Code. Every layer calls AI, and the layers build on each other.
The core insight driving the architecture: social media content creation is a multi-step reasoning and production problem, not a single prompt. That requires an architecture that mirrors it — retrieval, generation, orchestration, video production, and publishing as distinct, composable layers that each do one thing well.
The foundation of Aila is a context engineering workflow — a form of Retrieval-Augmented Generation (RAG) — that grounds every piece of AI-generated content in the client's actual business reality before a single word is written.
Website & Media
Client inputs
Context Retrieval
Live web grounding
Prompt Templates
Per-platform, per-client
AI Model
Gemini · GPT
Platform Captions
FB · IG · YT · TikTok · X
Rather than prompting a model with a generic request, the system first retrieves relevant context: the client's website, current services, and the media input (photo or video) for the post. This retrieved context grounds the output in facts — the business's real voice, real offers, and real content. The result is a caption that sounds like the business, not like a chatbot.
Each client has platform-specific prompt templates engineered to produce the best possible caption for their audience and tone. Templates encode the structure, constraints, and style of each platform — what works on Facebook reads differently than what works on TikTok. Engagement data and client feedback refine the templates over time.
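In code, the Layer 1 flow can be sketched roughly like this — a minimal stdlib sketch, where the template text, function names, and the stubbed context retrieval are illustrative assumptions, not Aila's actual implementation:

```python
# Hypothetical sketch of the Layer 1 caption flow: retrieve client context,
# fill a per-platform prompt template, then send the prompt to a model.
# Templates and the retrieval stub are illustrative, not production code.

PLATFORM_TEMPLATES = {
    "facebook": (
        "You are writing a Facebook post for {business}.\n"
        "Grounding context:\n{context}\n"
        "Media description: {media}\n"
        "Write a warm, conversational caption in the business's voice."
    ),
    "tiktok": (
        "You are writing a TikTok caption for {business}.\n"
        "Grounding context:\n{context}\n"
        "Media description: {media}\n"
        "Write a short, punchy caption with a hook in the first line."
    ),
}

def retrieve_context(business: str) -> str:
    # Stand-in for live web grounding (site scrape + web search results).
    return f"{business} offers seasonal packages and mobile service."

def build_prompt(platform: str, business: str, media: str) -> str:
    context = retrieve_context(business)
    return PLATFORM_TEMPLATES[platform].format(
        business=business, context=context, media=media
    )

prompt = build_prompt("tiktok", "Shine Auto Spa", "before/after of a ceramic coat")
# `prompt` would then go to Gemini or GPT for caption generation.
```

The key design point: the model never sees a bare request. Every prompt is assembled from retrieved context plus a platform-specific template, so grounding happens before generation.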
Layer 1 generates captions for every platform — but the Facebook post is always the first output, and a deliberate one. Before any automated video pipeline runs, the client sees the Facebook post, reviews it, and implicitly approves it by posting. This is the human-in-the-loop checkpoint: a moment of visibility and control built directly into the architecture. Client feedback at this stage — what resonated, what didn't — loops back into prompt refinements.
Once a Layer 1-generated Facebook post is approved and goes live, it becomes the trigger for Layer 2: a LangGraph-based agentic pipeline that transforms it into short-form video content for Instagram Reels, YouTube Shorts, TikTok, and X — automatically, end to end.
Facebook Post
← Layer 1 output
AI Agent Selects
LangGraph node
Build Reel
Photos or video
Generate Script
Calls Layer 1
Voice Narration
See below
Auto-Publish
IG · YT · TikTok · X
The pipeline is structured as a directed graph: some nodes are automated execution steps, others are LLM reasoning nodes where a language model applies judgment at a specific decision point, such as selecting the right Facebook post from recent activity or refining narration text so numbers and phrasing read naturally when spoken aloud. The graph as a whole runs autonomously end to end, and each client has their own purpose-built graph.
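The directed-graph structure can be sketched in plain Python. The production pipeline uses LangGraph; the node names here mirror the diagram, but the state keys and the stubbed node logic are assumptions for illustration:

```python
# Stdlib sketch of the Layer 2 pipeline as a directed graph of nodes.
# Nodes read and update a shared state dict; LLM reasoning nodes are
# stubbed with trivial logic here.

def select_post(state):
    # LLM reasoning node in the real graph: choose the right recent post.
    state["post"] = state["recent_posts"][0]  # stub: take the newest
    return state

def generate_script(state):
    # Calls back into Layer 1, so narration shares the caption's grounding.
    state["script"] = f"Narration for: {state['post']}"
    return state

def synthesize_voice(state):
    state["audio"] = f"cloned-voice({state['script']})"  # stub
    return state

def build_reel(state):
    state["reel"] = ("9:16", state["audio"])  # stub for video assembly
    return state

def publish(state):
    state["published"] = ["instagram", "youtube", "tiktok", "x"]
    return state

NODES = {
    "select_post": select_post,
    "generate_script": generate_script,
    "synthesize_voice": synthesize_voice,
    "build_reel": build_reel,
    "publish": publish,
}
# Edges define the directed graph; the terminal node maps to None.
EDGES = {
    "select_post": "generate_script",
    "generate_script": "synthesize_voice",
    "synthesize_voice": "build_reel",
    "build_reel": "publish",
    "publish": None,
}

def run(state, node="select_post"):
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node]
    return state

result = run({"recent_posts": ["New spring special is live!"]})
```

Separating edges from node logic is what makes the graph composable: a client-specific pipeline is just a different node map and edge table over the same runner.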
The Facebook post triggering Layer 2 was itself generated by Layer 1. The reel narration script is also generated by calling back into Layer 1 — so the caption, the Facebook post, and the reel narration all share the same grounding context. The content is consistent across every platform because it originates from the same intelligence layer. Facebook acts as the control and observability point between the two.
Clients consistently pushed back on AI-generated voices — even high-quality ones — because they wanted the content to sound like them. To solve this, a custom voice cloning service was built using Qwen-3-TTS, an open-source voice synthesis model from Alibaba's Qwen team.
The service runs as a Dockerized FastAPI microservice — containerized to run cleanly and independently, and callable over HTTP by any pipeline node. Containerization also isolates the heavy AI model dependencies (HuggingFace Transformers) from the rest of the environment, keeping the host system clean. The client provides a short voice recording; the model clones their voice and synthesizes natural-sounding narration for each reel. The result: narration that sounds like the owner speaking directly to their audience.
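A containerized service of this shape might be defined with a Dockerfile along these lines — a hypothetical sketch, where the base image, package list, and module path `app.main:app` are assumptions, not the production configuration:

```dockerfile
# Hypothetical Dockerfile for a voice-clone microservice of this shape.
FROM python:3.11-slim
WORKDIR /app
# Heavy model dependencies live inside the container, off the host system.
RUN pip install --no-cache-dir fastapi uvicorn transformers torch
COPY app/ ./app/
EXPOSE 8000
# Any pipeline node can then call the service over plain HTTP.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```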
Privacy by Architecture
Built on an open-source model running on our own infrastructure. Client voice samples are never sent to a third-party API for processing or training — a deliberate choice over SaaS voice cloning alternatives.
Human Feedback at the Output Layer
Clients listen, react, and suggest refinements. AI handles the scale; client feedback guides the quality. The human-in-the-loop isn't just at the Facebook review stage — it's an ongoing part of how the system gets better.
Every tool chosen for a specific reason.
LangGraph
Agentic pipeline orchestration
Claude Code
Agentic development tool
Claude (Anthropic)
In-pipeline AI decisions
Google Gemini
Content generation (RAG)
Tavily
Live web context retrieval
Qwen-3-TTS
Open-source voice cloning
ElevenLabs
AI voice synthesis
FastAPI + Docker
Voice clone microservice
MoviePy
9:16 vertical reel production
Facebook Graph API
Post retrieval & publishing
Instagram Graph API
Reel publishing
YouTube Data API
Shorts publishing
Python
Core pipeline language
Next.js + Vercel
This website & CI/CD
Every client gets a purpose-built pipeline. Let's talk about what that looks like for yours.