Full Stack AI Engineering Platform

This repository is a showcase and evolving codebase for building and orchestrating AI systems from the ground up, designed by a Full Stack AI Engineer with end-to-end expertise across mathematics, data engineering, software development, and Kubernetes-based infrastructure.

🎯 Purpose

This project demonstrates a complete, production-grade architecture to:

  • Operate an LLM cluster using Ollama models.
  • Harness GPU acceleration using the NVIDIA GPU Operator on Kubernetes.
  • Use Custom Resource Definitions (CRDs) and controllers to coordinate model behaviors.
  • Create an ecosystem where multiple models can collaborate to perform higher-level tasks (question answering, summarization, classification, etc.).
  • Establish infrastructure-as-code patterns using Kustomize, Flux, and GitOps principles.

🧠 Vision

AI systems are rarely "one model fits all." This project introduces a framework where specialized AI agents (LLMs), hosted as services across a Kubernetes cluster, can interoperate to complete sophisticated tasks.

Inspired by:

  • Full-stack software engineering principles
  • Multi-agent systems
  • MLOps best practices
  • Declarative infrastructure management

🔧 Architecture Overview

πŸ“ Repository Structure

.
├── infra/
│   ├── cluster-iac/             # Infrastructure as Code (Terraform) for deploying an EKS cluster with requisite GPU support
│   ├── base/                    # Base Kustomize configurations (Flux, GPU Operator, etc.)
│   ├── overlays/                # Cluster-specific configurations
│   ├── flux/                    # Flux GitOps setup
│   └── monitoring/              # Prometheus/Grafana, if used
│
├── crds/                        # Custom Resource Definitions (YAML) and Go types
│   ├── ollamaagent_crd.yaml     # Defines OllamaAgent behavior/contract
│   ├── ollamamodeldefinition_crd.yaml
│   └── taskorchestration_crd.yaml
│
├── controllers/                 # Golang operators/controllers (kubebuilder-based)
│   ├── ollamaagent_controller.go
│   ├── ollamamodeldefinition_controller.go
│   └── taskorchestration_controller.go
│
├── ollama-operators/            # Model server orchestration logic
│   ├── agent-specialization/    # Specialized agent roles (Q&A, summarizer, etc.)
│   ├── service-deployments/     # Helm or Kustomize configs for model deployments
│   └── collab-logic/            # Logic for inter-agent communication & orchestration
│
├── data/                        # Data pipeline logic (ETL, tokenization, chunking, etc.)
│   ├── etl-pipeline/
│   └── example-datasets/
│
├── api/                         # API gateway and backend logic (Go or Python)
│   ├── routes/                  # Task submission endpoints
│   └── orchestration/           # Converts user requests into CRs for processing
│
├── examples/                    # Example workflows and scenarios
│   ├── question-answering/
│   ├── summarization-pipeline/
│   └── multi-model-chat/
│
├── docs/                        # Architecture diagrams and documentation
│   ├── architecture.md
│   ├── ollama-crd-spec.md
│   └── orchestration-diagram.png
│
└── README.md

πŸ—οΈ Infrastructure Stack

| Layer               | Technology                          | Purpose                                       |
|---------------------|-------------------------------------|-----------------------------------------------|
| Container Runtime   | containerd                          | Lightweight, Kubernetes-native runtime        |
| GPU Provisioning    | NVIDIA GPU Operator                 | Automatically manages GPU drivers + toolkit   |
| GitOps              | Flux                                | Declarative and auditable infra delivery      |
| K8s Package Manager | Kustomize + Helm                    | Infra and app lifecycle management            |
| Model Hosting       | Ollama (on GPU nodes)               | LLM serving engine                            |
| Task Coordination   | Custom Resource Definitions (CRDs)  | Define and manage complex task orchestration  |
| Monitoring          | Prometheus + Grafana (optional)     | Cluster and model performance observability   |

🧩 Custom Resource Definitions (CRDs)

The system uses Kubernetes CRDs to implement the A2A (Agent-to-Agent) protocol, enabling seamless communication between AI agents. Our CRDs define both the agent deployment and task orchestration aspects of the system.

OllamaAgent

A CRD for deploying and managing individual model agents that implement the A2A protocol.

apiVersion: ai.stack/v1alpha1
kind: OllamaAgent
metadata:
  name: summarizer-agent
spec:
  # Reference to the OllamaModelDefinition
  modelDefinition:
    name: summarizer-model
    version: "1.0.0"

  # Core agent configuration
  role: summarizer

  # A2A protocol implementation
  agentCard:
    capabilities:
      - summarization
      - text-analysis
    endpoint: "/api/v1/agent"
    authentication:
      type: "bearer"

  # Resource requirements
  resources:
    gpu: 1
    memory: "8Gi"
    cpu: "2"

  # A2A server configuration
  server:
    streaming: true
    pushNotifications: true
    webhookConfig:
      retryPolicy: exponential
      maxRetries: 3

  # Model-specific settings
  modelConfig:
    temperature: 0.7
    contextWindow: 4096
    responseFormat: "json"
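
For orientation, here is a minimal sketch of what the corresponding Go types might look like under kubebuilder conventions. This is illustrative only: the field names mirror the YAML above, not the repository's actual code.

// Illustrative Go types for the OllamaAgent CRD (kubebuilder style).
// Field names mirror the YAML example; they are assumptions, not the
// repository's actual definitions.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ModelDefinitionRef points at the OllamaModelDefinition to serve.
type ModelDefinitionRef struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

// AgentCard holds the A2A capabilities the agent advertises.
type AgentCard struct {
	Capabilities []string `json:"capabilities"`
	Endpoint     string   `json:"endpoint"`
}

// OllamaAgentSpec mirrors the spec section of the YAML example.
type OllamaAgentSpec struct {
	ModelDefinition ModelDefinitionRef `json:"modelDefinition"`
	Role            string             `json:"role"`
	AgentCard       AgentCard          `json:"agentCard"`
}

// OllamaAgent is the Schema for the ollamaagents API.
type OllamaAgent struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              OllamaAgentSpec `json:"spec,omitempty"`
}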

OllamaModelDefinition

A CRD that defines how to build a custom Ollama model with specific capabilities and behaviors. When created, it triggers the build process within the cluster.

apiVersion: ai.stack/v1alpha1
kind: OllamaModelDefinition
metadata:
  name: summarizer-model
spec:
  # Base model configuration
  from: llama2

  # Model build parameters
  build:
    # System prompt defining agent behavior
    system: |
      You are a specialized summarization agent that excels at:
      1. Extracting key information from documents
      2. Creating concise summaries
      3. Identifying main themes and topics

    # Parameters for model behavior
    parameters:
      temperature: 0.7
      contextWindow: 4096
      responseFormat: json

    # Model adaptation and fine-tuning
    template: |
      {{ if .System }}{{.System}}{{ end }}

      Context: {{.Input}}

      Instructions: Create a summary that includes:
      - Main points
      - Key findings
      - Action items

      Response format:
      {{.ResponseFormat}}

    # Custom function definitions
    functions:
      - name: extract_key_points
        description: "Extract main points from the text"
        parameters:
          type: object
          properties:
            main_points:
              type: array
              items:
                type: string
            themes:
              type: array
              items:
                type: string

    # Model tags for versioning and identification
    tags:
      version: "1.0.0"
      type: "summarizer"
      capabilities: ["text-analysis", "summarization"]

    # Resource requirements for build process
    buildResources:
      gpu: 1
      memory: "16Gi"
      cpu: "4"

status:
  phase: Building # Building, Complete, Failed
  buildStartTime: "2025-04-23T13:30:00Z"
  lastBuildTime: "2025-04-23T13:35:00Z"
  modelHash: "sha256:abc123..."
  conditions:
    - type: Built
      status: "True"
      reason: "BuildSucceeded"
      message: "Model successfully built and registered"
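
To make the build step concrete, the following sketch shows how a controller might render such a definition into an Ollama Modelfile before running ollama create. The FROM, SYSTEM, PARAMETER, and TEMPLATE directives are standard Modelfile syntax; the Go types and the mapping of contextWindow to num_ctx are assumptions for illustration.

// Renders an OllamaModelDefinition build spec into an Ollama Modelfile.
// Sketch only; BuildSpec is assumed, not the repository's actual type.
package main

import (
	"fmt"
	"strings"
)

type BuildSpec struct {
	From          string
	System        string
	Temperature   float64
	ContextWindow int
	Template      string
}

func renderModelfile(b BuildSpec) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "FROM %s\n", b.From)
	fmt.Fprintf(&sb, "SYSTEM \"\"\"%s\"\"\"\n", strings.TrimSpace(b.System))
	fmt.Fprintf(&sb, "PARAMETER temperature %g\n", b.Temperature)
	fmt.Fprintf(&sb, "PARAMETER num_ctx %d\n", b.ContextWindow) // contextWindow -> num_ctx
	fmt.Fprintf(&sb, "TEMPLATE \"\"\"%s\"\"\"\n", strings.TrimSpace(b.Template))
	return sb.String()
}

func main() {
	fmt.Print(renderModelfile(BuildSpec{
		From:          "llama2",
		System:        "You are a specialized summarization agent...",
		Temperature:   0.7,
		ContextWindow: 4096,
		Template:      "{{ if .System }}{{ .System }}{{ end }}\n\nContext: {{ .Input }}",
	}))
}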

TaskOrchestration

A CRD that manages complex task workflows between multiple agents.

apiVersion: ai.stack/v1alpha1
kind: TaskOrchestration
metadata:
  name: document-analysis
spec:
  # Task definition
  input:
    text: "Analyze and summarize this document"
    format: "text/plain"

  # A2A task workflow
  pipeline:
    - name: document-analyzer
      agentRef: analyzer-agent
      timeout: "5m"
      retries: 2
      artifacts:
        - name: analysis-result
          type: "application/json"

    - name: summarizer
      agentRef: summarizer-agent
      dependsOn: ["document-analyzer"]
      inputFrom:
        - taskRef: document-analyzer
          artifactName: analysis-result

    - name: quality-check
      agentRef: qa-agent
      dependsOn: ["summarizer"]
      condition: "success"

  # A2A protocol settings
  communication:
    streaming: true
    pushNotifications:
      enabled: true
      endpoint: "http://callback-service/webhook"

  # Output configuration
  output:
    storage:
      type: "s3"
      bucket: "ai-results"
      prefix: "outputs/"
    format:
      - type: "application/json"
      - type: "text/markdown"

  # Error handling
  errorPolicy:
    maxRetries: 3
    backoffLimit: 600
    failureAction: "rollback"
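
The dependsOn fields define a directed acyclic graph over the pipeline steps. Here is a minimal sketch of how an orchestration controller might compute a valid execution order using Kahn's algorithm (illustrative only, not the repository's actual implementation):

// Returns pipeline step names in an order that respects dependsOn
// edges, using Kahn's algorithm. Illustrative sketch.
package main

import "fmt"

type Step struct {
	Name      string
	DependsOn []string
}

func topoOrder(steps []Step) ([]string, error) {
	indeg := map[string]int{}
	next := map[string][]string{} // dependency -> steps that wait on it
	for _, s := range steps {
		indeg[s.Name] += 0
		for _, d := range s.DependsOn {
			indeg[s.Name]++
			next[d] = append(next[d], s.Name)
		}
	}
	var queue, order []string
	for _, s := range steps {
		if indeg[s.Name] == 0 {
			queue = append(queue, s.Name)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range next[n] {
			if indeg[m]--; indeg[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(steps) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, _ := topoOrder([]Step{
		{Name: "document-analyzer"},
		{Name: "summarizer", DependsOn: []string{"document-analyzer"}},
		{Name: "quality-check", DependsOn: []string{"summarizer"}},
	})
	fmt.Println(order) // [document-analyzer summarizer quality-check]
}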

Controller Implementation

The controllers implement the A2A protocol's core functionality (a sketch of the discovery endpoint follows this list):

  1. Agent Discovery:

    • Automatically generates and manages .well-known/agent.json endpoints
    • Handles capability registration and updates
    • Manages agent metadata and health checks
  2. Task Management:

    • Implements A2A task lifecycle (submitted → working → completed/failed)
    • Handles streaming updates via Server-Sent Events (SSE)
    • Manages task artifacts and state transitions
  3. Communication:

    • Implements A2A message formats and parts
    • Handles both synchronous and streaming communication
    • Manages push notifications and webhooks
  4. Resource Orchestration:

    • GPU allocation and scheduling
    • Memory and compute resource management
    • Model loading and unloading
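
As promised above, here is a minimal sketch of the discovery endpoint: an HTTP handler serving a simplified agent card at /.well-known/agent.json. The real A2A agent-card schema defines more fields; the struct below is a deliberately trimmed assumption.

// Serves a simplified A2A agent card for discovery.
// Sketch only: the actual A2A agent-card schema has more fields.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type agentCard struct {
	Name         string   `json:"name"`
	Description  string   `json:"description"`
	URL          string   `json:"url"`
	Capabilities []string `json:"capabilities"`
}

func main() {
	card := agentCard{
		Name:         "summarizer-agent",
		Description:  "Summarization agent backed by an Ollama model",
		URL:          "http://summarizer-agent.default.svc/api/v1/agent",
		Capabilities: []string{"summarization", "text-analysis"},
	}
	http.HandleFunc("/.well-known/agent.json", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(card)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}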

πŸ” Development Setup

Development Environment

We provide a consistent development environment using VS Code Dev Containers. This ensures all developers have the same tools and versions.

  1. Prerequisites: Docker and VS Code with the Dev Containers extension installed.

  2. Getting Started:

    # Clone the repository
    git clone https://github.com/yourusername/fullStackOllama.git
    cd fullStackOllama
    
    # Open in VS Code
    code .
    
    # Click "Reopen in Container" when prompted
    # or use Command Palette (F1) -> "Remote-Containers: Reopen in Container"

The dev container includes:

  • All required development tools
  • Pre-configured pre-commit hooks
  • VS Code extensions for Terraform, Go, and Kubernetes
  • AWS and Kubernetes config mounting

Alternatively, if you prefer a local installation, install the tools listed in the setup instructions below.

Pre-commit Hooks

This repository uses pre-commit hooks to ensure code quality and consistency. The following checks are performed before each commit:

  1. General Checks

    • Trailing whitespace removal
    • End of file fixing
    • YAML syntax validation
    • Large file checks
    • Merge conflict detection
    • Private key detection
  2. Terraform Checks

    • Format validation (terraform fmt)
    • Configuration validation (terraform validate)
    • Documentation updates
    • Security scanning (Checkov)
    • Linting (TFLint)
  3. Go Code Checks

    • Format validation (go fmt)
    • Code analysis (go vet)
    • Comprehensive linting (golangci-lint)
  4. Custom Validations

    • CRD syntax and structure validation
    • Model definition validation
    • Kubernetes resource validation

Setup Instructions

  1. Install pre-commit:

    brew install pre-commit
  2. Install required tools:

    brew install terraform-docs tflint checkov
    go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
  3. Install the pre-commit hooks:

    pre-commit install
  4. (Optional) Run against all files:

    pre-commit run --all-files

Continuous Integration

The same checks are run in CI/CD pipelines to ensure consistency. See the GitHub Actions workflows for details.


πŸ—οΈ Model Build Process

GitOps Workflow

The model building process follows GitOps principles, ensuring that all changes are tracked, reviewed, and automatically deployed:

  1. Model Definition

    # models/summarizer/model.yaml
    apiVersion: ai.stack/v1alpha1
    kind: OllamaModelDefinition
    metadata:
      name: summarizer-model
    spec:
      from: llama2
      build:
        system: |
          You are a specialized summarization agent...
  2. Pull Request Flow

    • Create branch: feature/add-summarizer-model
    • Add/modify model definition in models/ directory
    • Create PR with changes
    • Automated validation:
      • YAML syntax
      • Model definition schema
      • Resource requirements check
      • Security scanning
    • PR review and approval
    • Merge to main branch
  3. Flux Synchronization

    # infra/base/models/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - ../../models  # Watches the models directory
    • Flux detects changes in the models/ directory
    • Applies new/modified OllamaModelDefinition to the cluster
    • Triggers the build controller
  4. Build Process

    sequenceDiagram
      participant Flux
      participant API as API Server
      participant Ctrl as Build Controller
      participant Job as Build Job
      participant Registry

      Flux->>API: Apply OllamaModelDefinition
      API->>Ctrl: Notify new/modified definition
      Ctrl->>Job: Create build job
      Job->>Job: Execute ollama create
      Job->>Registry: Push built model
      Job->>API: Update status
      Ctrl->>API: Update conditions
  5. Build Controller Actions (a client-go Job sketch follows this list)

    • Creates a Kubernetes Job for building
    • Mounts required GPU resources
    • Executes ollama create with definition
    • Monitors build progress
    • Updates status conditions
    • Handles failures and retries
    • Registers successful builds
  6. Model Registration

    • Successful builds are registered in the cluster
    • Model becomes available for OllamaAgent instances
    • Version tracking and rollback support
    • Automatic cleanup of old versions
  7. Monitoring & Logs

    # Example build job logs
    2025-04-23T13:30:00Z [INFO] Starting build for summarizer-model
    2025-04-23T13:30:05Z [INFO] Downloading base model llama2
    2025-04-23T13:31:00Z [INFO] Applying model adaptations
    2025-04-23T13:32:00Z [INFO] Registering model summarizer-model:1.0.0
    2025-04-23T13:32:05Z [INFO] Build complete
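
As referenced in step 5, here is a minimal sketch of how the build controller might construct the build Job with client-go types. The image, Modelfile ConfigMap mount, and naming are assumptions; ollama create -f is the real CLI invocation.

// Constructs a Kubernetes Job that runs `ollama create` against a
// Modelfile mounted from a ConfigMap. Illustrative sketch; names and
// the ConfigMap layout are assumptions, not the repo's actual code.
package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func buildJob(modelName string) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: modelName + "-build"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "ollama-build",
						Image:   "ollama/ollama",
						Command: []string{"ollama", "create", modelName, "-f", "/build/Modelfile"},
						Resources: corev1.ResourceRequirements{
							Limits: corev1.ResourceList{
								"nvidia.com/gpu": resource.MustParse("1"), // schedule onto a GPU node
							},
						},
						VolumeMounts: []corev1.VolumeMount{{Name: "modelfile", MountPath: "/build"}},
					}},
					Volumes: []corev1.Volume{{
						Name: "modelfile",
						VolumeSource: corev1.VolumeSource{
							ConfigMap: &corev1.ConfigMapVolumeSource{
								LocalObjectReference: corev1.LocalObjectReference{Name: modelName + "-modelfile"},
							},
						},
					}},
				},
			},
		},
	}
}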

Security Considerations

  • All model definitions are version controlled
  • PR reviews ensure quality and security
  • Base models are pulled from trusted sources
  • Build jobs run in isolated environments
  • Resource limits are strictly enforced
  • Model provenance is tracked and verified

Resource Management

  • Build jobs are scheduled based on GPU availability
  • Parallel builds are supported with resource quotas
  • Failed builds are automatically cleaned up
  • Successful builds are cached for reuse
  • Version tags ensure reproducibility

🚀 Getting Started

Prerequisites

  • Kubernetes cluster with GPU-enabled nodes (AWS EKS, GKE, or bare-metal)
  • NVIDIA GPU Operator installed
  • kubectl, Kustomize, and Helm
  • Go (for controller development)

Deployment Steps

  1. Set up Infrastructure

    # Deploy the EKS cluster using Terraform
    cd infra/cluster-iac
    terraform init
    terraform apply

  2. Bootstrap Flux

The repository includes a bootstrap script to set up Flux with the correct configuration:

# Option 1: Using environment variable
export GITHUB_TOKEN=your_github_token
./scripts/bootstrap-flux.sh

# Option 2: Passing token directly
./scripts/bootstrap-flux.sh -t your_github_token

# Additional options available:
./scripts/bootstrap-flux.sh -h  # Show help

The bootstrap script will:

  • Install Flux CLI if not present
  • Clean up any existing Flux installation
  • Configure Flux with your GitHub repository
  • Set up monitoring and logging components
  • Verify the installation and show status
  3. Apply CRDs

    kubectl apply -f crds/

  4. Deploy example agents

    kubectl apply -f ollama-operators/service-deployments/

  5. Submit an orchestration task

    kubectl apply -f examples/question-answering/task.yaml
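
Tasks can also be submitted programmatically. Here is a minimal sketch using client-go's dynamic client, assuming the ai.stack/v1alpha1 group/version from the CRD examples above and the plural resource name taskorchestrations:

// Submits a TaskOrchestration CR via client-go's dynamic client.
// Sketch only; the GVR and spec fields follow the examples above.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{Group: "ai.stack", Version: "v1alpha1", Resource: "taskorchestrations"}
	task := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "ai.stack/v1alpha1",
		"kind":       "TaskOrchestration",
		"metadata":   map[string]interface{}{"name": "demo-qa-task"},
		"spec": map[string]interface{}{
			"input": map[string]interface{}{"text": "What is the capital of France?", "format": "text/plain"},
		},
	}}

	created, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), task, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created", created.GetName())
}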

📸 Diagrams

See docs/architecture.md and docs/orchestration-diagram.png for detailed system visuals.


🤝 Contributing

This project is a personal and professional showcase, but contributors are welcome! PRs, issues, and suggestions are encouraged.


📚 Learning Goals

This project is also a journey of exploration. Through it, we aim to learn and demonstrate:

  • GPU scheduling with Kubernetes
  • Multi-agent AI orchestration
  • Building CRDs and operators with Go
  • Best practices in GitOps and cloud-native ML
  • Open-source model hosting and scaling

📜 License

MIT License


🔗 Related Projects
