
CosmicAC Staging Server Deployment Guide

Comprehensive step-by-step instructions for deploying the full CosmicAC application stack on a staging server, from server setup through job creation.

Table of Contents

  1. Server Setup
  2. Node.js Environment Setup
  3. Caddy Web Server Setup
  4. Repository Setup
  5. PM2 Configuration
  6. Starting the Application Stack
  7. Autobase Connection
  8. Registering Things & Racks
  9. Creating Jobs
  10. Troubleshooting

1. Server Setup

Create the cosmicac User and Group

All application components will run under the cosmicac user account. Other team members can be added to the cosmicac group to manage PM2 and services.

# Create the cosmicac group
sudo groupadd cosmicac

# Create the user with home directory and add to cosmicac group
sudo useradd -m -s /bin/bash -g cosmicac cosmicac

# Set a password (optional, but recommended)
sudo passwd cosmicac

# Add to sudo group if needed for initial setup
sudo usermod -aG sudo cosmicac

Configure Sudoers for cosmicac Group

Create a sudoers file to allow members of the cosmicac group to run commands as the cosmicac user without a password. This enables PM2 management.

Create /etc/sudoers.d/cosmicac:

# Allow members of cosmicac group to run commands as cosmicac user
%cosmicac ALL=(cosmicac) NOPASSWD: ALL

# Allow members to switch to cosmicac user shell
%cosmicac ALL=(cosmicac) NOPASSWD: /bin/bash, /bin/sh

Apply the configuration:

# Create the sudoers file (must use visudo for safety)
sudo visudo -f /etc/sudoers.d/cosmicac

# Or create directly with proper permissions
echo '%cosmicac ALL=(cosmicac) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/cosmicac
sudo chmod 440 /etc/sudoers.d/cosmicac

# Validate sudoers syntax
sudo visudo -c

Add Team Members to cosmicac Group

# Add existing users to cosmicac group
sudo usermod -aG cosmicac <username>

# Verify group membership
groups <username>

Managing PM2 as Team Member

Once added to the cosmicac group, team members can manage PM2:

# Run PM2 commands as cosmicac user
sudo -u cosmicac pm2 status
sudo -u cosmicac pm2 logs
sudo -u cosmicac pm2 restart all

# Switch to cosmicac user shell (for multiple commands)
sudo -u cosmicac bash -l

Verify User Setup

whoami   # Should output: cosmicac
echo $HOME   # Should output: /home/cosmicac

Configure Git

Set up Git to use HTTPS instead of SSH/git protocols and enable credential caching:

# Create/update .gitconfig
cat > ~/.gitconfig << 'EOF'
[url "https://github.com/"]
    insteadOf = git@github.com:
[url "https://"]
    insteadOf = git://
[credential]
    helper = cache --timeout=3600
EOF

This configuration:

  • Redirects git@github.com: URLs to HTTPS (avoids SSH key requirements)
  • Redirects git:// protocol URLs to HTTPS
  • Caches credentials for 1 hour (3600 seconds) to avoid repeated prompts

Verify the configuration:

cat ~/.gitconfig
git config --list | grep -E "(url|credential)"

Rootless Docker Setup

Rootless Docker allows containers to run without root privileges, improving security.

System-Level Configuration (Run as root/sudo)

Step 1: Update system and install prerequisites

sudo apt-get update && sudo apt-get upgrade -y

sudo apt-get install -y \
  curl \
  ca-certificates \
  gnupg \
  lsb-release \
  uidmap \
  dbus-user-session \
  fuse-overlayfs \
  slirp4netns \
  systemd-container \
  iproute2 \
  iptables

Step 2: Add Docker's official GPG key

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

Step 3: Add Docker repository

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Step 4: Install Docker Engine

# Update apt with the new repository
sudo apt-get update

# Install Docker (includes dockerd-rootless-setuptool.sh)
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Verify installation
docker --version

Troubleshooting: If you get "Package docker-ce has no installation candidate":

  1. Check your distribution: lsb_release -cs
  2. Verify the repository was added: cat /etc/apt/sources.list.d/docker.list
  3. Make sure you ran apt-get update after adding the repository

Step 5: Configure system for rootless Docker

Create sysctl configuration file /etc/sysctl.d/99-rootless-docker.conf:

# Enable user namespaces for rootless Docker
kernel.unprivileged_userns_clone=1

# Allow unprivileged users to bind to ports >= 80
net.ipv4.ip_unprivileged_port_start=80

# Increase the number of inotify watches
fs.inotify.max_user_watches=524288
fs.inotify.max_user_instances=512

# Network settings for better container networking
net.ipv4.ip_forward=1
net.ipv4.conf.all.route_localnet=1

Apply the configuration:

# Apply sysctl settings
sudo sysctl --system

# Set up subordinate UIDs and GIDs for cosmicac user (for user namespace mapping)
# Check if already configured, add only if not present
grep -q "^cosmicac:" /etc/subuid || echo "cosmicac:100000:65536" | sudo tee -a /etc/subuid
grep -q "^cosmicac:" /etc/subgid || echo "cosmicac:100000:65536" | sudo tee -a /etc/subgid

# Verify the entries
cat /etc/subuid
cat /etc/subgid

# Enable lingering (allows user services to run without login)
sudo loginctl enable-linger cosmicac

# Create XDG_RUNTIME_DIR for cosmicac user (required for systemd user session)
COSMICAC_UID=$(id -u cosmicac)
sudo mkdir -p /run/user/${COSMICAC_UID}
sudo chown cosmicac:cosmicac /run/user/${COSMICAC_UID}
sudo chmod 700 /run/user/${COSMICAC_UID}

# Disable system Docker daemon (we'll use rootless instead)
sudo systemctl disable --now docker.service docker.socket

User-Level Configuration (Run as cosmicac)

Switch to cosmicac user with proper systemd environment:

# Get cosmicac UID
COSMICAC_UID=$(id -u cosmicac)

# Switch to cosmicac with proper systemd environment
sudo -u cosmicac \
  XDG_RUNTIME_DIR=/run/user/${COSMICAC_UID} \
  DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/${COSMICAC_UID}/bus \
  bash -l

Once logged in as cosmicac:

# Verify environment variables are set
echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"
echo "UID=$(id -u)"

# If XDG_RUNTIME_DIR is empty, set it manually
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus

# Run rootless Docker setup
dockerd-rootless-setuptool.sh install

# Create service override for proper networking
mkdir -p ~/.config/systemd/user/docker.service.d
cat > ~/.config/systemd/user/docker.service.d/override.conf << 'EOF'
[Service]
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_DISABLE_HOST_LOOPBACK=false"
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_NET=slirp4netns"
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=builtin"
EOF

# Create Docker daemon config
mkdir -p ~/.config/docker
cat > ~/.config/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "default-address-pools": [
    {
      "base": "172.17.0.0/16",
      "size": 24
    }
  ]
}
EOF

# Set environment variables (add to .bashrc)
cat >> ~/.bashrc << 'EOF'

# Rootless Docker configuration
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
export PATH=$HOME/bin:$PATH
EOF

# Source the environment
source ~/.bashrc

# Enable and start Docker for this user
systemctl --user enable docker
systemctl --user start docker

# Verify installation
docker --version
docker compose version

# Test Docker networking
docker run --rm alpine echo "Docker networking test successful!"

Verify Rootless Docker

# Check Docker daemon status
systemctl --user status docker

# Check Docker socket exists
ls -la /run/user/$(id -u)/docker.sock

# Test port binding
docker run --rm -d -p 8888:80 --name test-nginx nginx:alpine
sleep 2
curl -s http://localhost:8888 && echo "Port binding works!"
docker stop test-nginx

2. Node.js Environment Setup

All dependencies are installed at the user level (not system-wide) using NVM.

Install NVM (Node Version Manager)

# Download and install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash

# Reload shell configuration
source ~/.bashrc
# or
source ~/.profile

# Verify NVM installation
nvm --version

Install Node.js 20

# Install Node.js 20 LTS
nvm install 20

# Set Node 20 as default
nvm alias default 20

# Verify installation
node --version   # Should output: v20.x.x
npm --version    # Should output: 10.x.x

Install Global Packages (User Level)

# Install PM2 (process manager)
npm install -g pm2

# Install hp-rpc-cli (RPC command line tool)
npm install -g hp-rpc-cli

# Verify installations
pm2 --version
npx hp-rpc-cli --version

# Setup PM2 startup script (optional - for auto-restart on reboot)
pm2 startup
# Follow the instructions output by the command

Verify Environment

# Run this to confirm everything is set up correctly
echo "Node: $(node --version)"
echo "NPM: $(npm --version)"
echo "PM2: $(pm2 --version)"
echo "hp-rpc-cli: $(npx hp-rpc-cli --version 2>/dev/null || echo 'installed')"
echo "User: $(whoami)"
echo "Home: $HOME"

3. Caddy Web Server Setup

Caddy is used as a reverse proxy to route traffic to the application components.

Install Caddy (Run as root/sudo)

# Install Caddy via apt
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl

curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg

curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list

sudo apt update
sudo apt install caddy

# Verify installation
caddy version

Configure Tailscale for Caddy Certificates

To allow Caddy to obtain HTTPS certificates from Tailscale, add the following to /etc/default/tailscaled:

# Add Caddy certificate permission to Tailscale
echo 'TS_PERMIT_CERT_UID=caddy' | sudo tee -a /etc/default/tailscaled

# Restart Tailscale to apply changes
sudo systemctl restart tailscaled

This allows Caddy to automatically obtain and renew TLS certificates for your *.ts.net domain.

Configure Caddy

Create the Caddyfile at /etc/caddy/Caddyfile:

stg-cosmicac.tail8a2a3f.ts.net {
    # API routes -> app-node (port 3000)
    handle_path /api/* {
        reverse_proxy :3000
    }

    # Inference routes -> proxy-inference (port 8000) with streaming
    handle_path /inference/* {
        reverse_proxy :8000 {
            flush_interval -1
            transport http {
                read_buffer 0
                write_buffer 0
            }
        }
    }

    # Everything else -> UI (port 5173)
    reverse_proxy * :5173
}

Apply the configuration:

# Edit the Caddyfile
sudo nano /etc/caddy/Caddyfile

# Or create it directly
sudo tee /etc/caddy/Caddyfile << 'EOF'
stg-cosmicac.tail8a2a3f.ts.net {
    handle_path /api/* {
        reverse_proxy :3000
    }

    handle_path /inference/* {
        reverse_proxy :8000 {
            flush_interval -1
            transport http {
                read_buffer 0
                write_buffer 0
            }
        }
    }

    reverse_proxy * :5173
}
EOF

# Validate the configuration
sudo caddy validate --config /etc/caddy/Caddyfile

# Reload Caddy
sudo systemctl reload caddy

Caddy Service Management

# Start Caddy
sudo systemctl start caddy

# Enable Caddy to start on boot
sudo systemctl enable caddy

# Check status
sudo systemctl status caddy

# View logs
sudo journalctl -u caddy -f

# Reload after config changes
sudo systemctl reload caddy

Route Configuration Reference

| Route | Backend | Port | Description |
|---|---|---|---|
| /api/* | app-node | 3000 | API endpoints |
| /inference/* | proxy-inference | 8000 | Inference with streaming support |
| * (default) | cosmicac-ui | 5173 | Frontend UI |

Note: The flush_interval -1 and buffer settings on /inference/* enable real-time streaming for inference responses.


4. Repository Setup

Application Components (Execution Order)

The following components need to be deployed in this specific order:

| Order | Repository | Branch (Staging) | Branch (Current) | Description |
|---|---|---|---|---|
| 1 | cosmicac-wrk-ork | stg | dev | Orchestrator worker |
| 2 | cosmicac-app-node | stg | dev | Main application node |
| 3 | cosmicac-ui | stg | dev | User interface |
| 4 | cosmicac-wrk-server-k8s-nvidia | stg | dev | K8s NVIDIA server worker |
| 5 | cosmicac-proxy-inference | stg | dev | Inference proxy |
| 6 | tether-wrk-ext-sendgrid | stg | dev | SendGrid email service |

Note: The default branch for staging is stg, but we are currently using the dev branch.

Clone Repositories (Manual Step)

All repositories are cloned directly into the user's home directory /home/cosmicac:

cd ~

# Clone in execution order (using dev branch for now)
git clone -b dev https://github.com/tetherto/cosmicac-wrk-ork.git
git clone -b dev https://github.com/tetherto/cosmicac-app-node.git
git clone -b dev https://github.com/tetherto/cosmicac-ui.git
git clone -b dev https://github.com/tetherto/cosmicac-wrk-server-k8s-nvidia.git
git clone -b dev https://github.com/tetherto/cosmicac-proxy-inference.git
git clone -b dev https://github.com/tetherto/tether-wrk-ext-sendgrid.git

When switching to the staging branch, replace -b dev with -b stg in the commands above.

Automated Repository Setup

After cloning, run the setup automation script to install dependencies and configure each repository.

Setup Script: setup-repos.sh

Create this script in ~/setup-repos.sh:

#!/bin/bash
set -e

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Define repositories in execution order
REPOS=(
  "cosmicac-wrk-ork"
  "cosmicac-app-node"
  "cosmicac-ui"
  "cosmicac-wrk-server-k8s-nvidia"
  "cosmicac-proxy-inference"
  "tether-wrk-ext-sendgrid"
)

BASE_DIR="${1:-$(pwd)}"

echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN}  CosmicAC Repository Setup Script${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Base directory: $BASE_DIR"
echo ""

setup_repo() {
  local repo=$1
  local repo_path="$BASE_DIR/$repo"
  local steps=2

  # cosmicac-ui requires build step
  if [ "$repo" = "cosmicac-ui" ]; then
    steps=3
  fi

  echo -e "${YELLOW}----------------------------------------${NC}"
  echo -e "${YELLOW}Setting up: $repo${NC}"
  echo -e "${YELLOW}----------------------------------------${NC}"

  if [ ! -d "$repo_path" ]; then
    echo -e "${RED}ERROR: Repository not found at $repo_path${NC}"
    echo -e "${RED}Please clone the repository first.${NC}"
    return 1
  fi

  cd "$repo_path"

  # Step 1: Install dependencies
  echo -e "${GREEN}[1/$steps] Installing dependencies...${NC}"
  if [ -f "package-lock.json" ]; then
    echo "Found package-lock.json, running npm ci..."
    npm ci
  else
    echo "No package-lock.json found, running npm install..."
    npm install
  fi

  # Step 2: Run setup-config.sh if present
  echo -e "${GREEN}[2/$steps] Running setup-config.sh...${NC}"
  if [ -f "setup-config.sh" ]; then
    chmod +x setup-config.sh
    ./setup-config.sh
  else
    echo "No setup-config.sh found, skipping..."
  fi

  # Step 3: Build (only for cosmicac-ui)
  if [ "$repo" = "cosmicac-ui" ]; then
    echo -e "${GREEN}[3/$steps] Building UI...${NC}"
    npm run build
  fi

  echo -e "${GREEN}✓ $repo setup complete${NC}"
  echo ""

  cd "$BASE_DIR"
}

# Main execution
echo "Starting setup for ${#REPOS[@]} repositories..."
echo ""

FAILED=()

for repo in "${REPOS[@]}"; do
  if ! setup_repo "$repo"; then
    FAILED+=("$repo")
  fi
done

echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN}  Setup Complete${NC}"
echo -e "${GREEN}========================================${NC}"

if [ ${#FAILED[@]} -gt 0 ]; then
  echo ""
  echo -e "${RED}The following repositories failed setup:${NC}"
  for repo in "${FAILED[@]}"; do
    echo -e "${RED}  - $repo${NC}"
  done
  exit 1
else
  echo ""
  echo -e "${GREEN}All repositories set up successfully!${NC}"
  echo ""
  echo "Next steps:"
  echo "  1. Copy stg.ecosystem.config.js to ~/"
  echo "  2. Copy autobase-connect.js to ~/"
  echo "  3. Start with: pm2 start stg.ecosystem.config.js"
fi

Make the Script Executable and Run

chmod +x ~/setup-repos.sh

# Run the setup from home directory
cd ~
./setup-repos.sh

5. PM2 Configuration

Copy PM2 Folder (Manual Step)

TODO: The pm2/ folder (containing stg.ecosystem.config.js, dev.ecosystem.config.js, and package.json) is not part of any of the six cloned repositories. Obtain it from your team's internal shared location or artifact store.

Copy the entire pm2 folder to the home directory:

# Copy the pm2 folder with ecosystem configs
cp -r /path/to/pm2 ~/

Note: The autobase-connect.js script is created manually in Section 7 — Autobase Connection.

Install PM2 Dependencies

cd ~/pm2
npm install

This installs hypercore-crypto which is needed for automatic HRPC keypair generation.

Ecosystem Configuration Reference

The stg.ecosystem.config.js file manages all worker processes. Here's the component configuration:

| Component | PM2 Name Pattern | Default Port | Worker Type / Command |
|---|---|---|---|
| wrk-ork | wrk-ork-{i} | - | wrk-ork-proc-aggr |
| app-node | app-node-{i} | 3000 | wrk-node-http |
| wrk-server-k8s-nvidia | wrk-server-k8s-nvidia-{i} | - | wrk-server-rack-k8s |
| proxy-inference | proxy-inference-http-{i} | 8000 | wrk-proxy-http |
| proxy-inference | proxy-inference-hrpc-{i} | - | wrk-proxy-hrpc |
| tether-wrk-ext-sendgrid | wrk-ext-sendgrid | - | wrk-ext-sendgrid |
| cosmicac-ui | app-ui | 5173 | npx serve -s -l 5173 dist |

Note: The UI runs as a static file server using serve package, not as a worker.

Automatic App-Node Secrets Generation

The ecosystem config automatically generates secrets for cosmicac-app-node/config/common.json on first run.

| Secret | Length | Default Value (triggers generation) |
|---|---|---|
| signUpSecret | 16 chars (A-Za-z0-9) | SIGN_UP_SECRET |
| mfaSecretKey | 16 chars (A-Za-z0-9) | MFA_SECRET_KEY |
| apiKeySecret | 64 chars (A-Za-z0-9) | API_KEY_HASHING_SECRET_CHANGE_IN_PRODUCTION |

Secrets are only generated if set to their default placeholder values. Already configured secrets are not overwritten.

Automatic HRPC Keypair Generation

The ecosystem config automatically generates an HRPC keypair for cosmicac-proxy-inference if one doesn't exist.

When PM2 loads the ecosystem config, it checks cosmicac-proxy-inference/config/hrpc.json:

  • If rpcKeypair.secretKey and rpcKeypair.publicKey are both empty, it generates a new keypair using hypercore-crypto
  • The generated keys are saved back to hrpc.json
  • If keys already exist, no changes are made

The hrpc.json file should have this structure:

{
  "rpcKeypair": {
    "secretKey": "",
    "publicKey": ""
  }
}

After the first PM2 start, it will be populated with the generated keys.


6. Starting the Application Stack

Initial Start (Sequential)

Due to dependencies between workers, the first startup requires a sequential approach:

Step 1: Start wrk-ork

cd ~/pm2

# Start wrk-ork first (creates status files needed by other workers)
pm2 start stg.ecosystem.config.js --only wrk-ork-0

# Wait for wrk-ork to initialize (check logs)
pm2 logs wrk-ork-0
# Wait until you see it's fully started, then Ctrl+C

Step 2: Configure app-node

After wrk-ork is running, configure app-node before starting it.

2a. Copy wrk-ork rpcPublicKey to app-node config

Get the rpcPublicKey from wrk-ork status and add it to app-node's config:

# Get the rpcPublicKey from wrk-ork status
cat ~/cosmicac-wrk-ork/status/*.json | jq '.rpcPublicKey'

Edit ~/cosmicac-app-node/config/common.json and add the orks configuration:

{
  "orks": {
    "cluster-0": {
      "rpcPublicKey": "<RPC_PUBLIC_KEY_FROM_WRK_ORK>"
    }
  }
}

2b. Configure UI static path

Add the UI path to ~/cosmicac-app-node/config/common.json:

{
  "staticRootPath": "/home/cosmicac/cosmicac-ui/"
}

2c. Configure OAuth2

Edit ~/cosmicac-app-node/config/facs/httpd-oauth2.config.json with your OAuth2 settings:

{
  "enabled": true,
  "providers": {
    "google": {
      "clientId": "<YOUR_GOOGLE_CLIENT_ID>",
      "clientSecret": "<YOUR_GOOGLE_CLIENT_SECRET>",
      "callbackUrl": "https://<YOUR_DOMAIN>/auth/google/callback"
    }
  },
  "sessionSecret": "<YOUR_SESSION_SECRET>",
  "cookieDomain": "<YOUR_DOMAIN>"
}

Note: Replace the placeholder values with your actual OAuth2 credentials.

Step 3: Start app-node

# Start app-node (depends on wrk-ork status)
pm2 start stg.ecosystem.config.js --only app-node-0

# Wait for app-node to initialize
pm2 logs app-node-0
# Wait until ready, then Ctrl+C

Step 4: Start remaining workers

# Start the rest of the workers
pm2 start stg.ecosystem.config.js

Subsequent Starts

After the initial setup, you can start all services at once:

pm2 start stg.ecosystem.config.js

Useful PM2 Commands

# Check status of all processes
pm2 status

# View logs for all processes
pm2 logs

# View logs for a specific process
pm2 logs app-node-0

# Restart all processes
pm2 restart all

# Restart specific process
pm2 restart app-node-0

# Stop all processes
pm2 stop all

# Delete all processes from PM2
pm2 delete all

# Monitor resources
pm2 monit

# Save current process list (for auto-restart)
pm2 save

7. Autobase Connection

After all workers are running, establish the autobase connection.

Create autobase-connect.js

Create a file named ~/autobase-connect.js with the contents below:

'use strict';

const fs = require('fs/promises');
const path = require('path');
const { exec } = require('child_process');
const { promisify } = require('util');
const execAsync = promisify(exec);

const loadStatusField = async (file, key) => {
  try {
    const content = await fs.readFile(file, 'utf-8');
    return JSON.parse(content)?.[key];
  } catch (err) {
    if (err.code !== 'ENOENT') {
      console.error('Failed to read:', file, err.message);
    }
    return null;
  }
};

const runRegisterCommand = async (autobase, rpcPublicKey) => {
  if (!autobase?.writer) return;

  const command = `npx hp-rpc-cli -s ${rpcPublicKey} -m registerAutobaseWriter -d '${JSON.stringify({
    key: autobase.writer,
  })}'`;

  console.log('▶ Running:', command);

  try {
    const { stdout, stderr } = await execAsync(command);
    if (stderr) {
      console.warn('⚠️ Stderr:', stderr);
    }
    if (stdout) {
      console.log('✅ Output:', stdout);
    }
  } catch (err) {
    console.error('❌ Command failed:', err.message);
  }
};

const processStatusDir = async (baseDir, rpcPublicKey, skipFile) => {
  const statusDir = path.join(baseDir, 'status');

  try {
    const files = await fs.readdir(statusDir);
    for (const file of files) {
      if (file === skipFile) continue;

      const autobase = await loadStatusField(
        path.join(statusDir, file),
        'autobase'
      );

      await runRegisterCommand(autobase, rpcPublicKey);
    }
  } catch (err) {
    if (err.code !== 'ENOENT') {
      console.error('Failed to process dir:', statusDir, err.message);
    }
  }
};

(async () => {
  const appCwd = path.join(__dirname, 'cosmicac-app-node');
  const proxyInferenceCwd = path.join(__dirname, 'cosmicac-proxy-inference');

  const mainRpcPublicKey = await loadStatusField(
    path.join(appCwd, 'status', 'wrk-node-http-3000.json'),
    'rpcPublicKey'
  );

  if (!mainRpcPublicKey) {
    console.error('❌ rpcPublicKey not found');
    return;
  }

  await processStatusDir(appCwd, mainRpcPublicKey, 'wrk-node-http-3000.json');
  await processStatusDir(proxyInferenceCwd, mainRpcPublicKey);
})();

Run the Autobase Connection

cd ~

# Run the autobase connection script
node autobase-connect.js

This script:

  1. Reads the rpcPublicKey from cosmicac-app-node/status/wrk-node-http-3000.json
  2. Registers autobase writers from both cosmicac-app-node and cosmicac-proxy-inference
  3. Creates the communication link between the components

Verify Connection

Check for success messages:

  • ✅ indicates successful registration
  • ❌ indicates a failure (check logs for details)

8. Registering Things & Racks

Get RPC Public Keys

Each worker has an rpcPublicKey stored in its status file:

# Get wrk-server-k8s-nvidia rpcPublicKey
cat ~/cosmicac-wrk-server-k8s-nvidia/status/*.json | jq '.rpcPublicKey'

# Get wrk-ork rpcPublicKey
cat ~/cosmicac-wrk-ork/status/*.json | jq '.rpcPublicKey'

Register K8s Control Plane (Thing)

npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_SERVER_K8S_NVIDIA> -m registerThing -d '{
  "id": "<THING_ID>",
  "opts": {
    "inCluster": false,
    "clusters": [{
      "name": "cluster.local",
      "server": "<CONTROL_PLANE_URL>",
      "caData": "<CA_DATA>",
      "skipTLSVerify": false
    }],
    "users": [{
      "name": "<USER_NAME>",
      "token": "<TOKEN>"
    }],
    "contexts": [{
      "name": "<USER_NAME>@cluster.local",
      "user": "<USER_NAME>",
      "cluster": "cluster.local"
    }]
  },
  "info": {},
  "tags": ["k8s-control-plane"]
}' -t 100000

Verify Thing Registration

npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_SERVER_K8S_NVIDIA> -m isOnline -d '{"id": "<THING_ID>"}' -t 100000

Register Rack

npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_ORK> -m registerRack -d '{
  "id": "<RACK_ID>",
  "type": "server",
  "info": {
    "rpcPublicKey": "<RPC_PUBLIC_KEY_OF_WRK_SERVER_K8S_NVIDIA>",
    "location": "IN"
  }
}' -t 100000

9. Creating Jobs

Example: Create Inference Job

npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_ORK> -m createJob -d '{
  "gpu": {
    "count": 1,
    "type": "GA106_RTX_A2000_12GB"
  },
  "location": "IN",
  "userId": 1,
  "name": "new-inference-job",
  "tags": ["inference"],
  "type": "INFERENCE_VLLM",
  "params": {
    "docker_image": "abhi07/cosmicac-wrk-agent-inference:latest",
    "image_pull_policy": "Always",
    "namespace": "default",
    "config_debug": "1",
    "model_name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "model_source": "huggingface",
    "agent_topic": "@cosmicac/agent-inference",
    "is_managed_inference": "true",
    "handshake_secret": "secret",
    "crypto_key": "a1234567890b1234567890c1234567890",
    "crypto_algo": "hmac-sha384",
    "vllm_startup_timeout_ms": "300000",
    "node_env": "development",
    "swap_space": "0",
    "dtype": "float16",
    "enforce_eager": "true",
    "env": [
      {
        "name": "HF_TOKEN",
        "valueFrom": {
          "secretKeyRef": {
            "name": "hf-token-secret",
            "key": "HF_TOKEN"
          }
        }
      }
    ],
    "cpu_limit": "4",
    "memory_limit": "4Gi",
    "cpu_request": "2",
    "memory_request": "4Gi"
  }
}' -t 100000

10. Troubleshooting

Common Issues

PM2 Processes Keep Restarting

# Check logs for errors
pm2 logs --err

# Check specific process logs
pm2 logs app-node-0 --lines 100

Status Files Not Found

Ensure workers are started in the correct order. The status files are created when workers initialize:

# Check if status files exist
ls -la ~/cosmicac-wrk-ork/status/
ls -la ~/cosmicac-app-node/status/

Node Version Issues

# Verify you're using Node 20
node --version

# If wrong version, switch
nvm use 20

Permission Denied

# Ensure you're running as cosmicac user
whoami

# If not, switch to cosmicac
sudo -u cosmicac -i

Log Locations

PM2 logs are stored in:

~/.pm2/logs/

View all available logs:

ls -la ~/.pm2/logs/

Health Check Script

Create ~/health-check.sh:

#!/bin/bash

echo "=== CosmicAC Health Check ==="
echo ""
echo "PM2 Status:"
pm2 jlist | jq -r '.[] | "\(.name): \(.pm2_env.status)"'
echo ""
echo "Status Files:"
for dir in cosmicac-wrk-ork cosmicac-app-node cosmicac-wrk-server-k8s-nvidia cosmicac-proxy-inference; do
  if [ -d "$HOME/$dir/status" ]; then
    echo "  ✓ $dir/status exists"
  else
    echo "  ✗ $dir/status missing"
  fi
done
echo ""
echo "Ports in use:"
netstat -tlnp 2>/dev/null | grep -E ':(3000|8000)' || echo "  No relevant ports found"

Make the script executable and run it:

chmod +x ~/health-check.sh
./health-check.sh

Quick Reference

Directory Structure

/home/cosmicac/
├── .gitconfig
├── .nvm/
├── pm2/                                # PM2 configuration folder
│   ├── package.json
│   ├── stg.ecosystem.config.js
│   ├── dev.ecosystem.config.js
│   └── node_modules/
├── setup-repos.sh
├── autobase-connect.js
├── health-check.sh
├── cosmicac-wrk-ork/
├── cosmicac-app-node/
├── cosmicac-ui/
├── cosmicac-wrk-server-k8s-nvidia/
├── cosmicac-proxy-inference/
└── tether-wrk-ext-sendgrid/

Startup Sequence

  1. cd ~/pm2 && npm install (first time only)
  2. pm2 start stg.ecosystem.config.js --only wrk-ork-0
  3. Wait for wrk-ork initialization
  4. Configure app-node:
    • Copy rpcPublicKey from wrk-ork status to app-node/config/common.json
    • Set staticRootPath to /home/cosmicac/cosmicac-ui/
    • Configure OAuth2 in config/facs/httpd-oauth2.config.json
  5. pm2 start stg.ecosystem.config.js --only app-node-0
  6. Wait for app-node initialization
  7. pm2 start stg.ecosystem.config.js (starts remaining)
  8. cd ~ && node autobase-connect.js
  9. Register things and racks as needed

Environment Checklist

  • [ ] User cosmicac and group created
  • [ ] Sudoers configured (/etc/sudoers.d/cosmicac)
  • [ ] Team members added to cosmicac group
  • [ ] Git configured (.gitconfig with HTTPS redirects)
  • [ ] Rootless Docker configured:
    • [ ] System sysctl settings applied
    • [ ] subuid/subgid configured
    • [ ] User lingering enabled
    • [ ] Docker service override created
    • [ ] Docker daemon running (systemctl --user status docker)
  • [ ] NVM installed
  • [ ] Node 20 installed and set as default
  • [ ] PM2 installed globally (user-level)
  • [ ] hp-rpc-cli installed globally (user-level)
  • [ ] Caddy installed and configured (/etc/caddy/Caddyfile)
  • [ ] Caddy service running (systemctl status caddy)
  • [ ] All repositories cloned (on dev branch, will switch to stg later)
  • [ ] setup-repos.sh executed successfully
  • [ ] pm2 folder copied to home directory
  • [ ] npm install run in ~/pm2
  • [ ] autobase-connect.js in place
  • [ ] wrk-ork started and status file created
  • [ ] app-node configured:
    • [ ] orks.rpcPublicKey added to config/common.json
    • [ ] staticRootPath set to /home/cosmicac/cosmicac-ui/
    • [ ] OAuth2 configured in config/facs/httpd-oauth2.config.json
  • [ ] All PM2 processes running
  • [ ] Autobase connection established

Branch Reference

| Repository | Current Branch | Target Branch (Staging) |
|---|---|---|
| cosmicac-wrk-ork | dev | stg |
| cosmicac-app-node | dev | stg |
| cosmicac-ui | dev | stg |
| cosmicac-wrk-server-k8s-nvidia | dev | stg |
| cosmicac-proxy-inference | dev | stg |
| tether-wrk-ext-sendgrid | dev | stg |

DNS Setup

Add a DNS record for the deployment in Cloudflare (dash.cloudflare.com > Domains > DNS):

  • Record: cosmicac.tether.su
  • IP: <ip of the server>
  • Proxied: true

GCP Setup

Provision the server and firewall rules using the following Terraform configuration:

```hcl
# This code is compatible with Terraform 4.25.0 and versions that are backwards compatible to 4.25.0.
# For information about validating this Terraform code, see https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/google-cloud-platform-build#format-and-validate-the-configuration

resource "google_compute_instance" "prod-cosmicac-0" {
  boot_disk {
    auto_delete = true
    device_name = "prod-cosmicac-0"

    initialize_params {
      image = "projects/ubuntu-os-cloud/global/images/ubuntu-minimal-2404-noble-amd64-v20260325"
      size  = 150
      type  = "pd-balanced"
    }

    mode = "READ_WRITE"
  }

  can_ip_forward      = false
  deletion_protection = false
  enable_display      = false

  labels = {
    goog-ec-src           = "vm_add-tf"
    goog-ops-agent-policy = "v2-template-1-7-0"
  }

  machine_type = "e2-custom-16-32768"

  metadata = {
    enable-osconfig = "TRUE"
    ssh-keys        = "chetas:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCwcxwlMknWSIi3YDarjByHONbVLtSvjiFw0PXh+LDWbMlzWc3Zniiiz3MfaeyZltWaFIIicz+ikz9zMfvPr3Um2BGadBpzQbm0lqMMQCHy3D4t7MojzS+O5S7urIp9mPBMDwpv1vL3XudeM8yj3vYmGbPV/uKfIu0Aucy0yKKpGH/LaUmzePFEaUHSAYSNmp+BfyMWR5Un0nluJ2k8SZJfitRMOl/ALgEwmRCEQB3rJb6PMqXXh9xAScl39PTUREbFvCQJrw/efaFFfZhbKFrojTQRlky3s4HS5uh2kh1KZvrErsC3yuPex9P/8qCNjnuoU8pxAbc5uSy7wtjvCMsle7dZ1FczxAXJAtJgDtrofX5LjznUkPwBpEtwyjBvgq4BXsGBj8V3V9vHBgprSzGXPOP/Bosg+iy7K3BBYkE4MaJF2cLVH+g3+LK7BM5brier4BBSqa9dgEjsGrSNjnpiO2v15iWJW3R1a+6LmYNdqbzi16lgizaby/fKRjxyqvr9sUYJVrimaYmyNgfcDNrSA3PbYbMjDTWgujiBbRBXsuhnF/59T+84KdnHDC49gy5GQUXez3tOEbu/2JkDjxZK5C7Zj+aujpp1osgVXkPRhDpPzj4RiAK16cQMPTZoHvbNbbJ1cYfB12GNWja8iZMwyT347ykkyDMyy/XRwRtGew== chetas"
  }

  name = "prod-cosmicac-0"

  network_interface {
    access_config {
      nat_ip       = "34.122.95.57"
      network_tier = "PREMIUM"
    }

    queue_count = 0
    stack_type  = "IPV4_ONLY"
    subnetwork  = "projects/tether-data-sec-cosmicac/regions/europe-west6/subnetworks/prd-private-subnet"
  }

  reservation_affinity {
    type = "ANY_RESERVATION"
  }

  scheduling {
    automatic_restart   = true
    on_host_maintenance = "MIGRATE"
    preemptible         = false
    provisioning_model  = "STANDARD"
  }

  service_account {
    email  = "846467450615-compute@developer.gserviceaccount.com"
    scopes = ["https://www.googleapis.com/auth/devstorage.read_only", "https://www.googleapis.com/auth/logging.write", "https://www.googleapis.com/auth/monitoring.write", "https://www.googleapis.com/auth/service.management.readonly", "https://www.googleapis.com/auth/servicecontrol", "https://www.googleapis.com/auth/trace.append"]
  }

  shielded_instance_config {
    enable_integrity_monitoring = true
    enable_secure_boot          = false
    enable_vtpm                 = true
  }

  tags = ["prod"]
  zone = "europe-west6-c"
}

module "ops_agent_policy" {
  source          = "github.com/terraform-google-modules/terraform-google-cloud-operations/modules/ops-agent-policy"
  project         = "tether-data-sec-cosmicac"
  zone            = "europe-west6-c"
  assignment_id   = "goog-ops-agent-v2-template-1-7-0-europe-west6-c"
  agents_rule = {
    package_state = "installed"
    version = "latest"
  }
  instance_filter = {
    all = false
    inclusion_labels = [{
      labels = {
        goog-ops-agent-policy = "v2-template-1-7-0"
      }
    }]
  }
}
```

### Firewall rules

```shell
# Allow HTTPS (443) only from Cloudflare's published IPv4 ranges (the site is proxied)
gcloud compute --project=tether-data-sec-cosmicac firewall-rules create prod-cosmicac-cf --direction=INGRESS --priority=1000 --network=prd-vpc --action=ALLOW --rules=tcp:443 --source-ranges=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22 --target-tags=prod

# Allow SSH (22) only from the admin workstation IP
gcloud compute --project=tether-data-sec-cosmicac firewall-rules create prod-cosmicac-chetas --direction=INGRESS --priority=1000 --network=prd-vpc --action=ALLOW --rules=tcp:22 --source-ranges=64.227.130.182/32 --target-tags=prod
```
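As a quick local sanity check on the `--source-ranges` list above, the following bash sketch (helper names are illustrative, IPv4 only) tests whether a given client IP falls inside one of the allowed Cloudflare CIDR blocks:

```shell
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<<"$1"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Succeed if $1 (an IP) lies inside $2 (a CIDR block).
in_cidr() {
  local ip=$1 cidr=$2 net bits mask
  net=${cidr%/*}
  bits=${cidr#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# 104.16.0.0/13 is one of the Cloudflare ranges allowed by the rule above.
if in_cidr 104.16.1.1 104.16.0.0/13; then echo allowed; else echo blocked; fi
```

Anything outside these ranges is dropped at the VPC firewall before it ever reaches Caddy on port 443.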
Issues encountered during this deployment that required manual intervention:

- The hypermq key has to be added manually.
- The Kubernetes config setup is missing.
- The superuser had to be created manually, and pricing had to be set up.
- The OAuth config was incorrect (`callbackUriUI` was missing the `/callback` path).
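For the OAuth item above, the fix is appending `/callback` to the UI callback URI. A hypothetical corrected value, assuming the staging hostname (the exact config file containing `callbackUriUI` is not shown in these notes):

```jsonc
{
  // before: "callbackUriUI": "https://cosmicac.tether.su"
  "callbackUriUI": "https://cosmicac.tether.su/callback"
}
```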

## Additional notes

## Sendgrid setup

### tether-wrk-ext-sendgrid

1. Clone the repo and install dependencies with `npm i`.

2. Run `./setup-config.sh`.

3. Update the generated config files:

   a. `config/sendgrid.ext.json`

   ```jsonc
   {
     "apiKey": "",              // only the SendGrid API key needs to be added here
     "defaultTemplate": "cosmicac",
     "overrideEmailSender": ""
   }
   ```

   b. `config/facs/net.config.json`

   ```jsonc
   {
     "r0": {
       "allow": [],             // add the rpcClientKey of the app-node here
       "allowLocal": true
     }
   }
   ```

4. Run the worker:

   ```shell
   node worker.js --wtype wrk-ext-sendgrid --env development
   ```

5. Configure this worker in `cosmicac-app-node`.

   Location: `config/common.json`

   Update the `emailService` block:

   ```jsonc
   "emailService": {
     "rpcPublicKey": "EMAIL_SERVICE_RPC_PUBLIC_KEY", // rpcPublicKey of the ext-sendgrid worker
     "from": {
       "name": "Cosmicac No Reply",
       "email": "EMAIL_SENDER"                       // the email address to use for prod
     },
     "template": {                                   // update these URLs to match the prod URL
       "pwdIcon": "https://dev-cosmicac.tail8a2a3f.ts.net/assets/email-reset-password.png",
       "pwdResetURL": "https://dev-cosmicac.tail8a2a3f.ts.net/new-password"
     }
   },
   ```
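The `//` comments in the snippet above are annotations only; the actual `common.json` must remain strict JSON. A quick sanity-check sketch, assuming `python3` is available on the server (the sample path here is illustrative; in practice point the check at `config/common.json`):

```shell
# Write a comment-free sample of the emailService block to a temp file,
# then confirm it parses as strict JSON with the expected keys.
cat > /tmp/email-service.sample.json <<'EOF'
{
  "emailService": {
    "rpcPublicKey": "EMAIL_SERVICE_RPC_PUBLIC_KEY",
    "from": { "name": "Cosmicac No Reply", "email": "EMAIL_SENDER" },
    "template": {
      "pwdIcon": "https://dev-cosmicac.tail8a2a3f.ts.net/assets/email-reset-password.png",
      "pwdResetURL": "https://dev-cosmicac.tail8a2a3f.ts.net/new-password"
    }
  }
}
EOF
python3 -c 'import json; svc = json.load(open("/tmp/email-service.sample.json"))["emailService"]; assert all(k in svc for k in ("rpcPublicKey", "from", "template")); print("emailService config OK")'
```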
