CosmicAC Staging Server Deployment Guide
Comprehensive step-by-step instructions for deploying the full CosmicAC application stack on a staging server, from server setup through job creation.
Table of Contents
- Server Setup
- Node.js Environment Setup
- Caddy Web Server Setup
- Repository Setup
- PM2 Configuration
- Starting the Application Stack
- Autobase Connection
- Registering Things & Racks
- Creating Jobs
- Troubleshooting
1. Server Setup
Create the cosmicac User and Group
All application components will run under the cosmicac user account. Other team members can be added to the cosmicac group to manage PM2 and services.
# Create the cosmicac group
sudo groupadd cosmicac
# Create the user with home directory and add to cosmicac group
sudo useradd -m -s /bin/bash -g cosmicac cosmicac
# Set a password (optional, but recommended)
sudo passwd cosmicac
# Add to sudo group if needed for initial setup
sudo usermod -aG sudo cosmicac

Configure Sudoers for cosmicac Group
Create a sudoers file to allow members of the cosmicac group to run commands as the cosmicac user without a password. This enables PM2 management.
Create /etc/sudoers.d/cosmicac:
# Allow members of cosmicac group to run commands as cosmicac user
%cosmicac ALL=(cosmicac) NOPASSWD: ALL
# Allow members to switch to cosmicac user shell
%cosmicac ALL=(cosmicac) NOPASSWD: /bin/bash, /bin/sh

Apply the configuration:
# Create the sudoers file (must use visudo for safety)
sudo visudo -f /etc/sudoers.d/cosmicac
# Or create directly with proper permissions
echo '%cosmicac ALL=(cosmicac) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/cosmicac
sudo chmod 440 /etc/sudoers.d/cosmicac
# Validate sudoers syntax
sudo visudo -c

Add Team Members to cosmicac Group
# Add existing users to cosmicac group
sudo usermod -aG cosmicac <username>
# Verify group membership
groups <username>

Managing PM2 as Team Member
Once added to the cosmicac group, team members can manage PM2:
# Run PM2 commands as cosmicac user
sudo -u cosmicac pm2 status
sudo -u cosmicac pm2 logs
sudo -u cosmicac pm2 restart all
# Switch to cosmicac user shell (for multiple commands)
sudo -u cosmicac bash -l

Verify User Setup
whoami       # Should output: cosmicac
echo $HOME   # Should output: /home/cosmicac

Configure Git
Set up Git to use HTTPS instead of SSH/git protocols and enable credential caching:
# Create/update .gitconfig
cat > ~/.gitconfig << 'EOF'
[url "https://github.com/"]
insteadOf = git@github.com:
[url "https://"]
insteadOf = git://
[credential]
helper = cache --timeout=3600
EOF

This configuration:
- Redirects git@github.com: URLs to HTTPS (avoids SSH key requirements)
- Redirects git:// protocol URLs to HTTPS
- Caches credentials for 1 hour (3600 seconds) to avoid repeated prompts
Verify the configuration:
cat ~/.gitconfig
git config --list | grep -E "(url|credential)"

Rootless Docker Setup
Rootless Docker allows containers to run without root privileges, improving security.
System-Level Configuration (Run as root/sudo)
Step 1: Update system and install prerequisites
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y \
curl \
ca-certificates \
gnupg \
lsb-release \
uidmap \
dbus-user-session \
fuse-overlayfs \
slirp4netns \
systemd-container \
iproute2 \
iptables

Step 2: Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

Step 3: Add Docker repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Step 4: Install Docker Engine
# Update apt with the new repository
sudo apt-get update
# Install Docker (includes dockerd-rootless-setuptool.sh)
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Verify installation
docker --version

Troubleshooting: If you get "Package docker-ce has no installation candidate":
- Check your distribution: lsb_release -cs
- Verify the repository was added: cat /etc/apt/sources.list.d/docker.list
- Make sure you ran apt-get update after adding the repository
Step 5: Configure system for rootless Docker
Create sysctl configuration file /etc/sysctl.d/99-rootless-docker.conf:
# Enable user namespaces for rootless Docker
kernel.unprivileged_userns_clone=1
# Allow unprivileged users to bind to ports >= 80
net.ipv4.ip_unprivileged_port_start=80
# Increase the number of inotify watches
fs.inotify.max_user_watches=524288
fs.inotify.max_user_instances=512
# Network settings for better container networking
net.ipv4.ip_forward=1
net.ipv4.conf.all.route_localnet=1

Apply the configuration:
# Apply sysctl settings
sudo sysctl --system
# Set up subordinate UIDs and GIDs for cosmicac user (for user namespace mapping)
# Check if already configured, add only if not present
grep -q "^cosmicac:" /etc/subuid || echo "cosmicac:100000:65536" | sudo tee -a /etc/subuid
grep -q "^cosmicac:" /etc/subgid || echo "cosmicac:100000:65536" | sudo tee -a /etc/subgid
# Verify the entries
cat /etc/subuid
cat /etc/subgid
# Enable lingering (allows user services to run without login)
sudo loginctl enable-linger cosmicac
# Create XDG_RUNTIME_DIR for cosmicac user (required for systemd user session)
COSMICAC_UID=$(id -u cosmicac)
sudo mkdir -p /run/user/${COSMICAC_UID}
sudo chown cosmicac:cosmicac /run/user/${COSMICAC_UID}
sudo chmod 700 /run/user/${COSMICAC_UID}
# Disable system Docker daemon (we'll use rootless instead)
sudo systemctl disable --now docker.service docker.socket

User-Level Configuration (Run as cosmicac)
Switch to cosmicac user with proper systemd environment:
# Get cosmicac UID
COSMICAC_UID=$(id -u cosmicac)
# Switch to cosmicac with proper systemd environment
sudo -u cosmicac \
XDG_RUNTIME_DIR=/run/user/${COSMICAC_UID} \
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/${COSMICAC_UID}/bus \
bash -l

Once logged in as cosmicac:
# Verify environment variables are set
echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"
echo "UID=$(id -u)"
# If XDG_RUNTIME_DIR is empty, set it manually
export XDG_RUNTIME_DIR=/run/user/$(id -u)
export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$(id -u)/bus
# Run rootless Docker setup
dockerd-rootless-setuptool.sh install
# Create service override for proper networking
mkdir -p ~/.config/systemd/user/docker.service.d
cat > ~/.config/systemd/user/docker.service.d/override.conf << 'EOF'
[Service]
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_DISABLE_HOST_LOOPBACK=false"
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_NET=slirp4netns"
Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=builtin"
EOF
# Create Docker daemon config
mkdir -p ~/.config/docker
cat > ~/.config/docker/daemon.json << 'EOF'
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"default-address-pools": [
{
"base": "172.17.0.0/16",
"size": 24
}
]
}
EOF
# Set environment variables (add to .bashrc)
cat >> ~/.bashrc << 'EOF'
# Rootless Docker configuration
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
export PATH=$HOME/bin:$PATH
EOF
# Source the environment
source ~/.bashrc
# Enable and start Docker for this user
systemctl --user enable docker
systemctl --user start docker
# Verify installation
docker --version
docker compose version
# Test Docker networking
docker run --rm alpine echo "Docker networking test successful!"

Verify Rootless Docker
# Check Docker daemon status
systemctl --user status docker
# Check Docker socket exists
ls -la /run/user/$(id -u)/docker.sock
# Test port binding
docker run --rm -d -p 8888:80 --name test-nginx nginx:alpine
sleep 2
curl -s http://localhost:8888 && echo "Port binding works!"
docker stop test-nginx

2. Node.js Environment Setup
All dependencies are installed at the user level (not system-wide) using NVM.
Install NVM (Node Version Manager)
# Download and install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
# Reload shell configuration
source ~/.bashrc
# or
source ~/.profile
# Verify NVM installation
nvm --version

Install Node.js 20
# Install Node.js 20 LTS
nvm install 20
# Set Node 20 as default
nvm alias default 20
# Verify installation
node --version # Should output: v20.x.x
npm --version   # Should output: 10.x.x

Install Global Packages (User Level)
# Install PM2 (process manager)
npm install -g pm2
# Install hp-rpc-cli (RPC command line tool)
npm install -g hp-rpc-cli
# Verify installations
pm2 --version
npx hp-rpc-cli --version
# Setup PM2 startup script (optional - for auto-restart on reboot)
pm2 startup
# Follow the instructions output by the command

Verify Environment
# Run this to confirm everything is set up correctly
echo "Node: $(node --version)"
echo "NPM: $(npm --version)"
echo "PM2: $(pm2 --version)"
echo "hp-rpc-cli: $(npx hp-rpc-cli --version 2>/dev/null || echo 'installed')"
echo "User: $(whoami)"
echo "Home: $HOME"

3. Caddy Web Server Setup
Caddy is used as a reverse proxy to route traffic to the application components.
Install Caddy (Run as root/sudo)
# Install Caddy via apt
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
# Verify installation
caddy version

Configure Tailscale for Caddy Certificates
To allow Caddy to obtain HTTPS certificates from Tailscale, add the following to /etc/default/tailscaled:
# Add Caddy certificate permission to Tailscale
echo 'TS_PERMIT_CERT_UID=caddy' | sudo tee -a /etc/default/tailscaled
# Restart Tailscale to apply changes
sudo systemctl restart tailscaled

This allows Caddy to automatically obtain and renew TLS certificates for your *.ts.net domain.
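As an optional sanity check (a sketch, assuming the tailscale CLI is installed and using the staging hostname from the Caddyfile below), you can ask tailscaled for a certificate as the caddy user; if TS_PERMIT_CERT_UID is effective, the command succeeds:

```shell
# Sketch: verify the caddy user is permitted to fetch certs from tailscaled.
# DOMAIN is the staging hostname assumed elsewhere in this guide.
DOMAIN=stg-cosmicac.tail8a2a3f.ts.net
if command -v tailscale >/dev/null 2>&1; then
  # 'tailscale cert' writes <domain>.crt / <domain>.key to the current directory
  sudo -u caddy tailscale cert "$DOMAIN"
else
  echo "tailscale CLI not found; skipping cert check"
fi
```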
Configure Caddy
Create the Caddyfile at /etc/caddy/Caddyfile:
stg-cosmicac.tail8a2a3f.ts.net {
# API routes -> app-node (port 3000)
handle_path /api/* {
reverse_proxy :3000
}
# Inference routes -> proxy-inference (port 8000) with streaming
handle_path /inference/* {
reverse_proxy :8000 {
flush_interval -1
transport http {
read_buffer 0
write_buffer 0
}
}
}
# Everything else -> UI (port 5173)
reverse_proxy * :5173
}

Apply the configuration:
# Edit the Caddyfile
sudo nano /etc/caddy/Caddyfile
# Or create it directly
sudo tee /etc/caddy/Caddyfile << 'EOF'
stg-cosmicac.tail8a2a3f.ts.net {
handle_path /api/* {
reverse_proxy :3000
}
handle_path /inference/* {
reverse_proxy :8000 {
flush_interval -1
transport http {
read_buffer 0
write_buffer 0
}
}
}
reverse_proxy * :5173
}
EOF
# Validate the configuration
sudo caddy validate --config /etc/caddy/Caddyfile
# Reload Caddy
sudo systemctl reload caddy

Caddy Service Management
# Start Caddy
sudo systemctl start caddy
# Enable Caddy to start on boot
sudo systemctl enable caddy
# Check status
sudo systemctl status caddy
# View logs
sudo journalctl -u caddy -f
# Reload after config changes
sudo systemctl reload caddy

Route Configuration Reference
| Route | Backend | Port | Description |
|---|---|---|---|
| /api/* | app-node | 3000 | API endpoints |
| /inference/* | proxy-inference | 8000 | Inference with streaming support |
| * (default) | cosmicac-ui | 5173 | Frontend UI |
Note: The flush_interval -1 and buffer settings on /inference/* enable real-time streaming for inference responses.
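One more behavior to keep in mind: handle_path strips the matched prefix before proxying, so a request to /api/users reaches app-node as /users. If a backend ever needs to see the original path unchanged, use handle with the same matcher instead:

```
handle /api/* {
    reverse_proxy :3000
}
```

With handle, the /api prefix is preserved on the upstream request.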
4. Repository Setup
Application Components (Execution Order)
The following components need to be deployed in this specific order:
| Order | Repository | Branch (Staging) | Branch (Current) | Description |
|---|---|---|---|---|
| 1 | cosmicac-wrk-ork | stg | dev | Orchestrator worker |
| 2 | cosmicac-app-node | stg | dev | Main application node |
| 3 | cosmicac-ui | stg | dev | User interface |
| 4 | cosmicac-wrk-server-k8s-nvidia | stg | dev | K8s NVIDIA server worker |
| 5 | cosmicac-proxy-inference | stg | dev | Inference proxy |
| 6 | tether-wrk-ext-sendgrid | stg | dev | SendGrid email service |
Note: The default branch for staging is stg, but we are currently using the dev branch.
Clone Repositories (Manual Step)
All repositories are cloned directly into the user's home directory /home/cosmicac:
cd ~
# Clone in execution order (using dev branch for now)
git clone -b dev https://github.com/tetherto/cosmicac-wrk-ork.git
git clone -b dev https://github.com/tetherto/cosmicac-app-node.git
git clone -b dev https://github.com/tetherto/cosmicac-ui.git
git clone -b dev https://github.com/tetherto/cosmicac-wrk-server-k8s-nvidia.git
git clone -b dev https://github.com/tetherto/cosmicac-proxy-inference.git
git clone -b dev https://github.com/tetherto/tether-wrk-ext-sendgrid.git

When switching to the staging branch: replace -b dev with -b stg in the commands above.
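If the repositories are already cloned, the switch can be scripted; a minimal sketch (it skips any repo that is not cloned yet):

```shell
# Sketch: switch every CosmicAC repo from dev to stg in place.
REPOS="cosmicac-wrk-ork cosmicac-app-node cosmicac-ui cosmicac-wrk-server-k8s-nvidia cosmicac-proxy-inference tether-wrk-ext-sendgrid"
for repo in $REPOS; do
  if [ -d "$HOME/$repo/.git" ]; then
    git -C "$HOME/$repo" fetch origin stg && git -C "$HOME/$repo" checkout stg
  else
    echo "skip $repo (not cloned)"
  fi
done
```

Remember to re-run setup-repos.sh afterwards so dependencies match the new branch.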
Automated Repository Setup
After cloning, run the setup automation script to install dependencies and configure each repository.
Setup Script: setup-repos.sh
Create this script in ~/setup-repos.sh:
#!/bin/bash
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Define repositories in execution order
REPOS=(
"cosmicac-wrk-ork"
"cosmicac-app-node"
"cosmicac-ui"
"cosmicac-wrk-server-k8s-nvidia"
"cosmicac-proxy-inference"
"tether-wrk-ext-sendgrid"
)
BASE_DIR="${1:-$(pwd)}"
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN} CosmicAC Repository Setup Script${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Base directory: $BASE_DIR"
echo ""
setup_repo() {
local repo=$1
local repo_path="$BASE_DIR/$repo"
local steps=2
# cosmicac-ui requires build step
if [ "$repo" = "cosmicac-ui" ]; then
steps=3
fi
echo -e "${YELLOW}----------------------------------------${NC}"
echo -e "${YELLOW}Setting up: $repo${NC}"
echo -e "${YELLOW}----------------------------------------${NC}"
if [ ! -d "$repo_path" ]; then
echo -e "${RED}ERROR: Repository not found at $repo_path${NC}"
echo -e "${RED}Please clone the repository first.${NC}"
return 1
fi
cd "$repo_path"
# Step 1: Install dependencies
echo -e "${GREEN}[1/$steps] Installing dependencies...${NC}"
if [ -f "package-lock.json" ]; then
echo "Found package-lock.json, running npm ci..."
npm ci
else
echo "No package-lock.json found, running npm install..."
npm install
fi
# Step 2: Run setup-config.sh if present
echo -e "${GREEN}[2/$steps] Running setup-config.sh...${NC}"
if [ -f "setup-config.sh" ]; then
chmod +x setup-config.sh
./setup-config.sh
else
echo "No setup-config.sh found, skipping..."
fi
# Step 3: Build (only for cosmicac-ui)
if [ "$repo" = "cosmicac-ui" ]; then
echo -e "${GREEN}[3/$steps] Building UI...${NC}"
npm run build
fi
echo -e "${GREEN}✓ $repo setup complete${NC}"
echo ""
cd "$BASE_DIR"
}
# Main execution
echo "Starting setup for ${#REPOS[@]} repositories..."
echo ""
FAILED=()
for repo in "${REPOS[@]}"; do
if ! setup_repo "$repo"; then
FAILED+=("$repo")
fi
done
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN} Setup Complete${NC}"
echo -e "${GREEN}========================================${NC}"
if [ ${#FAILED[@]} -gt 0 ]; then
echo ""
echo -e "${RED}The following repositories failed setup:${NC}"
for repo in "${FAILED[@]}"; do
echo -e "${RED} - $repo${NC}"
done
exit 1
else
echo ""
echo -e "${GREEN}All repositories set up successfully!${NC}"
echo ""
echo "Next steps:"
echo " 1. Copy stg.ecosystem.config.js to ~/"
echo " 2. Copy autobase-connect.js to ~/"
echo " 3. Start with: pm2 start stg.ecosystem.config.js"
fi

Make the Script Executable and Run
chmod +x ~/setup-repos.sh
# Run the setup from home directory
cd ~
./setup-repos.sh

5. PM2 Configuration
Copy PM2 Folder (Manual Step)
TODO: The pm2/ folder (containing stg.ecosystem.config.js, dev.ecosystem.config.js, and package.json) is not part of any of the six cloned repositories. Obtain it from your team's internal shared location or artifact store.
Copy the entire pm2 folder to the home directory:
# Copy the pm2 folder with ecosystem configs
cp -r /path/to/pm2 ~/

Note: The autobase-connect.js script is created manually in Section 7 — Autobase Connection.
Install PM2 Dependencies
cd ~/pm2
npm install

This installs hypercore-crypto, which is needed for automatic HRPC keypair generation.
Ecosystem Configuration Reference
The stg.ecosystem.config.js file manages all worker processes. Here's the component configuration:
| Component | PM2 Name Pattern | Default Port | Worker Type / Command |
|---|---|---|---|
| wrk-ork | wrk-ork-{i} | - | wrk-ork-proc-aggr |
| app-node | app-node-{i} | 3000 | wrk-node-http |
| wrk-server-k8s-nvidia | wrk-server-k8s-nvidia-{i} | - | wrk-server-rack-k8s |
| proxy-inference | proxy-inference-http-{i} | 8000 | wrk-proxy-http |
| proxy-inference | proxy-inference-hrpc-{i} | - | wrk-proxy-hrpc |
| tether-wrk-ext-sendgrid | wrk-ext-sendgrid | - | wrk-ext-sendgrid |
| cosmicac-ui | app-ui | 5173 | npx serve -s -l 5173 dist |
Note: The UI runs as a static file server using serve package, not as a worker.
Automatic App-Node Secrets Generation
The ecosystem config automatically generates secrets for cosmicac-app-node/config/common.json on first run.
| Secret | Length | Default Value (triggers generation) |
|---|---|---|
| signUpSecret | 16 chars (A-Za-z0-9) | SIGN_UP_SECRET |
| mfaSecretKey | 16 chars (A-Za-z0-9) | MFA_SECRET_KEY |
| apiKeySecret | 64 chars (A-Za-z0-9) | API_KEY_HASHING_SECRET_CHANGE_IN_PRODUCTION |
Secrets are only generated if set to their default placeholder values. Already configured secrets are not overwritten.
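The generation itself amounts to drawing random alphanumeric characters of the required length. A shell sketch for illustration only (the real logic lives inside the ecosystem config; gen_secret is a hypothetical helper):

```shell
# Sketch: produce A-Za-z0-9 secrets of the lengths listed in the table above.
gen_secret() { tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$1"; }

SIGN_UP_SECRET=$(gen_secret 16)
MFA_SECRET_KEY=$(gen_secret 16)
API_KEY_SECRET=$(gen_secret 64)
echo "signUpSecret: $SIGN_UP_SECRET"
echo "apiKeySecret: $API_KEY_SECRET"
```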
Automatic HRPC Keypair Generation
The ecosystem config automatically generates an HRPC keypair for cosmicac-proxy-inference if one doesn't exist.
When PM2 loads the ecosystem config, it checks cosmicac-proxy-inference/config/hrpc.json:
- If rpcKeypair.secretKey and rpcKeypair.publicKey are both empty, it generates a new keypair using hypercore-crypto
- The generated keys are saved back to hrpc.json
- If keys already exist, no changes are made
The hrpc.json file should have this structure:
{
"rpcKeypair": {
"secretKey": "",
"publicKey": ""
}
}

After the first PM2 start, the file will be populated with the generated keys.
6. Starting the Application Stack
Initial Start (Sequential)
Due to dependencies between workers, the first startup requires a sequential approach:
Step 1: Start wrk-ork
cd ~/pm2
# Start wrk-ork first (creates status files needed by other workers)
pm2 start stg.ecosystem.config.js --only wrk-ork-0
# Wait for wrk-ork to initialize (check logs)
pm2 logs wrk-ork-0
# Wait until you see it's fully started, then Ctrl+C

Step 2: Configure app-node
After wrk-ork is running, configure app-node before starting it.
2a. Copy wrk-ork rpcPublicKey to app-node config
Get the rpcPublicKey from wrk-ork status and add it to app-node's config:
# Get the rpcPublicKey from wrk-ork status
cat ~/cosmicac-wrk-ork/status/*.json | jq '.rpcPublicKey'

Edit ~/cosmicac-app-node/config/common.json and add the orks configuration:
{
"orks": {
"cluster-0": {
"rpcPublicKey": "<RPC_PUBLIC_KEY_FROM_WRK_ORK>"
}
}
}

2b. Configure UI static path
Add the UI path to ~/cosmicac-app-node/config/common.json:
{
"staticRootPath": "/home/cosmicac/cosmicac-ui/"
}

2c. Configure OAuth2
Edit ~/cosmicac-app-node/config/facs/httpd-oauth2.config.json with your OAuth2 settings:
{
"enabled": true,
"providers": {
"google": {
"clientId": "<YOUR_GOOGLE_CLIENT_ID>",
"clientSecret": "<YOUR_GOOGLE_CLIENT_SECRET>",
"callbackUrl": "https://<YOUR_DOMAIN>/auth/google/callback"
}
},
"sessionSecret": "<YOUR_SESSION_SECRET>",
"cookieDomain": "<YOUR_DOMAIN>"
}

Note: Replace the placeholder values with your actual OAuth2 credentials.
Step 3: Start app-node
# Start app-node (depends on wrk-ork status)
pm2 start stg.ecosystem.config.js --only app-node-0
# Wait for app-node to initialize
pm2 logs app-node-0
# Wait until ready, then Ctrl+C

Step 4: Start remaining workers
# Start the rest of the workers
pm2 start stg.ecosystem.config.js

Subsequent Starts
After the initial setup, you can start all services at once:
pm2 start stg.ecosystem.config.js

Useful PM2 Commands
# Check status of all processes
pm2 status
# View logs for all processes
pm2 logs
# View logs for a specific process
pm2 logs app-node-0
# Restart all processes
pm2 restart all
# Restart specific process
pm2 restart app-node-0
# Stop all processes
pm2 stop all
# Delete all processes from PM2
pm2 delete all
# Monitor resources
pm2 monit
# Save current process list (for auto-restart)
pm2 save

7. Autobase Connection
After all workers are running, establish the autobase connection.
Create autobase-connect.js
Create a file named ~/autobase-connect.js with the contents below:
'use strict';
const fs = require('fs/promises');
const path = require('path');
const { exec } = require('child_process');
const { promisify } = require('util');
const execAsync = promisify(exec);
const loadStatusField = async (file, key) => {
try {
const content = await fs.readFile(file, 'utf-8');
return JSON.parse(content)?.[key];
} catch (err) {
if (err.code !== 'ENOENT') {
console.error('Failed to read:', file, err.message);
}
return null;
}
};
const runRegisterCommand = async (autobase, rpcPublicKey) => {
if (!autobase?.writer) return;
const command = `npx hp-rpc-cli -s ${rpcPublicKey} -m registerAutobaseWriter -d '${JSON.stringify({
key: autobase.writer,
})}'`;
console.log('▶ Running:', command);
try {
const { stdout, stderr } = await execAsync(command);
if (stderr) {
console.warn('⚠️ Stderr:', stderr);
}
if (stdout) {
console.log('✅ Output:', stdout);
}
} catch (err) {
console.error('❌ Command failed:', err.message);
}
};
const processStatusDir = async (baseDir, rpcPublicKey, skipFile) => {
const statusDir = path.join(baseDir, 'status');
try {
const files = await fs.readdir(statusDir);
for (const file of files) {
if (file === skipFile) continue;
const autobase = await loadStatusField(
path.join(statusDir, file),
'autobase'
);
await runRegisterCommand(autobase, rpcPublicKey);
}
} catch (err) {
if (err.code !== 'ENOENT') {
console.error('Failed to process dir:', statusDir, err.message);
}
}
};
(async () => {
const appCwd = path.join(__dirname, 'cosmicac-app-node');
const proxyInferenceCwd = path.join(__dirname, 'cosmicac-proxy-inference');
const mainRpcPublicKey = await loadStatusField(
path.join(appCwd, 'status', 'wrk-node-http-3000.json'),
'rpcPublicKey'
);
if (!mainRpcPublicKey) {
console.error('❌ rpcPublicKey not found');
return;
}
await processStatusDir(appCwd, mainRpcPublicKey, 'wrk-node-http-3000.json');
await processStatusDir(proxyInferenceCwd, mainRpcPublicKey);
})();

Run the Autobase Connection
cd ~
# Run the autobase connection script
node autobase-connect.js

This script:
- Reads the rpcPublicKey from cosmicac-app-node/status/wrk-node-http-3000.json
- Registers autobase writers from both cosmicac-app-node and cosmicac-proxy-inference
- Creates the communication link between the components
Verify Connection
Check for success messages:
- ✅ indicates successful registration
- ❌ indicates a failure (check logs for details)
8. Registering Things & Racks
Get RPC Public Keys
Each worker has an rpcPublicKey stored in its status file:
# Get wrk-server-k8s-nvidia rpcPublicKey
cat ~/cosmicac-wrk-server-k8s-nvidia/status/*.json | jq '.rpcPublicKey'
# Get wrk-ork rpcPublicKey
cat ~/cosmicac-wrk-ork/status/*.json | jq '.rpcPublicKey'

Register K8s Control Plane (Thing)
npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_SERVER_K8S_NVIDIA> -m registerThing -d '{
"id": "<THING_ID>",
"opts": {
"inCluster": false,
"clusters": [{
"name": "cluster.local",
"server": "<CONTROL_PLANE_URL>",
"caData": "<CA_DATA>",
"skipTLSVerify": false
}],
"users": [{
"name": "<USER_NAME>",
"token": "<TOKEN>"
}],
"contexts": [{
"name": "<USER_NAME>@cluster.local",
"user": "<USER_NAME>",
"cluster": "cluster.local"
}]
},
"info": {},
"tags": ["k8s-control-plane"]
}' -t 100000

Verify Thing Registration
npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_SERVER_K8S_NVIDIA> -m isOnline -d '{"id": "<THING_ID>"}' -t 100000

Register Rack
npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_ORK> -m registerRack -d '{
"id": "<RACK_ID>",
"type": "server",
"info": {
"rpcPublicKey": "<RPC_PUBLIC_KEY_OF_WRK_SERVER_K8S_NVIDIA>",
"location": "IN"
}
}' -t 100000

9. Creating Jobs
Example: Create Inference Job
npx hp-rpc-cli -s <RPC_PUBLIC_KEY_OF_WRK_ORK> -m createJob -d '{
"gpu": {
"count": 1,
"type": "GA106_RTX_A2000_12GB"
},
"location": "IN",
"userId": 1,
"name": "new-inference-job",
"tags": ["inference"],
"type": "INFERENCE_VLLM",
"params": {
"docker_image": "abhi07/cosmicac-wrk-agent-inference:latest",
"image_pull_policy": "Always",
"namespace": "default",
"config_debug": "1",
"model_name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"model_source": "huggingface",
"agent_topic": "@cosmicac/agent-inference",
"is_managed_inference": "true",
"handshake_secret": "secret",
"crypto_key": "a1234567890b1234567890c1234567890",
"crypto_algo": "hmac-sha384",
"vllm_startup_timeout_ms": "300000",
"node_env": "development",
"swap_space": "0",
"dtype": "float16",
"enforce_eager": "true",
"env": [
{
"name": "HF_TOKEN",
"valueFrom": {
"secretKeyRef": {
"name": "hf-token-secret",
"key": "HF_TOKEN"
}
}
}
],
"cpu_limit": "4",
"memory_limit": "4Gi",
"cpu_request": "2",
"memory_request": "4Gi"
}
}' -t 100000

10. Troubleshooting
Common Issues
PM2 Processes Keep Restarting
# Check logs for errors
pm2 logs --err
# Check specific process logs
pm2 logs app-node-0 --lines 100

Status Files Not Found
Ensure workers are started in the correct order. The status files are created when workers initialize:
# Check if status files exist
ls -la ~/cosmicac-wrk-ork/status/
ls -la ~/cosmicac-app-node/status/

Node Version Issues
# Verify you're using Node 20
node --version
# If wrong version, switch
nvm use 20

Permission Denied
# Ensure you're running as cosmicac user
whoami
# If not, switch to cosmicac
sudo -u cosmicac -i

Log Locations
PM2 logs are stored in:
~/.pm2/logs/

View all available logs:
ls -la ~/.pm2/logs/

Health Check Script
Create ~/health-check.sh:
#!/bin/bash
echo "=== CosmicAC Health Check ==="
echo ""
echo "PM2 Status:"
pm2 jlist | jq -r '.[] | "\(.name): \(.pm2_env.status)"'
echo ""
echo "Status Files:"
for dir in cosmicac-wrk-ork cosmicac-app-node cosmicac-wrk-server-k8s-nvidia cosmicac-proxy-inference; do
if [ -d "$HOME/$dir/status" ]; then
echo " ✓ $dir/status exists"
else
echo " ✗ $dir/status missing"
fi
done
echo ""
echo "Ports in use:"
netstat -tlnp 2>/dev/null | grep -E ':(3000|8000)' || echo "  No relevant ports found"

Make it executable and run it:
chmod +x ~/health-check.sh
./health-check.sh

Quick Reference
Directory Structure
/home/cosmicac/
├── .gitconfig
├── .nvm/
├── pm2/ # PM2 configuration folder
│ ├── package.json
│ ├── stg.ecosystem.config.js
│ ├── dev.ecosystem.config.js
│ └── node_modules/
├── setup-repos.sh
├── autobase-connect.js
├── health-check.sh
├── cosmicac-wrk-ork/
├── cosmicac-app-node/
├── cosmicac-ui/
├── cosmicac-wrk-server-k8s-nvidia/
├── cosmicac-proxy-inference/
└── tether-wrk-ext-sendgrid/

Startup Sequence
1. cd ~/pm2 && npm install (first time only)
2. pm2 start stg.ecosystem.config.js --only wrk-ork-0
3. Wait for wrk-ork initialization
4. Configure app-node:
   - Copy rpcPublicKey from wrk-ork status to app-node/config/common.json
   - Set staticRootPath to /home/cosmicac/cosmicac-ui/
   - Configure OAuth2 in config/facs/httpd-oauth2.config.json
5. pm2 start stg.ecosystem.config.js --only app-node-0
6. Wait for app-node initialization
7. pm2 start stg.ecosystem.config.js (starts remaining workers)
8. cd ~ && node autobase-connect.js
9. Register things and racks as needed
Environment Checklist
- [ ] User cosmicac and group created
- [ ] Sudoers configured (/etc/sudoers.d/cosmicac)
- [ ] Team members added to cosmicac group
- [ ] Git configured (.gitconfig with HTTPS redirects)
- [ ] Rootless Docker configured:
  - [ ] System sysctl settings applied
  - [ ] subuid/subgid configured
  - [ ] User lingering enabled
  - [ ] Docker service override created
  - [ ] Docker daemon running (systemctl --user status docker)
- [ ] NVM installed
- [ ] Node 20 installed and set as default
- [ ] PM2 installed globally (user-level)
- [ ] hp-rpc-cli installed globally (user-level)
- [ ] Caddy installed and configured (/etc/caddy/Caddyfile)
- [ ] Caddy service running (systemctl status caddy)
- [ ] All repositories cloned (on dev branch, will switch to stg later)
- [ ] setup-repos.sh executed successfully
- [ ] pm2 folder copied to home directory
- [ ] npm install run in ~/pm2
- [ ] autobase-connect.js in place
- [ ] wrk-ork started and status file created
- [ ] app-node configured:
  - [ ] orks.rpcPublicKey added to config/common.json
  - [ ] staticRootPath set to /home/cosmicac/cosmicac-ui/
  - [ ] OAuth2 configured in config/facs/httpd-oauth2.config.json
- [ ] All PM2 processes running
- [ ] Autobase connection established
Branch Reference
| Repository | Current Branch | Target Branch (Staging) |
|---|---|---|
| cosmicac-wrk-ork | dev | stg |
| cosmicac-app-node | dev | stg |
| cosmicac-ui | dev | stg |
| cosmicac-wrk-server-k8s-nvidia | dev | stg |
| cosmicac-proxy-inference | dev | stg |
| tether-wrk-ext-sendgrid | dev | stg |
DNS Setup (missing from the steps above)
Go to dash.cloudflare.com > domains > DNS and add a DNS record for the deployment:
- Record: cosmicac.tether.su
- IP: <ip of the server>
- Proxied: true
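A quick resolution check, as a sketch. Note that with Proxied enabled, the lookup should return Cloudflare edge IPs rather than the origin server's IP:

```shell
# Sketch: resolve the record; expect Cloudflare edge IPs when proxied.
HOST=cosmicac.tether.su
if command -v dig >/dev/null 2>&1; then
  RESULT=$(dig +short "$HOST")
else
  RESULT=$(getent hosts "$HOST" || true)
fi
echo "${RESULT:-no answer (resolver offline?)}"
```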
GCP Setup
Provision the server and firewall rules using the following Terraform configuration:
# This code is compatible with Terraform 4.25.0 and versions that are backwards compatible to 4.25.0.
# For information about validating this Terraform code, see https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/google-cloud-platform-build#format-and-validate-the-configuration
resource "google_compute_instance" "prod-cosmicac-0" {
boot_disk {
auto_delete = true
device_name = "prod-cosmicac-0"
initialize_params {
image = "projects/ubuntu-os-cloud/global/images/ubuntu-minimal-2404-noble-amd64-v20260325"
size = 150
type = "pd-balanced"
}
mode = "READ_WRITE"
}
can_ip_forward = false
deletion_protection = false
enable_display = false
labels = {
goog-ec-src = "vm_add-tf"
goog-ops-agent-policy = "v2-template-1-7-0"
}
machine_type = "e2-custom-16-32768"
metadata = {
enable-osconfig = "TRUE"
ssh-keys = "chetas:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCwcxwlMknWSIi3YDarjByHONbVLtSvjiFw0PXh+LDWbMlzWc3Zniiiz3MfaeyZltWaFIIicz+ikz9zMfvPr3Um2BGadBpzQbm0lqMMQCHy3D4t7MojzS+O5S7urIp9mPBMDwpv1vL3XudeM8yj3vYmGbPV/uKfIu0Aucy0yKKpGH/LaUmzePFEaUHSAYSNmp+BfyMWR5Un0nluJ2k8SZJfitRMOl/ALgEwmRCEQB3rJb6PMqXXh9xAScl39PTUREbFvCQJrw/efaFFfZhbKFrojTQRlky3s4HS5uh2kh1KZvrErsC3yuPex9P/8qCNjnuoU8pxAbc5uSy7wtjvCMsle7dZ1FczxAXJAtJgDtrofX5LjznUkPwBpEtwyjBvgq4BXsGBj8V3V9vHBgprSzGXPOP/Bosg+iy7K3BBYkE4MaJF2cLVH+g3+LK7BM5brier4BBSqa9dgEjsGrSNjnpiO2v15iWJW3R1a+6LmYNdqbzi16lgizaby/fKRjxyqvr9sUYJVrimaYmyNgfcDNrSA3PbYbMjDTWgujiBbRBXsuhnF/59T+84KdnHDC49gy5GQUXez3tOEbu/2JkDjxZK5C7Zj+aujpp1osgVXkPRhDpPzj4RiAK16cQMPTZoHvbNbbJ1cYfB12GNWja8iZMwyT347ykkyDMyy/XRwRtGew== chetas"
}
name = "prod-cosmicac-0"
network_interface {
access_config {
nat_ip = "34.122.95.57"
network_tier = "PREMIUM"
}
queue_count = 0
stack_type = "IPV4_ONLY"
subnetwork = "projects/tether-data-sec-cosmicac/regions/europe-west6/subnetworks/prd-private-subnet"
}
reservation_affinity {
type = "ANY_RESERVATION"
}
scheduling {
automatic_restart = true
on_host_maintenance = "MIGRATE"
preemptible = false
provisioning_model = "STANDARD"
}
service_account {
email = "846467450615-compute@developer.gserviceaccount.com"
scopes = ["https://www.googleapis.com/auth/devstorage.read_only", "https://www.googleapis.com/auth/logging.write", "https://www.googleapis.com/auth/monitoring.write", "https://www.googleapis.com/auth/service.management.readonly", "https://www.googleapis.com/auth/servicecontrol", "https://www.googleapis.com/auth/trace.append"]
}
shielded_instance_config {
enable_integrity_monitoring = true
enable_secure_boot = false
enable_vtpm = true
}
tags = ["prod"]
zone = "europe-west6-c"
}
module "ops_agent_policy" {
source = "github.com/terraform-google-modules/terraform-google-cloud-operations/modules/ops-agent-policy"
project = "tether-data-sec-cosmicac"
zone = "europe-west6-c"
assignment_id = "goog-ops-agent-v2-template-1-7-0-europe-west6-c"
agents_rule = {
package_state = "installed"
version = "latest"
}
instance_filter = {
all = false
inclusion_labels = [{
labels = {
goog-ops-agent-policy = "v2-template-1-7-0"
}
}]
}
}

Firewall rules:
gcloud compute --project=tether-data-sec-cosmicac firewall-rules create prod-cosmicac-cf --direction=INGRESS --priority=1000 --network=prd-vpc --action=ALLOW --rules=tcp:443 --source-ranges=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22 --target-tags=prod
gcloud compute --project=tether-data-sec-cosmicac firewall-rules create prod-cosmicac-chetas --direction=INGRESS --priority=1000 --network=prd-vpc --action=ALLOW --rules=tcp:22 --source-ranges=64.227.130.182/32 --target-tags=prod

Known gaps / manual steps:
- Have to manually add the hypermq key
- Kubernetes config setup is missing
- Had to create the superuser manually and set up pricing
- Incorrect config setup for OAuth (missing /callback in callbackUriUI)
Additional Notes: SendGrid Setup

tether-wrk-ext-sendgrid
1. Clone the repo and install its dependencies with npm i
2. Run ./setup-config.sh
3. Update the following config files:
a. config/sendgrid.ext.json
{
"apiKey": "", // only need to add sendgrid api key here
"defaultTemplate": "cosmicac",
"overrideEmailSender": ""
}
b. config/facs/net.config.json
{
"r0": {
"allow": [], // Add rpcClientKey here of the app-node
"allowLocal": true
}
}
4. Run the worker with:
node worker.js --wtype wrk-ext-sendgrid --env development
5. Configure this worker in cosmicac-app-node, in config/common.json, by updating:
"emailService": {
"rpcPublicKey": "EMAIL_SERVICE_RPC_PUBLIC_KEY", // rpcPublickey of the ext-sendgrid worker.
"from": {
"name": "Cosmicac No Reply",
"email": "EMAIL_SENDER" // add the email that we will use for prod
},
"template": { // update those url according to the prod url.
"pwdIcon": "https://dev-cosmicac.tail8a2a3f.ts.net/assets/email-reset-password.png",
"pwdResetURL": "https://dev-cosmicac.tail8a2a3f.ts.net/new-password"
}
},