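Production-Grade Docker Swarm and Stack Operations with One-Command Ansible Deployment
For containerized applications in production, single-host Docker Compose cannot provide high availability, while the operational and learning cost of Kubernetes puts it out of reach for many small and mid-sized teams. Docker Swarm driven by Ansible is a lightweight alternative that still covers the main production requirements.
This article shares a set of Ansible playbooks, validated against high-concurrency production traffic, that initialize a Swarm cluster with one command and deploy a production-grade Docker Stack with strict health checks, cluster-wide log collection (Promtail), metrics monitoring, and automatic scaling.
1. Project Layout and Environment Preparation
Create the following directory structure on the Ansible control node: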
ansible-docker-swarm/
├── group_vars/
│   └── all.yml               # Global variables (versions, networks, limits, etc.)
├── inventory.ini             # Host inventory
├── playbooks/
│   └── deploy_swarm.yml      # Main entry point for cluster deployment
└── templates/
    ├── daemon.json.j2        # Docker Engine configuration template
    └── docker-stack.yml.j2   # Stack deployment template
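Host inventory inventory.ini
The plan is 3 manager nodes and 2 worker nodes. In production, run at least 3 managers: Raft stays available only while a majority of managers is reachable, so 3 managers tolerate the loss of 1.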
[swarm_managers]
manager-01 ansible_host=192.168.10.11 node_role=manager
manager-02 ansible_host=192.168.10.12 node_role=manager
manager-03 ansible_host=192.168.10.13 node_role=manager
[swarm_workers]
worker-01 ansible_host=192.168.10.21 node_role=worker
worker-02 ansible_host=192.168.10.22 node_role=worker
[swarm:children]
swarm_managers
swarm_workers
[swarm:vars]
ansible_user=root
ansible_port=22
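The three-manager recommendation follows from Raft's majority rule, which a few lines of arithmetic make concrete. The sketch below is illustrative only; the function names are made up for this example and are not part of Docker or Ansible.

```python
def raft_quorum(managers: int) -> int:
    """Smallest number of managers that still forms a majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def fault_tolerance(managers: int) -> int:
    """How many managers can fail while a majority remains."""
    return managers - raft_quorum(managers)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n} managers: quorum={raft_quorum(n)}, "
              f"tolerated failures={fault_tolerance(n)}")
```

Note that 4 managers tolerate no more failures than 3, which is why odd manager counts are the usual choice.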
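2. Kernel and Docker Daemon Tuning
A production Docker host needs tuning for file descriptor limits, network and virtual memory kernel parameters, and log rotation. Without it, container networking can stall after an OOM event, or container logs can fill the disk.
Global variables group_vars/all.yml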
---
docker_version: "24.0.7"
docker_dns_servers:
- "223.5.5.5"
- "114.114.114.114"
log_max_size: "50m"
log_max_file: "3"
loki_url: "http://192.168.10.100:3100/loki/api/v1/push"
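The docker_version variable is later combined with the Ubuntu release to pin an apt package version. As far as I know, Docker's apt repository names recent releases (23.0 and later) in the form 5:24.0.0-1~ubuntu.22.04~jammy; the sketch below composes that string under this assumption. The helper name is hypothetical, and apt-cache madison docker-ce on the target host remains the authoritative list of available versions.

```python
def docker_apt_pin(docker_version: str, ubuntu_version: str, codename: str) -> str:
    """Compose the apt version string for docker-ce / docker-ce-cli.

    Assumes the naming scheme used for Docker 23.0+ packages on Ubuntu:
    5:<engine version>-1~ubuntu.<ubuntu version>~<codename>
    """
    return f"5:{docker_version}-1~ubuntu.{ubuntu_version}~{codename}"


if __name__ == "__main__":
    # Values: docker_version from group_vars/all.yml, Ubuntu 22.04 (jammy) as an example host.
    print("docker-ce=" + docker_apt_pin("24.0.7", "22.04", "jammy"))
```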
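Docker configuration template templates/daemon.json.j2
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "{{ log_max_size }}",
    "max-file": "{{ log_max_file }}"
  },
  "storage-driver": "overlay2",
  "dns": {{ docker_dns_servers | to_json }},
  "live-restore": false,
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
Note: live-restore must stay disabled on Swarm nodes. The option keeps standalone containers running while the Docker daemon is upgraded or restarted, but Docker documents it as incompatible with swarm mode, and docker swarm init refuses to run while it is enabled. The overlay2.override_kernel_check storage option is left out as well: it has been deprecated since Docker 19.03 and is listed as removed in 24.0.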
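A broken daemon.json keeps the Docker daemon from starting, so it can be worth checking the rendered file before restarting the service. The sketch below is one possible pre-flight check, not part of the playbook: check_daemon_json is a hypothetical helper, and it only covers two rules, valid JSON and string-typed log-opts values (Docker requires log-opts values in daemon.json to be strings, which is why max-file is quoted).

```python
import json


def check_daemon_json(text):
    """Return a list of problems found in a rendered daemon.json."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    log_opts = config.get("log-opts", {})
    if not isinstance(log_opts, dict):
        problems.append("log-opts must be an object")
    else:
        for key, value in log_opts.items():
            # Numbers and booleans must be quoted in daemon.json log-opts.
            if not isinstance(value, str):
                problems.append(
                    f"log-opts.{key} must be a string, got {type(value).__name__}"
                )

    dns = config.get("dns", [])
    if not isinstance(dns, list) or not all(isinstance(item, str) for item in dns):
        problems.append("dns must be a list of strings")
    return problems


# Example input: what the template renders to with the values from group_vars/all.yml.
RENDERED = """
{
  "log-driver": "json-file",
  "log-opts": {"max-size": "50m", "max-file": "3"},
  "dns": ["223.5.5.5", "114.114.114.114"]
}
"""

if __name__ == "__main__":
    print(check_daemon_json(RENDERED) or "no problems found")
```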
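3. Writing the Core Ansible Playbook
The playbook covers the whole flow: base environment setup, kernel tuning, installation of a pinned Docker Engine version, automatic Swarm initialization, and distribution of the manager and worker join tokens.
Main playbook playbooks/deploy_swarm.yml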
---
- name: Production environment initialization and Docker installation
  hosts: swarm
  become: yes
  tasks:
    # The net.bridge.* sysctl keys only exist once br_netfilter is loaded.
    - name: Load the br_netfilter kernel module
      modprobe:
        name: br_netfilter
        state: present
    - name: Tune kernel parameters
      sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        sysctl_file: /etc/sysctl.d/99-docker-performance.conf
      with_dict:
        net.bridge.bridge-nf-call-iptables: "1"
        net.bridge.bridge-nf-call-ip6tables: "1"
        net.ipv4.ip_forward: "1"
        fs.file-max: "2097152"
        vm.max_map_count: "262144"
    - name: Install base dependencies
      apt:
        name:
          - apt-transport-https
          - ca-certificates
          - curl
          - gnupg
          - lsb-release
          - python3-pip
        state: present
        update_cache: yes
    - name: Add the official Docker GPG key
      apt_key:
        url: https://download.docker.com/linux/ubuntu/gpg
        state: present
    - name: Add the Docker APT repository
      apt_repository:
        repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
        state: present
    # Version strings for Docker 23.0+ look like 5:24.0.7-1~ubuntu.22.04~jammy;
    # confirm the exact string with `apt-cache madison docker-ce`.
    - name: Install the pinned Docker Engine version
      apt:
        name:
          - "docker-ce=5:{{ docker_version }}-1~ubuntu.{{ ansible_distribution_version }}~{{ ansible_distribution_release }}"
          - "docker-ce-cli=5:{{ docker_version }}-1~ubuntu.{{ ansible_distribution_version }}~{{ ansible_distribution_release }}"
          - containerd.io
        state: present
    - name: Deploy the tuned daemon.json
      template:
        src: ../templates/daemon.json.j2
        dest: /etc/docker/daemon.json
        mode: '0644'
      notify: Restart Docker
    - name: Install the Python docker module used by Ansible
      pip:
        name: docker
        state: present
  handlers:
    - name: Restart Docker
      systemd:
        name: docker
        state: restarted
        daemon_reload: yes
- name: Highly available Docker Swarm initialization
  hosts: swarm_managers
  become: yes
  tasks:
    - name: Check the Swarm state
      # raw/endraw stops Ansible from templating the Go format string.
      shell: docker info --format '{% raw %}{{.Swarm.LocalNodeState}}{% endraw %}'
      register: swarm_status
      changed_when: false
    - name: Initialize the first manager node
      shell: >
        docker swarm init
        --advertise-addr {{ ansible_host }}
        --data-path-port 4789
      when:
        - inventory_hostname == groups['swarm_managers'][0]
        - swarm_status.stdout != "active"
    - name: Fetch the manager join token
      shell: docker swarm join-token -q manager
      register: manager_token_cmd
      changed_when: false
      when: inventory_hostname == groups['swarm_managers'][0]
    - name: Fetch the worker join token
      shell: docker swarm join-token -q worker
      register: worker_token_cmd
      changed_when: false
      when: inventory_hostname == groups['swarm_managers'][0]
    - name: Share the tokens through an in-memory host
      add_host:
        name: "SWARM_VARIABLES"
        manager_token: "{{ hostvars[groups['swarm_managers'][0]]['manager_token_cmd']['stdout'] }}"
        worker_token: "{{ hostvars[groups['swarm_managers'][0]]['worker_token_cmd']['stdout'] }}"
      when: inventory_hostname == groups['swarm_managers'][0]
    - name: Join the remaining managers to the cluster
      shell: >
        docker swarm join
        --token {{ hostvars['SWARM_VARIABLES']['manager_token'] }}
        {{ hostvars[groups['swarm_managers'][0]]['ansible_host'] }}:2377
      when:
        - inventory_hostname != groups['swarm_managers'][0]
        - swarm_status.stdout != "active"
- name: Join the Swarm worker nodes
  hosts: swarm_workers
  become: yes
  tasks:
    - name: Check the Swarm state
      shell: docker info --format '{% raw %}{{.Swarm.LocalNodeState}}{% endraw %}'
      register: swarm_status
      changed_when: false
    - name: Join the worker to the cluster
      shell: >
        docker swarm join
        --token {{ hostvars['SWARM_VARIABLES']['worker_token'] }}
        {{ hostvars[groups['swarm_managers'][0]]['ansible_host'] }}:2377
      when: swarm_status.stdout != "active"
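The when conditions in the Swarm plays above amount to a small decision table: a node that already reports active is left alone, the first manager in the inventory initializes the cluster, and every other node joins with the token for its role. The sketch below restates that logic as a plain function so it can be reasoned about outside Ansible; plan_swarm_action and its return values are hypothetical names for this illustration.

```python
def plan_swarm_action(hostname, managers, workers, local_state):
    """Decide what the playbook would do on one host.

    local_state is the value of `docker info` field Swarm.LocalNodeState,
    for example "inactive" or "active".
    """
    if local_state == "active":
        return "skip"  # already part of a swarm, so the run stays idempotent
    if hostname == managers[0]:
        return "init"  # the first manager bootstraps the cluster
    if hostname in managers:
        return "join-as-manager"
    if hostname in workers:
        return "join-as-worker"
    raise ValueError(f"{hostname} is not in the inventory")


if __name__ == "__main__":
    managers = ["manager-01", "manager-02", "manager-03"]
    workers = ["worker-01", "worker-02"]
    for host in managers + workers:
        print(host, "->", plan_swarm_action(host, managers, workers, "inactive"))
```

Because the decision depends only on the reported state, running the playbook a second time against a healthy cluster results in skip on every host.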
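4. Production-Grade Docker Stack Template
Next, the workload is brought up in the Swarm cluster from a template. The stack contains three core services:
- app: the core business container, with a strict health check based on the HTTP status code.
- promtail: deployed in global mode (one instance per node); it tails the JSON logs of every container on the host, parses and labels them, and pushes them to the central Loki.
- autoscaler: Swarm has no built-in equivalent of the Kubernetes HPA (Horizontal Pod Autoscaler), so a lightweight autoscaling service that talks to the Docker API is added to adjust the replica count of app dynamically.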
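Threshold-based scaling of the kind the autoscaler bullet describes can be sketched in a few lines. This is a hypothetical illustration of the general idea, not the algorithm of the autoscaler image used in this article: the function name and the one-replica step are assumptions, and the defaults mirror the label values in the stack template (2 to 10 replicas, scale up above 80% CPU, scale down below 20%).

```python
def desired_replicas(current, cpu_percent, *, min_replicas=2, max_replicas=10,
                     upper=80.0, lower=20.0):
    """Return the replica count after one scaling decision."""
    if cpu_percent > upper:
        target = current + 1      # overloaded: add one replica
    elif cpu_percent < lower:
        target = current - 1      # mostly idle: remove one replica
    else:
        target = current          # inside the band: leave the service alone
    # Never leave the configured replica range.
    return max(min_replicas, min(max_replicas, target))


if __name__ == "__main__":
    for replicas, cpu in [(3, 90.0), (10, 95.0), (3, 50.0), (3, 10.0), (2, 5.0)]:
        print(f"{replicas} replicas at {cpu}% CPU -> {desired_replicas(replicas, cpu)}")
```

The gap between the two thresholds acts as a dead band, which keeps the replica count from oscillating when load hovers around a single cut-off.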
Stack template templates/docker-stack.yml.j2
version: '3.8'
services:
  app:
    image: nginx:alpine
    ports:
      - "80:80"
    environment:
      - APP_ENV=production
    configs:
      - source: nginx_conf
        target: /etc/nginx/conf.d/default.conf
    deploy:
      replicas: 3
      labels:
        # Labels read by the autoscaler
        - "swarm.autoscaler.enable=true"
        - "swarm.autoscaler.min_replicas=2"
        - "swarm.autoscaler.max_replicas=10"
        - "swarm.autoscaler.metric=cpu"
        - "swarm.autoscaler.upper_threshold=80"
        - "swarm.autoscaler.lower_threshold=20"
      update_config:
        parallelism: 1
        delay: 15s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 1
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
    healthcheck:
      # 127.0.0.1 instead of localhost, which can resolve to ::1 while nginx listens on IPv4 only
      test: ["CMD-SHELL", "wget -q --spider http://127.0.0.1:80/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
  promtail:
    image: grafana/promtail:2.8.0
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log:ro
    configs:
      - source: promtail_config
        target: /etc/promtail/promtail.yml
    command: -config.file=/etc/promtail/promtail.yml
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
  autoscaler:
    image: ghcr.io/ringcentral/docker-swarm-autoscaler:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - PROMETHEUS_URL=http://192.168.10.100:9090
      - SCAN_INTERVAL=30s
    deploy:
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
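configs:
  # docker stack deploy accepts only file or external as a config source, so
  # each config points at a file rendered next to the stack file. Swarm configs
  # are immutable: rename a config when its content changes.
  nginx_conf:
    file: ./nginx.conf
  promtail_config:
    file: ./promtail.yml
The two config bodies live in their own templates beside docker-stack.yml.j2 (the file names nginx.conf.j2 and promtail.yml.j2 are arbitrary).
Nginx configuration templates/nginx.conf.j2
server {
    listen 80;
    location / {
        root /usr/share/nginx/html;
        index index.html;
    }
    location /health {
        access_log off;
        return 200 'OK';
    }
}
Promtail configuration templates/promtail.yml.j2
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: {{ loki_url }}
scrape_configs:
  - job_name: swarm-containers
    static_configs:
      - targets: [localhost]
        labels:
          job: container-logs
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            time: time
      - timestamp:
          source: time
          format: RFC3339Nano
5. One-Command Deployment and Operational Verification
Append the following play to playbooks/deploy_swarm.yml; it renders the stack files onto one manager and deploys them.
- name: Deploy the production Docker Stack
  hosts: swarm_managers
  become: yes
  run_once: true
  tasks:
    - name: Create a temporary deployment directory
      file:
        path: /tmp/docker-stack
        state: directory
        mode: '0755'
    - name: Render the stack file and the config files it references
      template:
        src: "../templates/{{ item }}.j2"
        dest: "/tmp/docker-stack/{{ item }}"
        mode: '0644'
      loop:
        - docker-stack.yml
        - nginx.conf
        - promtail.yml
    - name: Deploy or update the Docker Stack
      shell: docker stack deploy -c /tmp/docker-stack/docker-stack.yml prod_service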
Run the whole deployment with one command:
ansible-playbook -i inventory.ini playbooks/deploy_swarm.yml
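6. Production Monitoring and High-Availability Verification
Once the cluster is up, run the following checks to verify the key design goals of the architecture.
Verification 1: zero-downtime rolling update
When the image or configuration of the app service changes and the stack is redeployed, Swarm follows the order: start-first setting.
- Watch the tasks:
watch docker service ps prod_service_app
- You will see the new container start first (Status: Starting). Only after it passes its health check is the old container stopped, so traffic is not interrupted during the update.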
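Verification 2: health check failure handling
Simulate a fault in the app service. The /health location is answered by nginx itself (return 200), so deleting a file under the web root does not affect it. One way to make the check fail is to suspend the nginx worker processes, run on the node that hosts the task:
docker exec $(docker ps -q -f name=prod_service_app.1) sh -c 'kill -STOP $(pidof nginx)'
- Result: after 3 consecutive failed checks (interval 10s, timeout 5s, so roughly 20 to 45 seconds), the container status changes to unhealthy.
- Swarm scheduling: Swarm shuts down the unhealthy task and schedules a replacement container to restore the desired replica count.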
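How long a broken container takes to be flagged follows from the healthcheck settings in the stack (interval 10s, timeout 5s, retries 3). The sketch below estimates that window, assuming Docker's documented behaviour that each check starts one interval after the previous one completes and that the container becomes unhealthy after retries consecutive failures. The function name is made up, and the numbers are estimates rather than guarantees.

```python
def unhealthy_window(interval, timeout, retries):
    """Return (earliest, latest) seconds from fault to 'unhealthy'.

    earliest: the fault hits just before a check and every check fails at once.
    latest:   the fault hits just after a check and every check runs into the timeout.
    """
    earliest = (retries - 1) * interval
    latest = retries * (interval + timeout)
    return earliest, latest


if __name__ == "__main__":
    low, high = unhealthy_window(interval=10, timeout=5, retries=3)
    print(f"unhealthy after roughly {low} to {high} seconds")
```

Lowering interval or retries shortens detection time at the cost of more false positives during brief stalls.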
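Verification 3: centralized log shipping
- Check that Promtail is tailing logs on every host and pushing them to Loki:
docker service logs prod_service_promtail
- Log in to Grafana (with Loki added as a data source) and query {job="container-logs"} on the Explore page to search the standard output (stdout) and standard error (stderr) of every container in the Swarm cluster. With the static scrape config above, all streams share the single job label; filtering by service name or node name requires adding those labels in Promtail first.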
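The Promtail pipeline in the stack extracts the log, stream, and time fields from each line of Docker's json-file log and uses time as the entry timestamp. The sketch below mimics that extraction on one sample line to show what ends up in Loki; it is an illustration, not Promtail's implementation, and parse_docker_log_line and the sample line are made up for the example.

```python
import json
from datetime import datetime, timezone


def parse_docker_log_line(raw):
    """Extract message, stream, and timestamp from one json-file log line."""
    record = json.loads(raw)
    stamp = record["time"]
    if not stamp.endswith("Z"):
        raise ValueError("expected a UTC timestamp ending in 'Z'")
    # RFC3339Nano allows nine fractional digits; datetime keeps only six.
    seconds, _, fraction = stamp[:-1].partition(".")
    micros = (fraction + "000000")[:6]
    when = datetime.strptime(f"{seconds}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return {
        "line": record["log"].rstrip("\n"),
        "stream": record["stream"],
        "timestamp": when.replace(tzinfo=timezone.utc),
    }


SAMPLE = '{"log":"GET /health 200\\n","stream":"stdout","time":"2024-05-01T12:00:00.123456789Z"}'

if __name__ == "__main__":
    print(parse_docker_log_line(SAMPLE))
```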
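With this setup, a very light architecture delivers a highly available system with elastic scaling, rolling updates, and unified log collection. Compared with the complexity of Kubernetes, the one-command Swarm approach substantially reduces day-to-day maintenance work, which makes it a practical option for production rollouts.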