Chaos Toolkit is a tool that simplifies running chaos engineering experiments. The official documentation describes it as follows:
The Chaos Toolkit aims to be the simplest and easiest way to explore building your own Chaos Engineering Experiments.
The walkthrough below uses the official sample application to show how to use Chaos Toolkit.
mkdir demo && cd demo
pipenv --python 3.7
# pipenv install -U pip
pip3 install -U chaostoolkit
pip3 install -U chaostoolkit-kubernetes
pip3 install -U chaostoolkit-reporting
brew install cairo pango gdk-pixbuf libffi
Copy the files from the a-simple-walkthrough directory of chaostoolkit-demo into the demo directory.
Install the dependencies:
pipenv install astral
pipenv install cherrypy
pipenv install pytz
pipenv install requests
Start the two services:
cp valid-cert.pem cert.pem
nohup python3 astre.py &
nohup python3 sunset.py &
Run the experiment:
chaos run experiment.json
Output:
[2019-05-20 14:35:29 INFO] Validating the experiment's syntax
[2019-05-20 14:35:29 INFO] Experiment looks valid
[2019-05-20 14:35:29 INFO] Running experiment: What is the impact of an expired certificate on our application chain?
[2019-05-20 14:35:29 INFO] Steady state hypothesis: Application responds
[2019-05-20 14:35:29 INFO] Probe: the-astre-service-must-be-running
[2019-05-20 14:35:29 CRITICAL] Steady state probe 'the-astre-service-must-be-running' is not in the given tolerance so failing this experiment
[2019-05-20 14:35:29 INFO] Let's rollback...
[2019-05-20 14:35:29 INFO] Rollback: swap-to-vald-cert
[2019-05-20 14:35:29 INFO] Action: swap-to-vald-cert
[2019-05-20 14:35:29 INFO] Rollback: None
[2019-05-20 14:35:29 INFO] Action: restart-astre-service-to-pick-up-certificate
[2019-05-20 14:35:29 INFO] Rollback: None
[2019-05-20 14:35:29 INFO] Action: restart-sunset-service-to-pick-up-certificate
[2019-05-20 14:35:29 INFO] Pausing after activity for 1s...
[2019-05-20 14:35:30 INFO] Experiment ended with status: failed
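The line that reports why the experiment failed (the probe found no astre.pid file, so the steady state was not met and the method never ran):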
[2019-05-20 14:35:29 CRITICAL] Steady state probe 'the-astre-service-must-be-running' is not in the given tolerance so failing this experiment
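With longer runs it can help to pick out the failing lines automatically. The following is a minimal sketch that assumes the log keeps the `[timestamp LEVEL] message` shape shown above; the helper name `critical_messages` is made up for illustration and is not part of Chaos Toolkit.

```python
import re

# Matches lines such as "[2019-05-20 14:35:29 INFO] Experiment looks valid".
LOG_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Z]+)\] (.*)$"
)


def critical_messages(lines):
    """Return the message part of every CRITICAL log line."""
    messages = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "CRITICAL":
            messages.append(match.group(3))
    return messages


# A few lines copied from the run above.
sample_log = [
    "[2019-05-20 14:35:29 INFO] Probe: the-astre-service-must-be-running",
    "[2019-05-20 14:35:29 CRITICAL] Steady state probe 'the-astre-service-must-be-running' "
    "is not in the given tolerance so failing this experiment",
    "[2019-05-20 14:35:29 INFO] Let's rollback...",
]

for message in critical_messages(sample_log):
    print(message)
```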
Generate an HTML report from the run journal:
chaos report --export-format=html journal.json report.html
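A chaos experiment is described entirely by the experiment.json file, which is made up mainly of the parts below.
The experiment in this example asks one question: does the system keep working after its SSL certificate has expired?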
steady-state-hypothesis: the JSON fragment that confirms the system is running normally, meaning both pid files exist and the HTTP request returns a 200 response
"steady-state-hypothesis": {
  "title": "Application responds",
  "probes": [
    {
      "type": "probe",
      "name": "the-astre-service-must-be-running",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "os.path",
        "func": "exists",
        "arguments": {
          "path": "astre.pid"
        }
      }
    },
    {
      "type": "probe",
      "name": "the-sunset-service-must-be-running",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "os.path",
        "func": "exists",
        "arguments": {
          "path": "sunset.pid"
        }
      }
    },
    {
      "type": "probe",
      "name": "we-can-request-sunset",
      "tolerance": 200,
      "provider": {
        "type": "http",
        "timeout": 3,
        "verify_tls": false,
        "url": "https://localhost:8443/city/Paris"
      }
    }
  ]
}
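To make the probe and tolerance idea concrete, here is a minimal sketch of how a `python` provider could be evaluated: import the module, call the function with the declared arguments, and compare the result with the tolerance. The helper names `run_python_probe` and `within_tolerance` are assumptions for illustration, and the equality check is a simplification; this is not Chaos Toolkit's actual implementation.

```python
import importlib


def run_python_probe(provider):
    """Import provider['module'] and call provider['func'] with its arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


def within_tolerance(value, tolerance):
    """Simplified rule (an assumption): the result must equal the tolerance."""
    return value == tolerance


# The first probe from the fragment above, written as a Python dict.
probe = {
    "name": "the-astre-service-must-be-running",
    "tolerance": True,
    "provider": {
        "type": "python",
        "module": "os.path",
        "func": "exists",
        "arguments": {"path": "astre.pid"},
    },
}

result = run_python_probe(probe["provider"])
status = "ok" if within_tolerance(result, probe["tolerance"]) else "out of tolerance"
print(probe["name"], status)
```

Run from a directory without astre.pid, the probe returns False, which does not match the tolerance of true; that is the situation the CRITICAL log line in the run above reports.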
method: the activities that change the system state (swap in the expired certificate, read its expiry date, then signal both services so they pick it up)
{
  "type": "action",
  "name": "swap-to-expired-cert",
  "provider": {
    "type": "process",
    "path": "cp",
    "arguments": "expired-cert.pem cert.pem"
  }
},
{
  "type": "probe",
  "name": "read-tls-cert-expiry-date",
  "provider": {
    "type": "process",
    "path": "openssl",
    "arguments": "x509 -enddate -noout -in cert.pem"
  }
},
{
  "type": "action",
  "name": "restart-astre-service-to-pick-up-certificate",
  "provider": {
    "type": "process",
    "path": "pkill",
    "arguments": "--echo -HUP -F astre.pid"
  }
},
{
  "type": "action",
  "name": "restart-sunset-service-to-pick-up-certificate",
  "provider": {
    "type": "process",
    "path": "pkill",
    "arguments": "--echo -HUP -F sunset.pid"
  },
  "pauses": {
    "after": 1
  }
}
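A `process` provider boils down to running an external command. The sketch below shows one way the `path` and `arguments` fields could be turned into a command line, assuming `arguments` is a single string as in the fragment above. The helpers `build_command` and `run_process` are hypothetical names, and the real toolkit may handle arguments differently.

```python
import shlex
import subprocess


def build_command(provider):
    """Turn a 'process' provider into an argv list, e.g. ['cp', 'a', 'b']."""
    return [provider["path"], *shlex.split(provider.get("arguments", ""))]


def run_process(provider, timeout=10):
    """Run the command and return (exit code, captured standard output)."""
    completed = subprocess.run(
        build_command(provider),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


# The first action of the method, written as a Python dict.
swap_to_expired_cert = {
    "type": "process",
    "path": "cp",
    "arguments": "expired-cert.pem cert.pem",
}

# Only print the command here; calling run_process would really copy the file.
print(build_command(swap_to_expired_cert))
```

Splitting the string with `shlex.split` rather than passing it to a shell keeps quoting predictable and avoids shell injection through the arguments.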
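That covers a complete Chaos Toolkit run from start to finish. Basic usage is fairly clear; in my view the key question is how to extend the kinds of actions available in method. The chaostoolkit-incubator repository defines quite a few plugins and extensions, which I plan to look into later.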
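As a first step toward such an extension, the `python` provider used by the probes above suggests that any importable function could serve as an action. The sketch below is only an illustration under that assumption: the function `swap_certificate` and the module name `my_actions` are hypothetical and do not come from the demo or from any published extension.

```python
import shutil


def swap_certificate(source, target="cert.pem"):
    """Copy the certificate at `source` over `target` and return `target`."""
    shutil.copyfile(source, target)
    return target


# How the activity might be declared in experiment.json, shown as a Python dict.
# "my_actions" is a hypothetical module that would contain swap_certificate.
swap_action = {
    "type": "action",
    "name": "swap-to-expired-cert",
    "provider": {
        "type": "python",
        "module": "my_actions",
        "func": "swap_certificate",
        "arguments": {"source": "expired-cert.pem", "target": "cert.pem"},
    },
}

print(swap_action["provider"]["func"])
```

Compared with calling `cp` through a `process` provider, a Python function does not depend on the tools installed on the host and can return a value for the journal.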