SOP Add Zabbix monitoring to the releng compose hosts
Resources
- [1] Ansible Zabbix module for managing templates: https://docs.ansible.com/ansible/latest/collections/community/zabbix/zabbix_template_module.html
- [2] Zabbix Sender pushing metrics: https://www.zabbix.com/documentation/6.0/en/manpages/zabbix_sender
- [3] Fedora Infra Docs: https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/#_standard_operating_procedures
- [4] Fedora Infra Docs Git Repo: https://pagure.io/infra-docs-fpo
- [5] Zabbix Staging Server: https://zabbix.stg.fedoraproject.org
- [6] Targeting groups for specific Ansible tasks: https://stackoverflow.com/questions/21008083/run-task-only-if-host-does-not-belong-to-a-group
- [7] Zabbix Plugins: https://www.zabbix.com/documentation/guidelines/en/plugins
- [8] Zabbix Scripts: https://www.zabbix.com/documentation/6.0/en/manual/web_interface/frontend_sections/administration/scripts
- [9] Monitor running time of specific process with Zabbix: https://www.zabbix.com/forum/zabbix-help/13350-monitor-running-time-of-a-specific-process
- [10] Zabbix proc num: https://www.zabbix.com/documentation/current/en/manual/appendix/items/proc_mem_num_notes
- [11] Zabbix added to Releng Hosts PR: https://pagure.io/fedora-infra/ansible/pull-request/1653#
- [12] Zabbix kvm Virtual Host template: https://www.zabbix.com/integrations/kvm
- [13] Releng Ansible cronjob installation: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/releng/tasks/main.yml
- [14] Fedora Infra Ticket: https://pagure.io/fedora-infrastructure/issue/11577
- [15] Failed Compose Monitoring: https://pagure.io/releng/failed-composes/issues
- [16] Fedora Infra Releng Compose Monitoring Software: https://pagure.io/releng/compose-tracker
- [17] Zabbix Production Server: https://zabbix.fedoraproject.org
- [18] Zabbix ansible host group: https://docs.ansible.com/ansible/latest/collections/community/zabbix/zabbix_group_module.html
Releng Machine List
The following machines are relevant to Releng:

    [releng_compose]
    compose-x86-01.iad2.fedoraproject.org
    compose-branched01.iad2.fedoraproject.org
    compose-rawhide01.iad2.fedoraproject.org
    compose-iot01.iad2.fedoraproject.org

    [releng_compose_stg]
    compose-x86-01.stg.iad2.fedoraproject.org
First, install the Zabbix agent on the releng_compose:releng_compose_stg hosts via the zabbix/zabbix_agent Ansible role [11]. We added the role to the groups/releng-compose.yml playbook, since that playbook is responsible for targeting these hosts.
    diff --git a/playbooks/groups/releng-compose.yml b/playbooks/groups/releng-compose.yml
    index 04b68aba4f..69c0acdad3 100644
    --- a/playbooks/groups/releng-compose.yml
    +++ b/playbooks/groups/releng-compose.yml
    @@ -28,6 +28,8 @@
       - ipa/client
       - rkhunter
       - nagios_client
    +  - zabbix/zabbix_agent
       - collectd/base
       - sudo
       - role: keytab/service
Run the playbook on the batcave01 host:

    sudo rbac-playbook groups/releng-compose.yml

Then check the hosts section of the Zabbix console to ensure the new hosts have been picked up by Zabbix [5][17]. To get access to the Zabbix server, your FAS user must be a member of the sysadmin-noc group. Once you are a member, run the following playbook:

    sudo rbac-playbook groups/zabbix.yml

Once it has run, you can authenticate via FAS on the Zabbix web console.
Requirements
No composes are run in the staging environment, so unfortunately this needs to be implemented in the production environment only.
Existing monitoring tracks whether a compose fails or finishes successfully; however, there is currently no monitoring to detect when a compose hangs.
Cronjobs are installed on the releng hosts via the following Ansible task [13]. There are 8 cronjobs in total.
- 1: ftbfs weekly cron job
  "ftbfs.cron" /etc/cron.weekly/ on compose-x86-01.iad2
- 2: branched compose cron
  "branched" /etc/cron.d/branched on compose-branched01.iad2
- 3: rawhide compose cron
  "rawhide" /etc/cron.d/rawhide on compose-rawhide01.iad2
- 4: cloud-updates compose cron
  "cloud-updates" /etc/cron.d/cloud-updates on compose-x86-01.iad2
- 5: container-updates compose cron
  "container-updates" /etc/cron.d/container-updates on compose-x86-01.iad2
- 6: clean-amis cron
  "clean-amis.j2" /etc/cron.d/clean-amis on compose-x86-01.iad2
- 7: rawhide-iot compose cron
  "rawhide-iot" /etc/cron.d/rawhide-iot on compose-iot01.iad2
- 8: sig_policy cron
  "sig_policy.j2" /etc/cron.d/sig_policy on compose-x86-01.iad2
We need at least one Zabbix check per cronjob. The check should work as follows:
- When a cronjob starts, create a file at /tmp/name-of-cron-job.
- When a cronjob ends, delete the file at /tmp/name-of-cron-job.
- If the file exists, assume the cron job is running. If the file has existed for longer than a set period, assume the cron job has stalled.
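Zabbix performs this check through the item and trigger described in the Implementation section. For a manual spot check on a compose host, the same logic can be sketched in shell. The function name marker_is_stale and the threshold argument are illustrative assumptions and are not part of the deployed setup; the sketch also assumes a find that supports -mmin and -maxdepth.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Ansible roles): succeed only when a
# cron marker file exists and is older than the given number of minutes.
marker_is_stale() {
    # $1: marker path, e.g. /tmp/name-of-cron-job
    # $2: age threshold in minutes
    # No marker file means the cron job is not running at all.
    [ -e "$1" ] || return 1
    # find prints the path only when its mtime is older than the threshold.
    [ -n "$(find "$1" -maxdepth 0 -mmin "+$2")" ]
}
```

For example, marker_is_stale /tmp/name-of-cron-job 60 returns success only if the marker has been in place for more than an hour.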
Implementation
- Create a custom template called "fedora releng compose cronjobs".
- Create a host group called "fedora releng compose".
- Add the Ansible hosts from the releng_compose group to this host group. This applies to production only, since we currently do not run composes in staging.
- In this template, create one item for each cronjob.
- In this template, create one trigger for each cronjob. Initially, set the trigger to alert when the item returns true for more than 1 hour. This can be changed later, once we understand how long these cron jobs run.
- Implement this template in JSON; see [12] for inspiration and format examples. The template can then be placed in roles/zabbix/zabbix_server/files/zabbix_templates/releng_compose_cronjobs.json.
- Create a task in roles/zabbix/zabbix_server/tasks that uses the zabbix_api key to create this template on the server; see [1].
- Use the community Ansible role for adding this template to the releng hosts.
- Update each cronjob in Ansible so that it creates a file such as /tmp/name-of-cron-job when starting and deletes it when completed.
Create a host group
    - name: Create host groups
      # set task level variables as we change ansible_connection plugin here
      community.zabbix.zabbix_group:
        state: present
        host_groups: "{{ item['hostgroup'] }}"
      with_items: "{{ zabbix_templates }}"  # Hostgroups specific to an ansible group can be overridden in inventory/group_vars/group_name
      run_once: True
      tags:
        - zabbix_hostgroups
      vars:
        ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
        ansible_network_os: community.zabbix.zabbix
        ansible_connection: httpapi
        ansible_httpapi_port: 443
        ansible_httpapi_use_ssl: true
        ansible_httpapi_validate_certs: false
        ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
        ansible_zabbix_url_path: ""  # If Zabbix WebUI runs on non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu
Add production releng_compose hosts to the Zabbix host group
    - name: Add hosts to hostgroups
      community.zabbix.zabbix_host:
        host_name: "{{ inventory_hostname }}"
        host_groups: "{{ item['hostgroup']}}"
        # link_templates: "{{ item['template'] }}"
        # We're adding the template to hostgroups in a separate step, may not be required.
        force: false
      with_items: "{{ zabbix_templates }}"
      tags:
        - zabbix_add_hosts_to_hostgroups
      vars:
        ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
        ansible_network_os: community.zabbix.zabbix
        ansible_connection: httpapi
        ansible_httpapi_port: 443
        ansible_httpapi_use_ssl: true
        ansible_httpapi_validate_certs: false
        ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
        ansible_zabbix_url_path: ""  # If Zabbix WebUI runs on non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu
Import a custom template
Using the community.zabbix.zabbix_template Ansible module, create a template.
Make sure to use JSON format. It may be easiest to configure the template in the Zabbix UI first and then export it. Make sure the JSON template is minimised before importing it back into Zabbix.
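As a rough sketch of the minimising step, assuming "minimised" here means stripping insignificant whitespace from the exported JSON, a small shell helper could look like the following. The function name minify_json is hypothetical, and the sketch assumes that either jq or python3 is available on the machine where the export is processed.

```shell
#!/bin/sh
# Hypothetical helper: read JSON on stdin, write compact JSON on stdout.
# Assumes jq or python3 is installed; neither is mandated by this SOP.
minify_json() {
    if command -v jq >/dev/null 2>&1; then
        # -c prints each JSON value on a single line without extra spaces.
        jq -c .
    else
        python3 -c 'import json, sys; json.dump(json.load(sys.stdin), sys.stdout, separators=(",", ":"))'
    fi
}
```

It could be used as minify_json < exported_template.json > releng_compose_cronjobs.json, where exported_template.json stands for whatever file name the Zabbix UI export produced.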
    #- name: Get Zabbix template as JSON
    #  community.zabbix.zabbix_template_info:
    #    template_name: fedora releng compose cronjobs
    #    format: json
    #    omit_date: yes
    #  register: zabbix_template_json

    #- name: Write Zabbix template to JSON file
    #  local_action:
    #    module: copy
    #    content: "{{ zabbix_template_json['template_json'] }}"
    #    dest: "roles/zabbix_server/files/zabbix_templates/releng_compose_cronjobs.json"

    - name: Import Zabbix templates from JSON
      community.zabbix.zabbix_template:
        template_json: "{{ lookup('file', item['template'] ) }}"
        state: present
      with_items: "{{ zabbix_templates }}"  # Templates specific to an ansible group can be overridden in inventory/group_vars/group_name
      tags:
        - zabbix_templates
      vars:
        ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
        ansible_network_os: community.zabbix.zabbix
        ansible_connection: httpapi
        ansible_httpapi_port: 443
        ansible_httpapi_use_ssl: true
        ansible_httpapi_validate_certs: false
        ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
        ansible_zabbix_url_path: ""  # If Zabbix WebUI runs on non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu
Add template to host groups
    - name: Add templates to hosts
      community.zabbix.zabbix_host:
        host_name: "{{ inventory_hostname }}"
        host_groups: "{{ item['hostgroup']}}"
        link_templates: "{{ item['template'] }}"
        force: false
      with_items: "{{ zabbix_templates }}"
      tags:
        - zabbix_add_templates_to_hosts
      vars:
        ansible_zabbix_auth_key: "{{ (env == 'staging')|ternary(zabbix_stg_apikey, zabbix_apikey) }}"
        ansible_network_os: community.zabbix.zabbix
        ansible_connection: httpapi
        ansible_httpapi_port: 443
        ansible_httpapi_use_ssl: true
        ansible_httpapi_validate_certs: false
        ansible_host: "{{ (env == 'staging')|ternary(zabbix_stg_hostname, zabbix_hostname) }}"
        ansible_zabbix_url_path: ""  # If Zabbix WebUI runs on non-default (zabbix) path, e.g. http://<FQDN>/zabbixeu
In this template create an item, one for each cronjob
- Configure the type as "Zabbix agent (active)".
- Configure the history to 7d (1 week).
- Configure the update interval to 60m so the item is checked every hour.
- Configure the key to match something like the following, replacing the * with the name of the cronjob, e.g. rawhide:

    vfs.file.exists[/tmp/fedora-compose-*]
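To illustrate how the key pattern expands, the sketch below prints one item key per cronjob. The short job names are an assumption derived from the cronjob list in the Requirements section (with file suffixes such as .cron and .j2 dropped), and item_key is a hypothetical helper, not something provided by Zabbix or the Ansible roles.

```shell
#!/bin/sh
# Assumed marker names, derived from the cronjob list in this SOP.
CRON_JOBS="ftbfs branched rawhide cloud-updates container-updates clean-amis rawhide-iot sig_policy"

# Build the Zabbix item key that checks for one cron job's marker file.
item_key() {
    printf 'vfs.file.exists[/tmp/fedora-compose-%s]\n' "$1"
}

# Print one key per cronjob, one per line.
for job in $CRON_JOBS; do
    item_key "$job"
done
```

Whatever names are chosen, the file name in each item key has to match the marker file that the corresponding cronjob actually creates.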
In this template create a trigger, one for each cronjob
- Configure the trigger period to 8 hours.
- Configure the severity to High.
- In the example expression, releng_compose_cronjobs.json is the name of the template. Referencing the template rather than a single host keeps the expression generic: when the template is applied to a host, the host gains the items and triggers contained in the template.
- Configure the expression to something like the following, replacing the file name with the one used in the key of the matching item:

    last(/releng_compose_cronjobs.json/vfs.file.exists[/tmp/fedora-compose-branched])=1 and min(/releng_compose_cronjobs.json/vfs.file.exists[/tmp/fedora-compose-branched],8h)>0
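The expression combines two conditions: last(...)=1 is true when the most recent check found the marker file, and min(...,8h)>0 is true only when every check in the past 8 hours found it, since vfs.file.exists returns 0 or 1. As a sketch, the same expression can be generated per cronjob with a small shell helper. The function name trigger_expression and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the trigger expression for one cron job.
trigger_expression() {
    # $1: template name, $2: cron job name, $3: time window (e.g. 8h)
    key="vfs.file.exists[/tmp/fedora-compose-$2]"
    printf 'last(/%s/%s)=1 and min(/%s/%s,%s)>0\n' "$1" "$key" "$1" "$key" "$3"
}

# Reproduces the example expression for the branched compose.
trigger_expression releng_compose_cronjobs.json branched 8h
```

Changing the third argument is enough to tighten or relax the window once typical compose durations are known.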
Modify each cronjob in ansible
- When a cronjob starts, create a file at /tmp/name-of-cron-job.
- When a cronjob ends, delete the file at /tmp/name-of-cron-job.
- If the file exists, assume the cron job is running. If the file has existed for longer than a set period, assume the cron job has stalled.
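One possible way to implement this is a small wrapper around each cron command. This is a minimal sketch, not the change that was made in the releng role: the function name run_with_marker is hypothetical, and the marker is removed whether the job succeeds or fails, so only a job that never returns leaves the file behind.

```shell
#!/bin/sh
# Hypothetical wrapper: create the marker, run the job, remove the marker.
run_with_marker() {
    marker="/tmp/$1"   # e.g. /tmp/name-of-cron-job
    shift
    touch "$marker"
    # Run the real cron command with its arguments.
    "$@"
    job_status=$?
    # Remove the marker once the job has ended, regardless of its result.
    rm -f "$marker"
    return "$job_status"
}
```

A cron entry would then call something like run_with_marker fedora-compose-rawhide followed by the existing compose command. Note that if the wrapper itself is killed, the marker is left in place and the trigger would eventually fire, which may or may not be the desired behaviour.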
Fedora Ansible Group Vars for Zabbix
The following var structure is required to configure this new zabbix_template role. See the example structure in inventory/groups/releng_compose for production:

    zabbix_templates:
      - group: "releng_compose"
        template: "releng_compose_cronjobs.json"
        hostgroup: "fedora releng compose"

And for staging:

    zabbix_templates: "{{ [] }}"

We currently do not run composes in staging, so I have not activated this role on the staging machines. Ordinarily, make sure to add the same vars in both the production and staging environments.
Each element in the list links a single Zabbix template to a Zabbix host group. The group parameter is not currently used, but it should be set to the Ansible group name for documentation purposes.
To use this role going forward:
- Add template.json files to roles/zabbix/zabbix_templates/files.
- Add a zabbix_templates var to the inventory/groups/groupname file that matches the Ansible group.
- Import the role in the playbook corresponding to this Ansible group, e.g.:

    # playbooks/groups/releng-compose.yml:33
    roles:
      ...
      - zabbix/zabbix_templates
      ...