System Monitoring & Logging

Log Analysis and Management

System Logs with journalctl

# Basic log viewing
journalctl                    # View all logs
journalctl -f                 # Follow logs in real-time
journalctl --since "2023-01-01" --until "2023-01-02"  # Date range
journalctl --since "1 hour ago"  # Recent logs

# Service-specific logs
journalctl -u nginx.service   # Specific service logs
journalctl -u ssh.service -f  # Follow SSH logs
journalctl -u mysql.service --since today  # Today's MySQL logs

# Log filtering
journalctl -p err             # Error level and above
journalctl -p warning..err    # Warning to error range
journalctl -p debug           # Debug level and above (all priorities)
journalctl -k                 # Kernel messages only
journalctl -b                 # Current boot logs
journalctl -b -1              # Previous boot logs

# Advanced filtering
journalctl _PID=1234          # Logs from specific PID
journalctl _UID=1000          # Logs from specific user
journalctl _COMM=sshd         # Logs from specific command
journalctl PRIORITY=3         # Logs with priority 3 (err)

Traditional Syslog Management

# Syslog file locations
tail -f /var/log/syslog       # General system log
tail -f /var/log/auth.log     # Authentication log
tail -f /var/log/kern.log     # Kernel log
tail -f /var/log/mail.log     # Mail system log
tail -f /var/log/daemon.log   # Daemon log

# Common log files
/var/log/messages             # General system messages
/var/log/secure               # Security/authentication log (RHEL/CentOS)
/var/log/cron                 # Cron job logs
/var/log/boot.log             # Boot process log
/var/log/dmesg                # Hardware/driver messages

# Log analysis with grep
grep "error" /var/log/syslog  # Find error messages
grep -i "failed" /var/log/auth.log  # Case-insensitive failed logins
grep "$(date '+%b %e')" /var/log/syslog  # Today's logs (%e space-pads the day, as syslog timestamps do)
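
A per-source count is often easier to act on than raw grep output. The sketch below assumes OpenSSH's usual "Failed password ... from <IP> port ..." wording and a Debian-style /var/log/auth.log; the function name failed_logins_by_ip is hypothetical.

```shell
# Count failed SSH password attempts per source address, busiest first.
# Assumes lines of the form "... Failed password for <user> from <IP> port ...".
failed_logins_by_ip() {
    local logfile="${1:-/var/log/auth.log}"
    grep "Failed password" "$logfile" \
        | awk '{ for (i = 1; i < NF; i++) if ($i == "from") print $(i + 1) }' \
        | sort | uniq -c | sort -nr
}

# Usage: failed_logins_by_ip /var/log/auth.log | head
```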

rsyslog Configuration

# rsyslog configuration
/etc/rsyslog.conf             # Main configuration file
/etc/rsyslog.d/               # Additional configuration directory

# Common rsyslog rules
*.info;mail.none;authpriv.none;cron.none    /var/log/messages
authpriv.*                                  /var/log/secure
mail.*                                      /var/log/maillog
cron.*                                      /var/log/cron

# Remote logging setup
*.* @@remote-server:514       # Send all logs to remote server (TCP)
*.* @remote-server:514        # Send all logs to remote server (UDP)

# Restart rsyslog
sudo systemctl restart rsyslog
sudo systemctl reload rsyslog

Log Rotation and Archiving

# logrotate configuration
/etc/logrotate.conf           # Main configuration
/etc/logrotate.d/             # Service-specific configurations

# Example logrotate configuration
/var/log/myapp/*.log {
    daily                     # Rotate daily
    rotate 30                 # Keep 30 days
    compress                  # Compress old logs
    delaycompress            # Delay compression for one cycle
    missingok                # Don't error if log is missing
    notifempty               # Don't rotate empty logs
    create 644 root root     # Create new log file with permissions
    postrotate
        /bin/kill -HUP $(cat /var/run/myapp.pid 2>/dev/null) 2>/dev/null || true
    endscript
}

# Manual log rotation
sudo logrotate /etc/logrotate.conf  # Run logrotate manually
sudo logrotate -d /etc/logrotate.conf  # Debug mode
sudo logrotate -f /etc/logrotate.conf  # Force rotation

System Monitoring Tools

Nagios Monitoring

# Nagios installation (Ubuntu/Debian)
sudo apt update
sudo apt install nagios3 nagios-plugins  # Older releases only; newer releases package nagios4 and monitoring-plugins

# Configuration files
/etc/nagios3/nagios.cfg       # Main configuration
/etc/nagios3/conf.d/          # Service definitions
/etc/nagios3/commands.cfg     # Command definitions

# Basic host definition
define host {
    use                     linux-server
    host_name               webserver1
    alias                   Web Server 1
    address                 192.168.1.100
    contact_groups          admins
    notification_interval   30
    notification_period     24x7
}

# Basic service definition
define service {
    use                     local-service
    host_name               webserver1
    service_description     HTTP
    check_command           check_http
    notification_interval   30
    notification_period     24x7
}

# Nagios commands
sudo systemctl start nagios3
sudo systemctl enable nagios3
sudo /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg  # Verify configuration

Zabbix Monitoring

# Zabbix agent installation
sudo apt install zabbix-agent

# Zabbix agent configuration
/etc/zabbix/zabbix_agentd.conf

# Key configuration parameters
Server=192.168.1.50           # Zabbix server IP
ServerActive=192.168.1.50     # Zabbix server for active checks
Hostname=webserver1           # Hostname for this agent

# Zabbix agent commands
sudo systemctl start zabbix-agent
sudo systemctl enable zabbix-agent
sudo systemctl status zabbix-agent

# Test Zabbix connectivity
zabbix_get -s 192.168.1.100 -k "system.cpu.load[all,avg1]"   # Quote keys so the shell does not glob the brackets
zabbix_get -s 192.168.1.100 -k "vm.memory.size[available]"

Prometheus Monitoring

# Prometheus installation
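# Set VERSION to a release listed at https://github.com/prometheus/prometheus/releases
VERSION="X.Y.Z"               # Placeholder, replace with a real release number
wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"
tar xvfz prometheus-*.tar.gz
cd prometheus-*/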
./prometheus --config.file=prometheus.yml

# Prometheus configuration (prometheus.yml)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

# Node Exporter for system metrics
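# Set VERSION to a release listed at https://github.com/prometheus/node_exporter/releases
VERSION="X.Y.Z"               # Placeholder, replace with a real release number
wget "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz"
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*/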
./node_exporter

# Common Prometheus queries
up                            # Check if targets are up
rate(node_cpu_seconds_total[5m])   # Per-mode CPU time rate (node_exporter metric)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes  # Fraction of memory available

Alert Configuration and Notification

Email Alerting Setup

# Configure mail system
sudo apt install mailutils postfix

# Test email sending
echo "Test message" | mail -s "Test Subject" admin@domain.com

# Nagios email notifications
define contact {
    contact_name            admin
    alias                   System Administrator
    email                   admin@domain.com
    service_notification_period    24x7
    host_notification_period       24x7
    service_notification_commands  notify-service-by-email
    host_notification_commands     notify-host-by-email
}

# Custom alert script
#!/bin/bash
# /usr/local/bin/alert.sh
THRESHOLD=80
USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage is $USAGE%" | mail -s "Disk Alert" admin@domain.com
fi

Slack/Teams Integration

# Slack webhook script
#!/bin/bash
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
MESSAGE="$1"
curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"$MESSAGE\"}" \
    "$WEBHOOK_URL"

# Usage
./slack-alert.sh "Server load is high: $(uptime)"
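
The script above interpolates the message straight into the JSON body, so a message containing a double quote or backslash would produce invalid JSON. A minimal sketch of escaping it first follows; json_escape and build_payload are hypothetical helper names, and newlines or other control characters are not handled.

```shell
# Escape backslashes first, then double quotes, for use inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the {"text": ...} body that the webhook call sends.
build_payload() {
    printf '{"text":"%s"}' "$(json_escape "$1")"
}

# Usage inside the alert script:
#   curl -X POST -H 'Content-type: application/json' \
#       --data "$(build_payload "$MESSAGE")" "$WEBHOOK_URL"
```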

Prometheus Alertmanager

# Alertmanager configuration (alertmanager.yml)
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@domain.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  email_configs:
  - to: 'admin@domain.com'
    headers:
      Subject: 'Alert: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

# Prometheus alert rules
groups:
- name: system
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for 5 minutes"

Performance Metrics Collection

System Performance Metrics

# CPU metrics
sar -u 1 60                   # CPU utilization
mpstat -P ALL 1               # Per-CPU statistics
pidstat -u 1                  # Per-process CPU usage

# Memory metrics
sar -r 1 60                   # Memory utilization
sar -S 1 60                   # Swap utilization
pidstat -r 1                  # Per-process memory usage

# Disk I/O metrics
sar -d 1 60                   # Disk activity
iostat -x 1                   # Extended disk statistics
pidstat -d 1                  # Per-process disk I/O

# Network metrics
sar -n DEV 1 60               # Network device statistics
sar -n EDEV 1 60              # Network error statistics
ss -s                         # Socket statistics

Custom Metrics Collection

# Create custom metrics script
#!/bin/bash
# /usr/local/bin/collect-metrics.sh
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM_USAGE=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}')
DISK_USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
LOAD_AVG=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | xargs)  # xargs trims the leading space

echo "$TIMESTAMP,CPU:$CPU_USAGE,MEM:$MEM_USAGE,DISK:$DISK_USAGE,LOAD:$LOAD_AVG" >> /var/log/metrics.csv

# Run every minute via cron
# * * * * * /usr/local/bin/collect-metrics.sh
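
Once the CSV has some history, a per-metric average gives a quick baseline. This is a sketch that assumes the exact line layout written by collect-metrics.sh above (a timestamp followed by KEY:value pairs); metric_average is a hypothetical name.

```shell
# Average one metric (CPU, MEM, DISK or LOAD) across all rows of the CSV.
metric_average() {
    local file="$1" key="$2"
    awk -F',' -v key="$key" '
        {
            # Field 1 is the timestamp; the rest are KEY:value pairs.
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[1] == key) { sum += kv[2]; n++ }
            }
        }
        END { if (n > 0) printf "%.2f\n", sum / n }
    ' "$file"
}

# Usage: metric_average /var/log/metrics.csv CPU
```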

Application Performance Monitoring

# Java application monitoring
jstat -gc PID 1s              # Garbage collection statistics
jstack PID                    # Thread dump
jmap -dump:format=b,file=heap.hprof PID  # Heap dump

# Database monitoring
mysqladmin processlist        # MySQL process list
mysqladmin status             # MySQL status
psql -c "SELECT * FROM pg_stat_activity;"  # PostgreSQL activity

# Web server monitoring
curl -s -o /dev/null -w "@curl-format.txt" http://localhost/  # Response time (reads the -w format from curl-format.txt)
ab -n 100 -c 10 http://localhost/  # Apache bench
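
The curl command above reads its output template from curl-format.txt, which this guide does not define. One possible version is sketched below; the chosen timing variables are standard curl --write-out variables, while the helper name write_curl_format and the layout are assumptions.

```shell
# Write a curl --write-out template to the given path (default: curl-format.txt).
# Each line ends with a literal \n, which curl expands to a newline.
write_curl_format() {
    local target="${1:-curl-format.txt}"
    cat > "$target" <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
}

# Usage: write_curl_format curl-format.txt
```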

Centralized Logging

ELK Stack Setup

# Elasticsearch installation
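# apt-key is deprecated; store the key in a keyring and reference it with signed-by
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list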
sudo apt update && sudo apt install elasticsearch

# Elasticsearch configuration
/etc/elasticsearch/elasticsearch.yml
cluster.name: my-cluster
node.name: node-1
network.host: localhost
http.port: 9200

# Logstash configuration
/etc/logstash/conf.d/syslog.conf
input {
  file {
    path => "/var/log/syslog"
    start_position => "beginning"
  }
}

filter {
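  grok {
    # Program names may contain hyphens and an optional [pid] suffix, e.g. sshd[1234]
    match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:message}" }
    overwrite => [ "message" ]
  }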
  date {
    match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "syslog-%{+YYYY.MM.dd}"
  }
}

# Kibana configuration
/etc/kibana/kibana.yml
server.port: 5601
server.host: "localhost"
elasticsearch.hosts: ["http://localhost:9200"]

Fluentd Configuration

# Fluentd installation
gem install fluentd

# Fluentd configuration
/etc/fluent/fluent.conf
<source>
  @type tail
  path /var/log/syslog
  pos_file /var/log/fluent/syslog.log.pos
  tag syslog
  format syslog
</source>

<match syslog>
  @type elasticsearch
  host localhost
  port 9200
  index_name syslog
  type_name syslog
</match>

# Start Fluentd
fluentd -c /etc/fluent/fluent.conf -d /var/log/fluent/fluentd.pid

Rsyslog to Elasticsearch

# Install rsyslog elasticsearch module
sudo apt install rsyslog-elasticsearch

# Configure rsyslog for Elasticsearch
/etc/rsyslog.d/elasticsearch.conf
module(load="omelasticsearch")
*.* action(type="omelasticsearch"
           server="localhost"
           serverport="9200"
           template="StdJSONFmt"
           searchIndex="syslog-index")

# Restart rsyslog
sudo systemctl restart rsyslog

Real-time Monitoring Dashboards

Grafana Setup

# Grafana installation
sudo apt-get install -y adduser libfontconfig1
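# Set VERSION to a release listed at https://grafana.com/grafana/download
VERSION="X.Y.Z"               # Placeholder, replace with a real release number
wget "https://dl.grafana.com/oss/release/grafana_${VERSION}_amd64.deb"
sudo dpkg -i "grafana_${VERSION}_amd64.deb"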

# Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

# Access Grafana: http://localhost:3000 (admin/admin)

# Add Prometheus data source
URL: http://localhost:9090

Custom Dashboard Creation

# Example Grafana dashboard JSON
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU Usage %"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "Memory Usage %"
          }
        ]
      }
    ]
  }
}

Terminal-based Dashboards

# htop with custom configuration
htop -C                       # Monochrome mode (no color)
htop -d 10                    # Update every second (delay is in tenths of a second)
htop -u username              # Show only user processes

# nmon for comprehensive monitoring
nmon                          # Interactive mode
nmon -f -s 5 -c 720          # Record to file (5 sec intervals, 1 hour)

# tmux monitoring dashboard
tmux new-session -d 'htop'
tmux split-window -h 'watch -n 1 "df -h"'
tmux split-window -v 'tail -f /var/log/syslog'
tmux select-pane -t 0
tmux split-window -v 'watch -n 1 "free -h"'
tmux attach

Log Parsing and Analysis Tools

Advanced Log Analysis

# awk for log analysis
awk '{print $1}' /var/log/access.log | sort | uniq -c | sort -nr  # IP frequency
awk '$9 == 404 {print $7}' /var/log/access.log | sort | uniq -c   # 404 errors
awk '{sum += $10} END {print "Total bytes:", sum}' /var/log/access.log  # Total bytes

# sed for log processing
sed -n '1000,2000p' /var/log/syslog  # Lines 1000-2000
sed '/ERROR/!d' /var/log/app.log     # Only ERROR lines
sed 's/192\.168\.1\.100/SERVER1/g' /var/log/syslog  # Replace IP with name (dots escaped to match literally)

# grep advanced patterns
grep -E "(error|ERROR|Error)" /var/log/syslog    # Case variations
grep -v "INFO" /var/log/app.log | grep -E "(WARN|ERROR)"  # Exclude INFO, show WARN/ERROR
grep -B 5 -A 5 "OutOfMemory" /var/log/app.log    # 5 lines before/after
grep -r "connection refused" /var/log/           # Recursive search

GoAccess Web Log Analyzer

# Install GoAccess
sudo apt install goaccess

# Real-time web log analysis
goaccess /var/log/nginx/access.log --log-format=COMBINED
goaccess /var/log/apache2/access.log --log-format=COMBINED

# Generate HTML report
goaccess /var/log/nginx/access.log -o /var/www/html/report.html --log-format=COMBINED --real-time-html

# Custom log format
goaccess /var/log/custom.log --log-format='%h %^[%d:%t %^] "%r" %s %b "%R" "%u"'

Logwatch System

# Install Logwatch
sudo apt install logwatch

# Configuration
/etc/logwatch/conf/logwatch.conf
MailTo = admin@domain.com
MailFrom = logwatch@server.com
Range = yesterday
Detail = Med

# Generate report
logwatch --detail Med --mailto admin@domain.com --range yesterday
logwatch --service sshd --print  # Specific service report
logwatch --service http --range today --detail High

System Health Checks

Automated Health Monitoring

# System health check script
#!/bin/bash
# /usr/local/bin/health-check.sh

# Check CPU load
LOAD=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | xargs)
if (( $(echo "$LOAD > 4.0" | bc -l) )); then
    echo "WARNING: High CPU load: $LOAD"
fi

# Check memory usage
MEM_USAGE=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}')
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
    echo "WARNING: High memory usage: $MEM_USAGE%"
fi

# Check disk usage
df -hP | awk 'NR>1 {gsub(/%/, "", $5); if ($5+0 > 90) print "WARNING: High disk usage on", $6, ":", $5"%"}'

# Check running services
for service in nginx mysql ssh; do
    if ! systemctl is-active --quiet "$service"; then
        echo "ERROR: Service $service is not running"
    fi
done

# Check log errors
ERROR_COUNT=$(grep -c "ERROR" /var/log/syslog)
if [ "$ERROR_COUNT" -gt 10 ]; then
    echo "WARNING: $ERROR_COUNT errors found in syslog"
fi
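
The load and memory checks above depend on bc, which is not installed on every system. As an alternative, the comparison can be sketched with awk alone; float_gt is a hypothetical helper name.

```shell
# Succeed (exit 0) when the first number is strictly greater than the second.
# The "+ 0" forces a numeric comparison rather than a string comparison.
float_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 > b + 0) }'
}

# Usage:
#   if float_gt "$LOAD" 4.0; then echo "WARNING: High CPU load: $LOAD"; fi
```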

Service Health Monitoring

# Service monitoring script
#!/bin/bash
# /usr/local/bin/service-monitor.sh

SERVICES=("nginx" "mysql" "redis" "ssh")
for service in "${SERVICES[@]}"; do
    if systemctl is-active --quiet "$service"; then
        echo "$service: OK"
    else
        echo "$service: FAILED"
        systemctl restart "$service"
        sleep 5
        if systemctl is-active --quiet "$service"; then
            echo "$service: RECOVERED"
        else
            echo "$service: RESTART FAILED" | mail -s "Service Alert" admin@domain.com
        fi
    fi
done

Network Health Checks

# Network connectivity check
#!/bin/bash
# /usr/local/bin/network-check.sh

HOSTS=("8.8.8.8" "google.com" "internal-server.local")
for host in "${HOSTS[@]}"; do
    if ping -c 1 "$host" &> /dev/null; then
        echo "$host: OK"
    else
        echo "$host: UNREACHABLE"
    fi
done

# Port connectivity check
PORTS=("80" "443" "22" "3306")
for port in "${PORTS[@]}"; do
    if nc -z localhost "$port"; then
        echo "Port $port: OPEN"
    else
        echo "Port $port: CLOSED"
    fi
done

Monitoring Automation

Cron-based Monitoring

# Crontab for monitoring tasks
# Edit with: crontab -e

# System metrics every minute
* * * * * /usr/local/bin/collect-metrics.sh

# Health check every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh

# Log rotation check daily
0 2 * * * /usr/sbin/logrotate /etc/logrotate.conf

# Weekly system report
0 8 * * 1 /usr/local/bin/weekly-report.sh | mail -s "Weekly System Report" admin@domain.com

# Disk usage alert daily
0 9 * * * /usr/local/bin/disk-usage-alert.sh
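
The crontab references disk-usage-alert.sh without showing it. The core of such a script might look like the sketch below, which assumes POSIX `df -P` output on standard input; the function name mounts_over_threshold and the 90% default are assumptions.

```shell
# Read `df -P` output on stdin and print mount points at or above the threshold.
mounts_over_threshold() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)            # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Usage: df -P | mounts_over_threshold 90
```

The output can be piped to mail in the same way as the other alert scripts in this guide.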

Systemd Service Monitoring

# Create monitoring service
/etc/systemd/system/system-monitor.service
[Unit]
Description=System Monitor
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/system-monitor.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target

# Enable and start service
sudo systemctl enable system-monitor.service
sudo systemctl start system-monitor.service
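
The unit above expects a long-running /usr/local/bin/system-monitor.sh, which this guide does not define. A minimal sketch of its building blocks follows; the function names, the 60-second default interval, and the choice to sample only the load average are all assumptions.

```shell
# Hypothetical contents for /usr/local/bin/system-monitor.sh.
INTERVAL="${INTERVAL:-60}"    # Seconds between samples (assumed default)

# Print the 1, 5 and 15 minute load averages from a loadavg-style file.
log_load() {
    local source_file="${1:-/proc/loadavg}"
    local one five fifteen rest
    read -r one five fifteen rest < "$source_file"
    echo "load1=$one load5=$five load15=$fifteen"
}

# Sample forever; with Type=simple, systemd keeps this process running
# and collects its standard output in the journal.
monitor_loop() {
    while true; do
        log_load
        sleep "$INTERVAL"
    done
}
```

To use it as the ExecStart target, end the script with a call to monitor_loop; the samples can then be read with journalctl -u system-monitor.service.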

Monitoring with Ansible

# monitoring-playbook.yml
---
- hosts: all
  tasks:
    - name: Check system load
      shell: uptime | awk -F'load average:' '{print $2}' | cut -d, -f1
      register: load_avg

    - name: Alert on high load
      mail:
        to: admin@domain.com
        subject: 'High Load Alert'
        body: 'System load is {{ load_avg.stdout }}'
      when: load_avg.stdout|float > 4.0

    - name: Check disk usage
      shell: df / | awk 'NR==2 {print $5}' | sed 's/%//'
      register: disk_usage

    - name: Alert on disk usage
      mail:
        to: admin@domain.com
        subject: 'Disk Usage Alert'
        body: 'Disk usage is {{ disk_usage.stdout }}%'
      when: disk_usage.stdout|int > 90

Best Practices

Monitoring Strategy

Establish baselines - Understand normal system behavior
Monitor continuously - Use automated tools for 24/7 monitoring
Set meaningful thresholds - Avoid alert fatigue with proper limits
Implement redundancy - Use multiple monitoring systems
Document everything - Maintain runbooks and procedures
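
One way to turn a baseline into a threshold is to take the mean of past samples plus two standard deviations. The sketch below assumes one numeric sample per line on standard input; the two-sigma rule and the name baseline_threshold are illustrative choices, not a recommendation from this guide.

```shell
# Print mean + 2 * (population) standard deviation of the samples on stdin.
baseline_threshold() {
    awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            if (n == 0) exit 1
            mean = sum / n
            variance = sumsq / n - mean * mean
            if (variance < 0) variance = 0   # Guard against rounding error
            printf "%.2f\n", mean + 2 * sqrt(variance)
        }
    '
}

# Usage: cut -d, -f5 /var/log/metrics.csv | cut -d: -f2 | baseline_threshold
```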

Log Management Best Practices

Centralize logs - Use centralized logging solutions
Implement log rotation - Prevent disk space issues
Use structured logging - JSON or similar formats for easier parsing
Set retention policies - Balance storage costs with compliance needs
Monitor log volume - Detect anomalies in log generation
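
Log volume can be watched with a simple per-hour line count. This sketch assumes traditional syslog timestamps ("Jan  5 10:00:01") at the start of each line; lines_per_hour is a hypothetical name.

```shell
# Count log lines per hour, keyed by month, day and hour.
lines_per_hour() {
    awk '
        { count[$1 " " $2 " " substr($3, 1, 2)]++ }
        END { for (k in count) print k ":00", count[k] }
    ' "$1" | sort
}

# Usage: lines_per_hour /var/log/syslog
```

A sudden jump or drop between adjacent hours is a cue to look at the underlying entries.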

Security Considerations

Secure log files - Proper permissions and access controls
Encrypt log transmission - Use TLS for remote logging
Audit log access - Monitor who accesses sensitive logs
Regular security reviews - Check monitoring system security
Backup monitoring data - Protect against data loss
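
A quick permissions audit can start by listing log files that any user can read. The sketch below uses only find; world_readable_logs is a hypothetical name and /var/log is an assumed default directory.

```shell
# List regular files under a directory that have the "other read" bit set.
world_readable_logs() {
    local dir="${1:-/var/log}"
    find "$dir" -type f -perm -004 2>/dev/null
}

# Usage: world_readable_logs /var/log
```

Whether a given file should be world-readable depends on its content; authentication logs usually should not be.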

Performance Optimization

Efficient log parsing - Use appropriate tools for log analysis
Optimize queries - Efficient database queries for metrics
Use compression - Reduce storage requirements
Implement caching - Cache frequently accessed metrics
Regular maintenance - Clean up old data and optimize systems
