Troubleshooting Common Kernel Issues on Solaris (SPARC)

Overview

This article covers practical steps to diagnose and resolve common kernel problems on Solaris running on SPARC hardware: boot failures, panics, hangs, performance regressions, and device or driver issues. It assumes you have root access, console access through the service processor or IPMI/KVM, and recent backups. Commands follow Solaris 10/11 conventions; adjust paths if your installation differs.

1. Collecting Initial Diagnostics

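  1. Boot logs: Check /var/adm/messages and its rotated copies (/var/adm/messages.0, messages.1, and so on) for kernel warnings and panic messages.
  2. Crash dumps: Check the savecore directory (run dumpadm to see the configured path, typically under /var/crash) for unix.N and vmcore.N files, or a compressed vmdump.N, and analyze them with mdb(1). On current releases mdb replaces the older crash(1M) utility.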
  3. Panic message: Record the panic string and stack trace printed on console or saved in messages.
  4. Hardware console: Use OBP/firmware (ok prompt) and IPMI/KVM to capture early boot failures.
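
The log check in step 1 can be sketched as a small shell function. This is a minimal sketch, not a Solaris tool: find_panic_lines is a hypothetical helper name, and it assumes that panic lines contain the text "panic[cpu" and that rotated logs are named messages.0, messages.1, and so on.

```shell
# find_panic_lines [DIR]
# Print panic lines, prefixed with file name and line number, from the
# current and rotated message logs in DIR (default /var/adm).
# Assumption: panic lines contain "panic[cpu"; adjust the pattern if
# your release formats them differently.
find_panic_lines() {
    dir=${1:-/var/adm}
    for f in "$dir"/messages "$dir"/messages.[0-9]*; do
        [ -f "$f" ] || continue
        # /dev/null as a second file forces grep to print the file name
        grep -n 'panic\[cpu' "$f" /dev/null
    done
    return 0
}

# Usage: find_panic_lines /var/adm
```

Saving this output next to the console capture gives a timeline of earlier panics to compare against the current one.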

2. Boot Failures or Kernel Not Found

  1. Check boot device and menu:
    • At the OBP ok prompt, run printenv boot-device to confirm the boot device (change it with setenv boot-device if needed), then run boot -s to enter single-user mode.
  2. Verify /etc/vfstab and root filesystem:
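    • From single-user mode, run fsck -F ufs /dev/rdsk/… for a UFS root; for ZFS, check the pool with zpool status and use zpool import or zpool online as needed.
  3. Rebuild the boot archive (Solaris 10/11):
    • Solaris 10: run bootadm update-archive, or recover from installation media if the system cannot boot.
    • Solaris 11: run bootadm update-archive, and use beadm to switch to another boot environment if the current one is damaged.
  4. Firmware mismatch: Ensure the OBP (PROM) and system firmware levels support the installed kernel; update firmware if needed.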
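
The vfstab check in step 2 can be sketched as follows. This is a minimal sketch under stated assumptions: root_fs_entry is a hypothetical helper name, and the function relies on the standard seven-field vfstab layout. On a ZFS root there is typically no "/" line in vfstab, in which case it prints nothing.

```shell
# AWK may be overridden; on Solaris 10 prefer nawk or /usr/xpg4/bin/awk.
AWK=${AWK:-awk}

# root_fs_entry [VFSTAB]
# Print the block device, raw (fsck) device, and filesystem type of the
# entry mounted at "/". vfstab fields are: device to mount, device to
# fsck, mount point, FS type, fsck pass, mount at boot, mount options.
root_fs_entry() {
    "$AWK" '$1 !~ /^#/ && $3 == "/" { print $1, $2, $4 }' "${1:-/etc/vfstab}"
}

# Usage from single-user mode (UFS root):
#   set -- $(root_fs_entry)      # $1 block device, $2 raw device, $3 type
#   [ "$3" = ufs ] && fsck -F ufs "$2"
# After repairing a root filesystem mounted at /a from recovery media:
#   bootadm update-archive -R /a
```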

3. Kernel Panics

  1. Capture panic output: Copy the panic string and stack trace; note the panicking thread and any module names in the trace.
  2. Analyze vmcore:
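    • Open the dump from the savecore directory with mdb unix.0 vmcore.0. Note that mdb -k attaches to the live kernel, not to a saved dump.
    • Common dcmds: ::status, ::panicinfo, ::msgbuf, ::stack, ::ps
  3. Isolate offending module:
    • Look for the first non-core module in the stack trace (for example, a driver name). To keep a suspect module from loading, add an exclude: line for it to /etc/system, or boot with boot -a and supply an alternate system file.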
  4. Reproduce under controlled load: Use test harness or stress tools to trigger and validate fixes.
  5. Mitigation: Apply patches from Oracle/Solaris providers, disable problematic drivers, or roll back recent kernel updates.
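
A first pass over a crash dump, as described in step 2, can be scripted so that every dump gets the same report. This is a minimal sketch: panic_dcmds and triage_dump are hypothetical helper names, the dcmd selection is only a starting point, and the report file name triage.N.txt is an assumption.

```shell
# panic_dcmds
# Print a batch of mdb dcmds for a first pass over a crash dump.
panic_dcmds() {
    cat <<'EOF'
::status
::panicinfo
::msgbuf
::stack
EOF
}

# triage_dump DIR N
# Run the batch against unix.N and vmcore.N in DIR and save the report.
# If only a compressed vmdump.N exists, expand it first with
#   savecore -vf DIR/vmdump.N DIR
triage_dump() {
    dir=$1 n=$2
    panic_dcmds | mdb "$dir/unix.$n" "$dir/vmcore.$n" > "$dir/triage.$n.txt" 2>&1
}

# Usage: triage_dump /var/crash/hostname 0
```

Keeping the dcmd list in one function makes it easy to add site-specific dcmds later without changing how dumps are opened.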

4. System Hangs / Unresponsive Systems

  1. Differentiate hang types:
    • Complete freeze (no console interaction) vs. soft hang (system processes stuck).
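  2. Use OBP and the service processor: If the kernel is unresponsive, send a break from the console to reach the ok prompt and type sync to force a crash dump. Fall back to a hardware reset only if that fails, and save the console log first.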
  3. Use dladm, kstat, iostat, prstat: Identify I/O, CPU, or network saturation.
  4. Kernel debugging hooks:
    • Enable console logging on the service processor, and boot with kmdb loaded (boot -k) so that a break drops into the kernel debugger.
  5. Check locks and deadlocks:
    • Use mdb to inspect thread states and lock holders: ::threadlist -v, ::findstack, ::mutex, ::rwlock, ::wchaninfo.
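
Forcing a dump from a hung system only helps if the dump configuration is sound, so it is worth checking before a hang occurs. The following is a minimal sketch, assuming the usual "Label: value" layout of dumpadm output; dump_setting is a hypothetical helper name.

```shell
# AWK may be overridden; on Solaris 10 prefer nawk or /usr/xpg4/bin/awk.
AWK=${AWK:-awk}

# dump_setting LABEL
# Read dumpadm output on stdin and print the value for LABEL,
# for example "Savecore enabled" or "Dump device".
dump_setting() {
    "$AWK" -v label="$1" '
        { i = index($0, label ": ") }
        i > 0 { print substr($0, i + length(label) + 2) }
    '
}

# Usage:
#   dumpadm | dump_setting "Savecore enabled"   # should print yes when enabled
#   dumpadm | dump_setting "Dump device"
# With a working dump configuration, a hung SPARC system can usually be
# made to dump by sending a break from the console and typing "sync"
# at the ok prompt.
```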

5. Performance Regressions

  1. Establish baseline: Compare current kstat, mpstat, vmstat with baseline.
  2. CPU and interrupt profiling:
    • Use psrinfo -pv, mpstat, and intrstat to spot interrupt storms or CPU hot spots.
  3. Memory pressure:
    • Check swap, anonymous memory, and kernel memory usage with vmstat, swap -s, and the ::memstat dcmd under mdb -k.
  4. Scheduler issues:
    • Adjust scheduling class and priority with priocntl or project resource controls; use psrset and pbind for processor-affinity issues on SPARC.
  5. ZFS and filesystem tuning: Monitor zpool status, zfs get all, and adjust ARC size if necessary.
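
The interrupt check in step 2 can be sketched as a filter over mpstat output. This is a minimal sketch: busy_intr_cpus is a hypothetical helper name and the threshold in the usage line is an arbitrary illustration, not a recommended value. Columns are located by header name rather than by position.

```shell
# AWK may be overridden; on Solaris 10 prefer nawk or /usr/xpg4/bin/awk.
AWK=${AWK:-awk}

# busy_intr_cpus LIMIT
# Read mpstat output on stdin and print "cpu intr" for every sample
# whose intr column exceeds LIMIT.
busy_intr_cpus() {
    "$AWK" -v limit="$1" '
        # Header line: remember the position of each column by name.
        $1 == "CPU" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        ("intr" in col) && $(col["intr"]) + 0 > limit + 0 {
            print $1, $(col["intr"])
        }
    '
}

# Usage (threshold is illustrative only):
#   mpstat 5 3 | busy_intr_cpus 5000
# The first mpstat report covers the time since boot, so later
# intervals are usually the ones to compare with the baseline.
```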

6. Device and Driver Problems

  1. Identify failed devices: dmesg, prtconf -D, and cfgadm -al show devices, their drivers, and attachment state.
  2. Driver versions and patches: Match driver versions to OS patches; update from Oracle support.
  3. Reconfigure or remove faulty hardware: Try hot-swap or move devices to different slots.
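  4. Unload or exclude modules: Find the module ID with modinfo and unload it with modunload -i, change driver bindings with update_drv or rem_drv, or exclude the module in /etc/system; a reboot may be required.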
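
Step 1 can be narrowed to nodes that have no driver attached. This is a minimal sketch, assuming prtconf marks such nodes with the text "driver not attached" and indents with spaces; unattached_nodes is a hypothetical helper name. Many unused nodes legitimately appear in this list, so a node shown here is not necessarily faulty.

```shell
# unattached_nodes
# Read prtconf output on stdin and print device nodes that have no
# driver attached, with leading spaces removed.
unattached_nodes() {
    grep 'driver not attached' | sed 's/^ *//'
}

# Usage:
#   prtconf | unattached_nodes
# To unload a suspect module once its ID is known:
#   modinfo | grep -w NAME       # first column is the module ID
#   modunload -i ID
```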

7. Kernel Panics During Upgrades or Patching

  1. Use Boot Environments (Solaris 11): Create a BE before patching with beadm create, then test the patched BE by activating it and rebooting.
  2. Follow patch prerequisites: Check patch dependencies and read release notes.
  3. Rollback plan: Keep an alternate BE or backup kernel to revert quickly.
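
The BE workflow above can be written down as a reviewable plan before anything is run. This is a minimal sketch for Solaris 11: be_patch_plan is a hypothetical helper that only prints commands, and the use of pkg update as the patching step is an assumption about how updates are applied at your site.

```shell
# be_patch_plan NAME
# Print, without running anything, a command sequence for patching with
# a fallback boot environment. NAME is the backup BE to create.
be_patch_plan() {
    be=${1:?usage: be_patch_plan NAME}
    cat <<EOF
# 1. keep a copy of the current BE
beadm create $be
# 2. apply updates, then confirm which BE is active on reboot
pkg update
beadm list
init 6
# 3. rollback, only if the updated BE misbehaves
beadm activate $be
init 6
EOF
}

# Usage: be_patch_plan prepatch
```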

8. mdb Crash Analysis Examples

  • Basic stack trace: echo "::stack" | mdb unix.0 vmcore.0
  • List processes: echo "::ps" | mdb unix.0 vmcore.0
  • List loaded modules on the live kernel: echo "::modinfo" | mdb -k
  • Match symbol names in the output to drivers and search the vendor or Oracle bug database.

9. Preventive Practices

  • Keep firmware, PROM, and Solaris patched and matched.
  • Maintain regular backups and use Boot Environments.
  • Enable centralized logging and remote serial console capture.
  • Test patches in staging or on non-production BEs.

10. When to Contact Vendor Support

  • Reproducible panics with stack traces pointing to kernel internals.
  • Hardware faults indicated by OBP or IPMI.
  • Required patches, or a root cause that points to proprietary drivers. Collect the crash dump files, /var/adm/messages, and dmesg output before contacting support.

Quick Troubleshooting Checklist

  • Save panic messages and vmcore.
  • Check /var/adm/messages and dmesg.
  • Boot single-user or alternate BE.
  • Run fsck or zpool import/online.
  • Use mdb to analyze crash dumps.
  • Apply vendor patches or roll back to a previous BE.
