Troubleshooting Common Kernel Issues on Solaris (SPARC)
Overview
This article covers practical steps to diagnose and resolve common kernel problems on Solaris running on SPARC hardware: boot failures, panics, hangs, performance regressions, and device/driver issues. It assumes you have root access, console access (serial, or through the service processor or IPMI/KVM), and recent backups. Commands use Solaris 10/11 conventions; adjust paths if your installation differs.
1. Collecting Initial Diagnostics
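- Boot logs: Check /var/adm/messages and its rotated copies (messages.0, messages.1, and so on; compressed only if logadm is configured that way) for kernel warnings and panic messages.
- Crash dumps: Run dumpadm to confirm the savecore directory (commonly /var/crash or /var/crash/<hostname>), check it for unix.N/vmcore.N pairs or a compressed vmdump.N, and analyze them with mdb(1). The older crash(1M) utility is not available on Solaris 10/11.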
- Panic message: Record the panic string and stack trace printed on console or saved in messages.
- Hardware console: Use the OBP firmware (ok prompt) and the service processor console (ILOM/ALOM, or IPMI/KVM where supported) to capture early boot failures.
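As a starting point, the first two checks can be scripted. The following is a minimal sketch, assuming a POSIX-style shell such as ksh93 or bash (the Solaris 10 /bin/sh lacks some of these constructs); the function names are illustrative and not part of Solaris.

```shell
# Sketch only: pull panic-related lines out of a syslog file.
# MESSAGES defaults to the Solaris location named above.
MESSAGES=${MESSAGES:-/var/adm/messages}

# Print matching lines with their line numbers. The match is
# case-insensitive so that "panic[cpu0]" and "Panic" are both caught.
find_panics() {
    grep -in 'panic' "${1:-$MESSAGES}"
}

# Show where crash dumps are configured to go (dumpadm needs root).
show_dump_config() {
    if command -v dumpadm >/dev/null 2>&1; then
        dumpadm
    else
        echo "dumpadm not found; is this a Solaris host?" >&2
        return 1
    fi
}
```

Typical use is `find_panics` for the live log, or `find_panics /var/adm/messages.0` for a rotated copy, followed by `show_dump_config` to see which directory to search for dumps.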
2. Boot Failures or Kernel Not Found
- Check boot device and menu:
- At the OBP ok prompt, run boot -s to enter single-user mode; use printenv boot-device to check the configured boot device and setenv boot-device to change it.
- Verify /etc/vfstab and root filesystem:
- From single-user: run fsck -F ufs /dev/rdsk/… or zpool import/online for ZFS.
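- Rebuild the boot archive (Solaris 10/11):
- Solaris 10: run bootadm update-archive, or boot from install media and repair the root filesystem mounted at an alternate root.
- Solaris 11: run bootadm update-archive, and use beadm to list, mount, or activate an alternate boot environment (BE).
- Firmware mismatch: Ensure the OBP (PROM) and system firmware versions meet the minimum required by the Solaris release and kernel patch level; update firmware if needed.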
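The boot archive rebuild from a recovery shell can be sketched as follows. This assumes the damaged root filesystem is already mounted at an alternate root (ALTROOT, defaulting to /a, is an assumption here), and it only prints the command unless DRY_RUN=0 is set. The `run` helper and function name are illustrative.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 only on the affected host, from the recovery shell.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# ALTROOT is an assumed mount point for the damaged root filesystem.
ALTROOT=${ALTROOT:-/a}

# Rebuild the boot archive of the root filesystem mounted at ALTROOT.
rebuild_boot_archive() {
    run bootadm update-archive -v -R "$ALTROOT"
}
```

Printing first makes it easy to review the exact command before running it against a system that is already in a fragile state.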
3. Kernel Panics
- Capture panic output: Record the panic string (panicstr) and stack trace; note the panicking thread and any module names that appear in the stack.
- Analyze vmcore:
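- Use mdb on the saved namelist/core pair: cd /var/crash/hostname, then mdb unix.0 vmcore.0 (the -k option is for the live kernel, not for dump files).
- Common dcmds: ::status, ::panicinfo, ::msgbuf, ::stack, ::ps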
- Isolate offending module:
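- Look for a non-core module in the stack trace (e.g., a driver name). To keep a suspect module from loading on SPARC, add an exclude: entry to /etc/system; if the change prevents booting, boot -a lets you name an alternate system file.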
- Reproduce under controlled load: Use test harness or stress tools to trigger and validate fixes.
- Mitigation: Apply the relevant Oracle patches or SRUs, disable the problematic driver, or roll back the recent kernel update.
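One way to disable a suspect module is an exclude: line in /etc/system (see system(4)). The sketch below appends such a line after making a backup copy; the function name and the module name in the usage example are hypothetical. Excluding a driver that the root device depends on will leave the system unbootable, so treat this as an illustration rather than a recipe.

```shell
# Append an "exclude:" line for a module to a system(4) file, keeping a backup.
# Usage: exclude_module <namespace/module> [system-file]
exclude_module() {
    mod=$1
    sysfile=${2:-/etc/system}
    # Do nothing if the same exclude line is already present.
    if grep "^exclude: $mod\$" "$sysfile" >/dev/null 2>&1; then
        return 0
    fi
    # Keep the previous file so it can be named at the boot -a prompt.
    cp -p "$sysfile" "$sysfile.bak" || return 1
    echo "exclude: $mod" >> "$sysfile"
}
```

For example, `exclude_module drv/exampledrv` (a made-up driver name) adds `exclude: drv/exampledrv` and leaves the prior contents in /etc/system.bak. The change takes effect at the next boot.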
4. System Hangs / Unresponsive Systems
- Differentiate hang types:
- Complete freeze (no console interaction) vs. soft hang (system processes stuck).
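- Use OBP and the service processor: If the kernel is unresponsive, send a break from the console to reach the ok prompt and run sync to force a crash dump before resorting to a hardware reset; preserve the console log.
- For a soft hang, use prstat, iostat -xn, kstat, and dladm show-link -s to identify CPU, I/O, or network saturation.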
- Kernel debugging hooks:
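- Log the system console through the service processor or a serial concentrator so messages printed just before the hang are preserved; booting with boot -k loads kmdb so that a break drops into the kernel debugger.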
- Check locks and deadlocks:
- Use mdb -k to inspect thread states and lock holders: ::threadlist -v, ::findstack, ::mutex, ::rwlock, ::wchaninfo.
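For a soft hang where a root shell still responds, thread state can be captured from the live kernel. This is a minimal sketch with illustrative function names; it assumes mdb is on the PATH and that the release provides the ::stacks dcmd (older releases may only have ::threadlist and ::findstack).

```shell
# Run one or more mdb dcmds against the live kernel (requires root).
# Each argument is sent to mdb on its own line.
live_dcmds() {
    printf '%s\n' "$@" | mdb -k
}

# Thread-level view of a soft hang:
#   ::threadlist -v   every kernel thread with its state and a short stack
#   ::stacks          threads grouped by identical stack (newer releases)
hang_snapshot() {
    live_dcmds '::threadlist -v' '::stacks'
}
```

Redirecting the output to a file, for example `hang_snapshot > /var/tmp/hang.$$`, keeps a record that can be compared across samples or sent to support.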
5. Performance Regressions
- Establish baseline: Compare current kstat, mpstat, and vmstat output against a baseline captured while the system was healthy.
- CPU and interrupt profiling:
- Use psrinfo -pv to confirm the processor layout, then mpstat (intr and ithr columns) and intrstat to spot interrupt storms or CPU hot spots.
- Memory pressure:
- Check free memory, page scan rate (sr), swap, and anonymous memory with vmstat, swap -s, and prstat -s rss; echo ::memstat | mdb -k gives a kernel-side breakdown.
- Scheduler issues:
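- Adjust scheduling classes and priorities with priocntl and resource controls with projects; use psrset or pbind to bind work to specific CPUs.
- ZFS and filesystem tuning: Monitor zpool status and zpool iostat, review properties with zfs get all, and cap the ARC (zfs_arc_max in /etc/system) if necessary.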
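The interrupt check can be partly automated by filtering mpstat output. The sketch below locates the intr column by its header name rather than by position; the function name and the threshold in the usage example are arbitrary choices, not recommended values. On Solaris 10, set AWK to nawk or /usr/xpg4/bin/awk, since the default awk does not accept -v.

```shell
AWK=${AWK:-awk}   # on Solaris 10, nawk or /usr/xpg4/bin/awk is assumed

# Read mpstat output on stdin and print "CPU intr" for each CPU whose
# intr column exceeds the threshold given as $1. The column is located
# by header name, so extra columns do not break the parse.
flag_intr() {
    $AWK -v limit="$1" '
        $1 == "CPU" {                 # header line: find the intr column
            for (i = 1; i <= NF; i++) if ($i == "intr") col = i
            next
        }
        col && $col + 0 > limit { print $1, $col }
    '
}
```

On a Solaris host this could be run as `mpstat 5 2 | flag_intr 5000`. The first mpstat sample reports averages since boot, so only the later samples reflect current load.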
6. Device and Driver Problems
- Identify failed devices: dmesg, prtconf -D, and cfgadm -al show devices, their bound drivers, and attachment-point conditions.
- Driver versions and patches: Match driver versions to OS patches; update from Oracle support.
- Reconfigure or remove faulty hardware: Try hot-swap or move devices to different slots.
- Disable or unload modules: Find the module ID with modinfo and unload it with modunload -i <id>, or change the driver binding with update_drv or rem_drv; a reboot may be required.
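To narrow down the device listing, the Condition column of cfgadm -al can be filtered for attachment points that report a problem. This is a sketch with an illustrative function name; it assumes the usual five-column layout (Ap_Id, Type, Receptacle, Occupant, Condition) and an awk that supports regex alternation (nawk or /usr/xpg4/bin/awk on Solaris 10).

```shell
AWK=${AWK:-awk}   # on Solaris 10, nawk or /usr/xpg4/bin/awk is assumed

# Read `cfgadm -al` output on stdin and print "Ap_Id condition" for
# attachment points whose Condition (last column) is failing, failed,
# or unusable. The header line is skipped.
bad_attachment_points() {
    $AWK 'NR > 1 && $NF ~ /^(failing|failed|unusable)$/ { print $1, $NF }'
}
```

Usage on a Solaris host would be `cfgadm -al | bad_attachment_points`. An empty result does not prove the hardware is healthy, because many attachment points report a condition of unknown.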
7. Kernel Panics During Upgrades or Patching
- Use Boot Environments (Solaris 11): Create a BE before patching with beadm create, so that the pre-patch state can be reactivated and booted if the update panics.
- Follow patch prerequisites: Check patch dependencies and read release notes.
- Rollback plan: Keep an alternate BE or backup kernel to revert quickly.
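The BE-based rollback plan might look like the following on Solaris 11. This is a dry-run sketch: it prints the commands unless DRY_RUN=0 is set, and the BE name pre-patch, the `run` helper, and the function names are all illustrative.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

BE_NAME=${BE_NAME:-pre-patch}   # hypothetical BE name

# Before patching: clone the active BE and confirm it is listed.
snapshot_be() {
    run beadm create "$BE_NAME"
    run beadm list
}

# After a bad update: make the clone active on next boot, then reboot.
rollback_be() {
    run beadm activate "$BE_NAME"
    run init 6
}
```

If the patched kernel cannot boot far enough to run beadm, the saved BE can usually still be selected from the OBP prompt; the exact procedure depends on the release.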
8. Using mdb and crash Analysis Examples
- Basic stack trace: echo "::stack" | mdb unix.0 vmcore.0
- List processes: echo "::ps" | mdb unix.0 vmcore.0
- Inspect loaded modules: echo "::modinfo" | mdb unix.0 vmcore.0 (or mdb -k on the live kernel)
- Match symbol names in the output to drivers, then search the Oracle or driver vendor bug database.
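These dcmds can be collected in one pass over a saved dump. The sketch below assumes the dump directory holds an uncompressed unix.N and vmcore.N pair; if only vmdump.N is present, it usually has to be expanded first with savecore -vf vmdump.N. The function name is illustrative.

```shell
# Print a short report for dump number N in directory DIR.
# Usage: dump_report <dir> [N]      (N defaults to 0)
dump_report() {
    dir=$1
    n=${2:-0}
    # Subshell so the caller's working directory is left unchanged.
    ( cd "$dir" &&
      printf '%s\n' '::status' '::panicinfo' '::msgbuf' '::stack' '::ps' |
      mdb "unix.$n" "vmcore.$n" )
}
```

For example, `dump_report /var/crash/hostname 0 > /var/tmp/panic-report.txt` produces a text file that can be attached to a support case alongside the dump itself.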
9. Preventive Practices
- Keep firmware, PROM, and Solaris patched and matched.
- Maintain regular backups and use Boot Environments.
- Enable centralized logging and remote serial console capture.
- Test patches in staging or on non-production BEs.
10. When to Contact Vendor Support
- Reproducible panics with stack traces pointing to kernel internals.
- Hardware faults indicated by OBP or IPMI.
- Required patches, or a root cause that points to proprietary drivers. In every case, collect the crash dump files, /var/adm/messages, and dmesg output before contacting support.
Quick Troubleshooting Checklist
- Save panic messages and vmcore.
- Check /var/adm/messages and dmesg.
- Boot single-user or alternate BE.
- Run fsck or zpool import/online.
- Use mdb to analyze crash dumps.
- Apply vendor patches or rollback BE.