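One day at 23:24, monitoring raised a server-down alert. When I logged in, the server had already rebooted. The basic monitoring metrics (CPU, memory, network) all looked normal, and the system logs /var/log/messages and /var/log/dmesg contained no obvious error. I then found that the system had produced a crash dump, so I started analyzing it: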
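1. First install the kernel-debuginfo packages that match the running kernel version. There are normally two: kernel-debuginfo and kernel-debuginfo-common (kernel-debuginfo-common-x86_64 on this machine). My system is CentOS 7, which is no longer maintained, so installing with yum failed and I installed the downloaded packages with rpm instead: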
(base) [root@bj-self-ai-test-122-1 /var/log]# uname -r
3.10.0-1160.118.1.el7.x86_64
(base) [root@bj-self-ai-test-122-1 /data1/jjl/debuginfo]# ll
total 516468
-rw-r--r-- 1 root root 463524436 Sep 25 17:19 kernel-debuginfo-3.10.0-1160.118.1.el7.x86_64.rpm
-rw-r--r-- 1 root root  65325832 Sep 25 17:09 kernel-debuginfo-common-x86_64-3.10.0-1160.118.1.el7.x86_64.rpm
(base) [root@bj-self-ai-test-122-1 /data1/jjl/debuginfo]# ls /usr/lib/debug/lib/
(base) [root@bj-self-ai-test-122-1 /data1/jjl/debuginfo]# rpm -ivh *.rpm
(base) [root@bj-self-ai-test-122-1 /data1/jjl/debuginfo]# ls /usr/lib/debug/lib/modules/3.10.0-1160.118.1.el7.x86_64/vmlinux
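The names of the two packages can be derived from the `uname -r` output, which helps when searching a mirror for the right files. The following is a minimal sketch in Python, not part of the original procedure: the helper name `debuginfo_rpms` is hypothetical, and the naming pattern is simply inferred from the two files listed above.

```python
import platform
from typing import List


def debuginfo_rpms(release: str) -> List[str]:
    """Return the debuginfo RPM file names expected for a kernel release string.

    The pattern is an assumption based on the two files shown in the listing.
    """
    # The architecture is the last dot-separated field, e.g. "x86_64".
    arch = release.rsplit(".", 1)[-1]
    return [
        f"kernel-debuginfo-{release}.rpm",
        f"kernel-debuginfo-common-{arch}-{release}.rpm",
    ]


if __name__ == "__main__":
    # platform.release() reports the same string as `uname -r`.
    for name in debuginfo_rpms(platform.release()):
        print(name)
```

Both packages must match the running kernel exactly, otherwise crash may refuse to load the vmlinux against the vmcore.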
2. Analyze the dump file produced by the system. Dump files are normally written under /var/crash/:
(base) [root@bj-self-ai-test-122-1 /data1/jjl/debuginfo]# crash /usr/lib/debug/lib/modules/3.10.0-1160.118.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2024-09-23-23\:22\:04/vmcore

crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [82MB]: patching 87501 gdb minimal_symbol values

KERNEL: /usr/lib/debug/lib/modules/3.10.0-1160.118.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2024-09-23-23:22:04/vmcore  [PARTIAL DUMP]
CPUS: 80
DATE: Mon Sep 23 23:21:52 2024
UPTIME: 8 days, 07:15:30
LOAD AVERAGE: 13.26, 14.40, 15.60
TASKS: 8194
NODENAME: bj-self-ai-test-122-1
RELEASE: 3.10.0-1160.118.1.el7.x86_64
VERSION: #1 SMP Wed Apr 24 16:01:50 UTC 2024
MACHINE: x86_64  (2500 Mhz)
MEMORY: 255.7 GB
PANIC: "BUG: unable to handle kernel paging request at 00000005a3200018"
PID: 153923
COMMAND: "python3"
TASK: ffff9e336c82e300  [THREAD_INFO: ffff9e1591f24000]
CPU: 6
STATE: TASK_RUNNING (PANIC)

crash>

The key information here is PANIC: "BUG: unable to handle kernel paging request at 00000005a3200018", which points to the python3 process (PID 153923). Print its backtrace:

crash> bt
PID: 153923  TASK: ffff9e336c82e300  CPU: 6  COMMAND: "python3"
 #0 [ffff9e1591f277d0] machine_kexec at ffffffff86269854
 #1 [ffff9e1591f27830] __crash_kexec at ffffffff86329f12
 #2 [ffff9e1591f27900] crash_kexec at ffffffff8632a008
 #3 [ffff9e1591f27918] oops_end at ffffffff869bc818
 #4 [ffff9e1591f27940] no_context at ffffffff8627974c
 #5 [ffff9e1591f27990] __bad_area_nosemaphore at ffffffff86279a2a
 #6 [ffff9e1591f279e0] bad_area_nosemaphore at ffffffff86279b54
 #7 [ffff9e1591f279f0] __do_page_fault at ffffffff869bf8d0
 #8 [ffff9e1591f27a60] do_page_fault at ffffffff869bfb05
 #9 [ffff9e1591f27a90] page_fault at ffffffff869bb7b8
    [exception RIP: _nv043176rm+29]
    RIP: ffffffffc3ae9bdd  RSP: ffff9e1591f27b48  RFLAGS: 00010286
    RAX: 00000005a3200000  RBX: ffff9e2b36ebdb78  RCX: ffff9e14ebe91808
    RDX: ffffffffffffffd8  RSI: ffff9e0b7c5ef830  RDI: ffff9e14ebe94830
    RBP: ffff9e2b36ebdae0   R8: ffffffffffffffd8   R9: ffff9e2b36ebd9c0
    R10: ffff9e2b36ebdb2c  R11: ffffffffc63b8470  R12: 0000000000000000
    R13: ffff9e14ebe94830  R14: ffff9e2dc5ff6030  R15: ffff9df76e046a98
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9e1591f27b48] _nv016187rm at ffffffffc3ae9370 [nvidia]
#11 [ffff9e1591f27b68] _nv025065rm at ffffffffc36efa4f [nvidia]
#12 [ffff9e1591f27b98] _nv046572rm at ffffffffc36dddd5 [nvidia]
#13 [ffff9e1591f27bd8] _nv043262rm at ffffffffc3ae9819 [nvidia]
#14 [ffff9e1591f27c08] _nv045217rm at ffffffffc3ae699c [nvidia]
#15 [ffff9e1591f27c38] _nv013230rm at ffffffffc32ac7b5 [nvidia]
#16 [ffff9e1591f27c68] _nv043406rm at ffffffffc32ad109 [nvidia]
#17 [ffff9e1591f27ca8] _nv011755rm at ffffffffc32b2026 [nvidia]
#18 [ffff9e1591f27cd8] _nv000715rm at ffffffffc3c89f61 [nvidia]
#19 [ffff9e1591f27d28] rm_ioctl at ffffffffc3c90b88 [nvidia]
#20 [ffff9e1591f27e08] nvidia_ioctl at ffffffffc32065fb [nvidia]
#21 [ffff9e1591f27e80] nvidia_frontend_unlocked_ioctl at ffffffffc321960b [nvidia]
#22 [ffff9e1591f27e90] do_vfs_ioctl at ffffffff864719d8
#23 [ffff9e1591f27f10] sys_ioctl at ffffffff86471c71
#24 [ffff9e1591f27f50] tracesys at ffffffff869c562e (via system_call)
    RIP: 00007ff441dffb3f  RSP: 00007ff3927b4b70  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000020  RCX: ffffffffffffffff
    RDX: 00007ff3927b4d00  RSI: 00000000c020462a  RDI: 0000000000000066
    RBP: 00007ff3927b4c20   R8: 00007ff3927b4d00   R9: 00007ff3927b4d1c
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ff3927b4d00
    R13: 0000000000000066  R14: 00000000c020462a  R15: 000000000000002a
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash>

The backtrace bottoms out at ffffffff869c562e (tracesys), the system call entry path through which python3 issued the ioctl. Note that the faulting instruction itself is the exception RIP, _nv043176rm+29 inside the nvidia module, and that the address in the PANIC line, 00000005a3200018, is RAX (00000005a3200000) plus 0x18, which is consistent with dereferencing an invalid pointer held in RAX. Disassemble the tracesys address to find the corresponding source line:

crash> dis -l ffffffff869c562e
/usr/src/debug/kernel-3.10.0-1160.118.1.el7/linux-3.10.0-1160.118.1.el7.x86_64/arch/x86/kernel/entry_64.S: 644
0xffffffff869c562e <tracesys+166>:      mov    %rax,0x50(%rsp)
crash>

The disassembly gives the source file and line number, so view that file:

crash> cat -n /usr/src/debug/kernel-3.10.0-1160.118.1.el7/linux-3.10.0-1160.118.1.el7.x86_64/arch/x86/kernel/entry_64.S
   636  #ifdef CONFIG_RETPOLINE
   637          movq sys_call_table(, %rax, 8), %rax
   638          call __x86_indirect_thunk_rax
   639  #else
   640          call *sys_call_table(, %rax, 8)  # XXX: rip relative
   641  #endif
   642
   643  UNWIND_END_OF_STACK
   644          movq %rax,RAX(%rsp)
   645  1:      RESTORE_REST
   646          /* Use IRET because user could have changed frame */
   647
   648  /*
   649   * Syscall return path ending with IRET.
   650   * Has correct top of stack, but partial stack frame.
   651   */
   652  GLOBAL(int_ret_from_sys_call)

Line 644 is the instruction immediately after the indirect call through sys_call_table, that is, the point the ioctl system call would have returned to. Viewing the file locates the specific code, and the analysis ends here.
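Two numbers in the crash output above can be cross-checked with a little arithmetic: the distance between the faulting address and RAX in the exception frame, and the ioctl request that python3 passed in. The sketch below (Python, for illustration only) copies the values from the `bt` output. It assumes the standard x86_64 system call convention, where RSI in the user-mode register dump holds the second argument of `ioctl` (the request number), and the usual Linux x86 ioctl encoding (8-bit number, 8-bit type, 14-bit size, 2-bit direction). The helper name `decode_ioctl` is hypothetical.

```python
# Values copied from the crash output.
FAULT_ADDR = 0x00000005A3200018    # address in the PANIC line
RAX_AT_FAULT = 0x00000005A3200000  # RAX in the exception frame
IOCTL_REQUEST = 0xC020462A         # RSI in the user-mode register dump


def decode_ioctl(request: int) -> dict:
    """Split a Linux ioctl request number into its fields (x86 layout)."""
    return {
        "nr": request & 0xFF,                # command number within the driver
        "type": chr((request >> 8) & 0xFF),  # driver "magic" character
        "size": (request >> 16) & 0x3FFF,    # size of the argument structure
        "dir": (request >> 30) & 0x3,        # 1 = write, 2 = read, 3 = both
    }


offset = FAULT_ADDR - RAX_AT_FAULT
fields = decode_ioctl(IOCTL_REQUEST)

print(hex(offset))  # distance of the bad address from RAX
print(fields)
```

The offset comes out as 0x18, and the request decodes to command number 42 of type 'F' with a 32-byte argument transferred in both directions. Neither value identifies the root cause on its own, but both are useful details to include when reporting the crash against the nvidia driver.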
Reference: https://blog.csdn.net/weixin_44517278/article/details/134796414