Summary of Handling a Production Incident

Source: Internet. Published by: nb仿真实验室软件. Editor: 程序博客网. Time: 2024/06/10 08:32

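1. Some production machines raised alerts on CPU idle (the share of time the CPU spends idle) and load avg (system load). This is not quite like a business-logic problem: with a business-logic error you can at least read the error log, or check the upstream service response times monitored in Zabbix.

So what was causing it? We have to know what the system is actually doing right now. The obvious approach is to use the JVM tools: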

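1. jstack: what is each thread in the process doing?

2. jmap: is memory full? (young generation, old generation, and so on)

3. jstat: are GCs too frequent, and how much time do they take?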
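
As a rough illustration of the jstat check, the sketch below pulls the young-GC and full-GC counts out of `jstat -gcutil` output. The helper name `gc_counts` is made up for this example, and it looks columns up by header name because the column set of `-gcutil` is not identical across JDK versions.

```shell
#!/bin/bash
# Hypothetical helper: read `jstat -gcutil` output on stdin and print the
# young-GC (YGC) and full-GC (FGC) counts from the last sample row.
gc_counts() {
    awk '
        # First line is the header: remember the position of every column name.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Every later non-empty line is a sample; keep the most recent values.
        NF > 0  { ygc = $(col["YGC"]); fgc = $(col["FGC"]) }
        END     { print "YGC=" ygc, "FGC=" fgc }
    '
}

# Typical use on a live process (12345 is a placeholder PID):
#   jstat -gcutil 12345 1000 5 | gc_counts
```

Running it twice a few seconds apart and comparing the counts gives a quick sense of whether GC is unusually frequent, although, as the post goes on to say, that is still only a symptom.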

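But here is the catch. jstack can dump the state of every thread, but which ones are the busy ones? With user threads, garbage-collection threads and others all mixed together, it is hard to tell. jmap shows the memory state of each region and statistics on live objects, but why a given object has not been collected is also hard to see directly. jstat shows GC counts and times, but those are only symptoms; the root cause is still not found. All of this is harder when you are handling a production problem and not entirely calm.

There is still a way to tackle the problem: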

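1. Find the process ID, for example: ps aux | grep java

2. Check the CPU usage of each thread in the process and pick the ID of the busiest thread: top -Hp <pid>

3. Convert the thread ID to hexadecimal: printf "%x\n" <thread id>

4. Now jstack finally comes on stage: run jstack <pid> to print the stacks of all threads in the process, then grep for the hexadecimal thread ID.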
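
The four steps can be strung together roughly as follows. This is a minimal sketch, assuming Linux procps `ps` and a JDK `jstack` on PATH; the function names `tid_to_nid` and `busiest_thread_stack` are invented for the example. It greps for the `nid=0x...` form, the same form the script at the end of this post matches; if your JDK prints the native thread ID differently, adjust the pattern.

```shell
#!/bin/bash
# Convert a decimal Linux thread id (LWP) into the "nid=0x..." form.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Print the top of the stack of the busiest thread of one Java process.
#   $1: process id
busiest_thread_stack() {
    local pid="$1" tid
    # -L lists threads; lwp is the thread id, pcpu its CPU percentage.
    tid=$(ps -L -p "$pid" -o lwp,pcpu --no-headers |
          sort -k2 -n -r | head -n 1 | awk '{print $1}')
    [ -n "$tid" ] || { echo "no threads found for pid $pid" >&2; return 1; }
    # Trailing space keeps 0x1a2 from also matching 0x1a2b.
    # 20 lines of context is an arbitrary choice for this sketch.
    jstack "$pid" | grep -A 20 "$(tid_to_nid "$tid") "
}

# Usage (12345 is a placeholder PID):
#   busiest_thread_stack 12345
```

One caveat: `ps` reports CPU usage averaged over the thread's lifetime, while `top -Hp` shows current usage, so for a short spike the two can disagree about which thread is busiest.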

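This pinpoints which part of our code the most CPU-hungry thread is running, and from there you can follow the trail to the problem.

In this incident, the stack output (a screenshot in the original post) showed idService.putPoiIdsInfoIntoCache() being called very frequently. The cause was a legacy issue that had never been cleaned up: whenever a new POI was created or POI information changed, the job module sent the poiId to the thrift module through a zeromq queue so that thrift would update that poiId. This mechanism is no longer needed, because the data is now queried directly from the database. When the job module sends messages frequently, it puts pressure on the thrift module, which may not be able to keep up. (After the logic was changed the alerts cleared; the follow-up impact still needs watching.)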

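PS:

The steps above can be wrapped in a script and run on the VM to see what the top N busiest threads in a process are doing. It is kept here for future troubleshooting: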

#!/bin/bash
# @Function
# Find out the highest cpu consumed threads of java, and print the stack of these threads.
#
# @Usage
#   $ ./show-busy-java-threads.sh
#
# @author Jerry Lee

PROG=`basename $0`

usage() {
    cat <<EOF
Usage: ${PROG} [OPTION]...
Find out the highest cpu consumed threads of java, and print the stack of these threads.
Example: ${PROG} -c 10

Options:
    -p, --pid       find out the highest cpu consumed threads from the specified java process,
                    default from all java process.
    -c, --count     set the thread count to show, default is 5
    -h, --help      display this help and exit
EOF
    exit $1
}

ARGS=`getopt -n "$PROG" -a -o c:p:h -l count:,pid:,help -- "$@"`
[ $? -ne 0 ] && usage 1
eval set -- "${ARGS}"

while true; do
    case "$1" in
    -c|--count)
        count="$2"
        shift 2
        ;;
    -p|--pid)
        pid="$2"
        shift 2
        ;;
    -h|--help)
        usage
        ;;
    --)
        shift
        break
        ;;
    esac
done
count=${count:-5}

redEcho() {
    [ -c /dev/stdout ] && {
        # if stdout is console, turn on color output.
        echo -ne "\033[1;31m"
        echo -n "$@"
        echo -e "\033[0m"
    } || echo "$@"
}

## Check the existence of jstack command!
if ! which jstack &> /dev/null; then
    [ -n "$JAVA_HOME" ] && [ -f "$JAVA_HOME/bin/jstack" ] && [ -x "$JAVA_HOME/bin/jstack" ] && {
        export PATH="$JAVA_HOME/bin:$PATH"
    } || {
        redEcho "Error: jstack not found on PATH and JAVA_HOME!"
        exit 1
    }
fi

# Unique prefix for the temporary jstack dump files of this run.
uuid=`date +%s`_${RANDOM}_$$

cleanupWhenExit() {
    rm /tmp/${uuid}_* &> /dev/null
}
trap "cleanupWhenExit" EXIT

# Reads "pid lwp user comm pcpu" lines on stdin and prints the stack of each thread.
printStackOfThread() {
    while read threadLine ; do
        pid=`echo ${threadLine} | awk '{print $1}'`
        threadId=`echo ${threadLine} | awk '{print $2}'`
        threadId0x=`printf %x ${threadId}`
        user=`echo ${threadLine} | awk '{print $3}'`
        pcpu=`echo ${threadLine} | awk '{print $5}'`

        # Dump each process only once, even if several of its threads are listed.
        jstackFile=/tmp/${uuid}_${pid}

        [ ! -f "${jstackFile}" ] && {
            jstack ${pid} > ${jstackFile} || {
                redEcho "Fail to jstack java process ${pid}!"
                rm ${jstackFile}
                continue
            }
        }

        redEcho "The stack of busy(${pcpu}%) thread(${threadId}/0x${threadId0x}) of java process(${pid}) of user(${user}):"
        # Trailing space after the hex id so that 0x1a2 does not also match 0x1a2b.
        sed "/nid=0x${threadId0x} /,/^$/p" -n ${jstackFile}
    done
}

[ -z "${pid}" ] && {
    ps -Leo pid,lwp,user,comm,pcpu --no-headers | awk '$4=="java"{print $0}' |
    sort -k5 -r -n | head --lines "${count}" | printStackOfThread
} || {
    # Both conditions must hold; the original "$1==pid,$4==..." was an awk range pattern.
    ps -Leo pid,lwp,user,comm,pcpu --no-headers | awk -v "pid=${pid}" '$1==pid && $4=="java"{print $0}' |
    sort -k5 -r -n | head --lines "${count}" | printStackOfThread
}
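
The part of the script that does the real work is the `sed` range expression, which prints from the thread's header line down to the next blank line. The standalone sketch below isolates that one step so it can be tried on a saved dump; `stack_of_thread` is a name made up for this example, and it assumes the dump separates threads with blank lines, as the script above also assumes.

```shell
#!/bin/bash
# Print the stack block of one thread from a saved jstack dump.
#   $1: path to the dump file
#   $2: decimal thread id (LWP), as shown by top -Hp or ps -L
stack_of_thread() {
    local dump="$1" nid
    nid=$(printf '%x' "$2")
    # Range: from the header line carrying "nid=0x<hex> " to the next empty line.
    sed -n "/nid=0x${nid} /,/^\$/p" "$dump"
}

# Usage (12345 and 6699 are placeholder ids):
#   jstack 12345 > /tmp/dump.txt
#   stack_of_thread /tmp/dump.txt 6699
```

Saving the dump to a file first, as the script does, means several busy threads of the same process can be looked up against a single consistent snapshot.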

0 0