NodeManager recovers the container of a killed job on restart

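Straight to the error log: it reports an exception while allocating resources to container_e11_1531648435560_0733_01_000003.
Searching further up the log shows that the corresponding application was already in the "killed by user" state.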

Container exited with a non-zero exit code 143
, ExitStatus: 143, Priority: 0], [container_e11_1531648435560_0733_01_000003, CreateTime: 1534861451578, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0], [container_e11_1531648435560_0735_01_000001, CreateTime: 1534861268376, State: COMPLETE, Capability: <memory:1024, vCores:1>, Diagnostics: Container [pid=9636,containerID=container_e11_1531648435560_0735_01_000001] is running beyond virtual memory limits. Current usage: 400.4 MB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e11_1531648435560_0735_01_000001 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 9649 9636 9636 9636 (java) 4552 1390 2524225536 102195 /etc/alternatives/java_sdk/bin/java -Xmx424m -Dbackend.checkpoint.dir=hdfs://thflink/flink/store/checkpoints/rapido-gateway -Dlog.file=/data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner
        |- 9636 9634 9636 9636 (bash) 0 0 116060160 300 /bin/bash -c /etc/alternatives/java_sdk/bin/java -Xmx424m -Dbackend.checkpoint.dir=hdfs://thflink/flink/store/checkpoints/rapido-gateway -Dlog.file=/data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner  1> /data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.out 2> /data0/hadoop/yarn/log/application_1531648435560_0735/container_e11_1531648435560_0735_01_000001/jobmanager.err

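At this point a YARN fault-tolerance parameter came to mind: yarn.nodemanager.recovery.enabled. It was true on this cluster (stock Apache Hadoop ships it as false in yarn-default.xml, though some distributions turn it on). Setting it to false stops the NodeManager from trying to recover containers on restart, which resolved the exception above.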
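For reference, a minimal sketch of what that change might look like in yarn-site.xml on the affected NodeManager. Where the file lives and how it is distributed depend on the deployment (a management tool may own it), and a NodeManager restart is assumed to be needed for the value to take effect.

```xml
<!-- yarn-site.xml: disable NodeManager container recovery on restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>false</value>
</property>
```

Note the trade-off: with recovery disabled, containers running on the node are not carried across a NodeManager restart.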

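PS: Going by the ERROR message, my first thought was insufficient resources, so I adjusted the container resource allocation. No setting I tried fixed the problem.
Out of options, I went back to the log and found that the NodeManager was trying to recover a container that no longer existed.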
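To spot this kind of stale container faster, one could scan the container status list in the log for entries still reported as RUNNING and derive their application IDs, then check those against the applications known to be killed. The following is only a sketch: the regular expression is inferred from the excerpt above, `running_containers` is a hypothetical helper, and the sample string is an abbreviated copy of that excerpt with wrapped lines joined.

```python
import re

# One "[container_..., CreateTime: ..., State: ...," entry, as seen in the excerpt.
ENTRY_RE = re.compile(r"\[(container_\w+), CreateTime: (\d+), State: (\w+),")


def running_containers(log_text):
    """Return (container_id, application_id) pairs whose state is RUNNING."""
    result = []
    for container_id, _created, state in ENTRY_RE.findall(log_text):
        if state != "RUNNING":
            continue
        parts = container_id.split("_")
        # IDs may carry an epoch field such as "e11" right after "container".
        offset = 2 if parts[1].startswith("e") else 1
        # container_e11_1531648435560_0733_01_000003
        #   -> application_1531648435560_0733
        app_id = "application_{}_{}".format(parts[offset], parts[offset + 1])
        result.append((container_id, app_id))
    return result


# Abbreviated sample taken from the log excerpt (diagnostics text shortened).
sample = (
    "[container_e11_1531648435560_0733_01_000003, CreateTime: 1534861451578, "
    "State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , "
    "ExitStatus: -1000, Priority: 0], "
    "[container_e11_1531648435560_0735_01_000001, CreateTime: 1534861268376, "
    "State: COMPLETE, Capability: <memory:1024, vCores:1>, Diagnostics: (shortened)"
)

for cid, app in running_containers(sample):
    print(cid, app)
```

Any application ID printed here that YARN already reports as killed points to a container the NodeManager should not be recovering.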
