Today our YARN cluster hit a strange problem: even though resources were sufficient, submitted jobs stayed in the ACCEPTED state and never ran.
Our cluster is CDH-5.13.3-1.cdh5.13.3.p0.2. Jobs submitted to any queue under root.users (root.users.hive, among others) would not run, while jobs submitted to root.default ran normally. But we do not use root.default, so the YARN cluster was effectively out of service.
The queues under root.users had plenty of resources yet still could not run jobs, which ruled out the queues themselves as the cause.
Checking the YARN logs (under /var/log/hadoop-yarn), we found the following messages appearing frequently over roughly the past day:
2019-05-20 00:51:36,885 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1556100181928_29377 from state store.
2019-05-20 00:51:36,887 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, removing app application_1556100181928_29377 from memory:
This happens because YARN keeps completed jobs in memory (and in the state store), and the number it keeps is capped. Once that cap is reached, new jobs could no longer run, and the log lines above were printed.
The parameters related to this behavior are:
Parameter | Description |
---|---|
yarn.resourcemanager.max-completed-applications | The maximum number of completed applications the RM keeps. Default value: 10000. Default source: yarn-default.xml |
yarn.resourcemanager.state-store.max-completed-applications | The maximum number of completed applications the RM state store keeps; less than or equal to ${yarn.resourcemanager.max-completed-applications}. By default it equals ${yarn.resourcemanager.max-completed-applications}, which ensures that the applications kept in the state store are consistent with the applications remembered in RM memory. Any value larger than ${yarn.resourcemanager.max-completed-applications} will be reset to it. Note that this value impacts RM recovery performance: typically, a smaller value indicates better performance on RM recovery. Default value: ${yarn.resourcemanager.max-completed-applications}. Default source: yarn-default.xml |
yarn.resourcemanager.store.class | The class to use as the persistent store. If org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore is used, the store is implicitly fenced, meaning a single ResourceManager is able to use the store at any point in time. More details on this implicit fencing, along with setting up appropriate ACLs, are discussed under yarn.resourcemanager.zk-state-store.root-node.acl. Default value: org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore. Default source: yarn-default.xml |
yarn.resourcemanager.zk-state-store.parent-path | Full path of the ZooKeeper znode where RM state will be stored. This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore as the value of yarn.resourcemanager.store.class. Default value: /rmstore. Default source: yarn-default.xml |
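Besides cleaning up ZooKeeper, the pressure on the state store can also be reduced through configuration. A minimal yarn-site.xml sketch, assuming you want the RM state store to keep far fewer completed apps than RM memory does; the value 2000 is purely illustrative, not a recommendation from this post:

```xml
<!-- Illustrative values only; tune for your own cluster -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>10000</value>
</property>
<property>
  <!-- Keeping fewer apps in the ZK state store also speeds up RM recovery -->
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>2000</value>
</property>
```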
Check yarn.resourcemanager.store.class to see which storage backend is in use:
> grep -B 1 -A 2 yarn.resourcemanager.store.class /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop-yarn/etc/hadoop/yarn-site.xml
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
Since the cluster uses ZooKeeper as the storage backend, check in ZooKeeper how many completed jobs are recorded:
> echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/zookeeper/bin/zkCli.sh | grep application_ | awk -F , '{print NF}'
100040
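To see why `awk -F , '{print NF}'` yields the application count: zkCli's `ls` prints all children of a znode on a single line as a bracketed, comma-separated list, so the number of comma-separated fields equals the number of znodes. A minimal sketch with made-up application IDs:

```shell
# Simulated single-line output of zkCli's `ls` (application IDs are fabricated)
line="[application_1556100181928_00001, application_1556100181928_00002, application_1556100181928_00003]"
# Each znode is one comma-separated field, so NF equals the app count
echo "$line" | awk -F , '{print NF}'
# → 3
```

100040 znodes against a 10000-app limit shows how far the state store had outgrown its cap.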
Generate the commands to delete the znodes under /rmstore/ZKRMStateRoot/RMAppRoot, writing them to a file (the name rm_apps.txt below is an arbitrary stand-in):
echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" |
/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/zookeeper/bin/zkCli.sh |
grep application_ |
while read item; do echo ${item#*[}; done |
while read item; do echo ${item%*]}; done |
awk -F ', ' '{ for (i=1;i<=NF;i++) printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n",$i}' > rm_apps.txt
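The final awk step turns each application ID into one `rmr` command per line. Demonstrated on a fabricated two-app input (IDs are made up):

```shell
# Each comma-separated field becomes one `rmr <znode>` line
echo "application_1556100181928_00001, application_1556100181928_00002" |
  awk -F ', ' '{ for (i=1;i<=NF;i++) printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n",$i }'
# → rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_1556100181928_00001
# → rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_1556100181928_00002
```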
Attention:
Currently RUNNING jobs also appear under the /rmstore/ZKRMStateRoot/RMAppRoot znode, so take care not to delete them. When generating the commands, cross-check against ***yarn application -list*** to filter out jobs in the RUNNING state.
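One way that filter can be sketched, assuming a file of all IDs found in ZooKeeper and a file of RUNNING IDs extracted from `yarn application -list` (file names and IDs here are illustrative):

```shell
# Illustrative inputs; in practice all_apps.txt comes from the zkCli listing
# and running_apps.txt from `yarn application -list` output
printf 'application_1556100181928_00001\napplication_1556100181928_00002\n' > all_apps.txt
printf 'application_1556100181928_00002\n' > running_apps.txt

# -f reads patterns from running_apps.txt, -x forces whole-line matches,
# -v inverts: the result is every app that is NOT currently running
grep -v -x -f running_apps.txt all_apps.txt
# → application_1556100181928_00001
```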
Execute the deletion by piping the generated command file (rm_apps.txt above, or whatever name you used) into zkCli:
cat rm_apps.txt | /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/zookeeper/bin/zkCli.sh
Published 2024-01-28 12:18:48 at https://www.4u4v.net/it/17064155317367.html