Several HBase/DataNode exceptions

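I recently set up a cluster for the test environment. The configuration parameters were copied over from production and turned out not to fit, so several exceptions came up during startup and use. Working through them one by one also deepened my understanding of HBase internals.
My suggested approach to analyzing this kind of problem:
1) Read the log context, focusing on ERROR/FATAL entries.
2) Read the source code.
3) If all else fails, search Google/Bing.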
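
For step 1, a small filter that keeps only the ERROR/FATAL entries makes a long log much easier to scan. The following is a minimal sketch, assuming the log4j line layout seen in the logs in this post (timestamp, then level); the class name SevereLogFilter is made up for illustration and is not part of HBase or Hadoop.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HBase or Hadoop.
class SevereLogFilter {
    // Assumes lines that start like "2016-05-24 14:18:19,886 ERROR ...".
    private static final Pattern SEVERE = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3} (ERROR|FATAL) .*");

    // Returns only the lines logged at ERROR or FATAL level.
    static List<String> severeLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (SEVERE.matcher(line).matches()) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SevereLogFilter <logfile>");
            return;
        }
        for (String line : severeLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(line);
        }
    }
}
```

Stack trace lines are dropped by this filter, so once an interesting entry is found it is still worth opening the log at that timestamp to read the surrounding context.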

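1. HMaster fails to start because of a split log exception

Cause: the disk was full. (The cloud-host test cluster has small disks.)
The visible error is a split log failure, but the real cause is a failure in the underlying HDFS.
The HBase error log: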

2016-05-24 14:18:19,886 ERROR org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler: Caught M_META_SERVER_SHUTDOWN, count=1
java.io.IOException: failed log splitting for hadoop-dn03,60020,1463108975915, will retry
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:84)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://mercury:8020/hbase/WALs/hadoop-dn03,60020,1463108975915-splitting] Task = installed = 5 done = 4 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:289)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:306)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:297)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:77)
        ... 4 more

Error log on the HDFS side (this warning is emitted by the NameNode's block placement code):

2016-05-23 13:51:13,703 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
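
Since the root cause here was a full disk, checking free space on the data directories first can save a lot of time spent on the split log error itself. The following is a rough sketch in plain Java; the class name DiskSpaceCheck and the 95% threshold are assumptions made for illustration, not anything provided by HBase or Hadoop.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of HBase or Hadoop.
class DiskSpaceCheck {
    // Fraction of the volume that is in use, between 0 and 1.
    static double usedFraction(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0; // path does not exist or size is unknown
        }
        return 1.0 - (double) usableBytes / (double) totalBytes;
    }

    // True when usage has reached the given threshold (e.g. 0.95 for 95%).
    static boolean isNearlyFull(long totalBytes, long usableBytes, double threshold) {
        return usedFraction(totalBytes, usableBytes) >= threshold;
    }

    public static void main(String[] args) {
        // Pass the data directories as arguments; defaults to the current directory.
        String[] paths = args.length > 0 ? args : new String[] {"."};
        for (String path : paths) {
            File dir = new File(path);
            long total = dir.getTotalSpace();
            long usable = dir.getUsableSpace();
            System.out.printf("%s: %.1f%% used%s%n", path,
                100 * usedFraction(total, usable),
                isNearlyFull(total, usable, 0.95) ? "  <-- nearly full" : "");
        }
    }
}
```

Run it on each DataNode host with the directories configured in dfs.datanode.data.dir as arguments.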



2. HMaster cannot start properly because the minimum regionserver count is set too high

2016-05-11 21:17:10,794 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 3, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2016-05-11 21:17:10,924 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=hadoop-dn03,60020,1462972426716
2016-05-11 21:17:10,945 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 151 ms, expecting minimum of 3, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2016-05-11 21:17:10,968 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"processingtimems":17,"call":"RegionServerStartup(org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStartupRequest)","client":"10.199.245.52:58534","starttimems":1462972630912,"queuetimems":7,"class":"HMaster","responsesize":173,"method":"RegionServerStartup"}
2016-05-11 21:17:12,448 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1654 ms, expecting minimum of 3, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.

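The HMaster startup code shows that the master continues with the rest of its initialization only after a certain number of regionservers have checked in. In the log above only 1 regionserver has checked in while the expected minimum is 3, so the master keeps waiting.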

int minToStart = this.master.getConfiguration().
    getInt(WAIT_ON_REGIONSERVERS_MINTOSTART, defaultMinToStart);
while (!this.master.isStopped() && count < maxToStart
    && (lastCountChange+interval > now || timeout > slept || count < minToStart)) {
  // Log some info at every interval time or if there is a change
  if (oldCount != count || lastLogTime+interval < now){
    lastLogTime = now;
    String msg =
      "Waiting for region servers count to settle; currently"+
        " checked in " + count + ", slept for " + slept + " ms," +
        " expecting minimum of " + minToStart + ", maximum of "+ maxToStart+
        ", timeout of "+timeout+" ms, interval of "+interval+" ms.";
    LOG.info(msg);
    status.setStatus(msg);
  }
  // ... (the excerpt ends here; the rest of the loop body is omitted)
}

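Fix: set hbase.master.wait.on.regionservers.mintostart to a reasonable value, no larger than the number of regionservers that will actually start in this cluster.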
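
To see why the timeout does not rescue the master, the loop condition quoted above can be modeled on its own. This is a stand-alone sketch that copies only the boolean expression; the class and method names are invented for illustration, and the numbers come from the log (minimum 3, timeout 4500 ms, interval 1500 ms).

```java
// Hypothetical stand-alone model of the wait condition; not HBase code.
class RegionServerWait {
    // Same boolean expression as the while condition in the excerpt.
    static boolean keepWaiting(boolean masterStopped, int count,
                               int minToStart, int maxToStart,
                               long lastCountChange, long interval, long now,
                               long timeout, long slept) {
        return !masterStopped && count < maxToStart
            && (lastCountChange + interval > now
                || timeout > slept
                || count < minToStart);
    }

    public static void main(String[] args) {
        // One minute after startup: timeout and interval have long expired.
        long now = 60_000, lastCountChange = 0, slept = 60_000;
        // 1 regionserver checked in, minimum of 3: still waiting.
        System.out.println(keepWaiting(false, 1, 3, Integer.MAX_VALUE,
            lastCountChange, 1500, now, 4500, slept));
        // Same situation with the minimum lowered to 1: the wait ends.
        System.out.println(keepWaiting(false, 1, 1, Integer.MAX_VALUE,
            lastCountChange, 1500, now, 4500, slept));
    }
}
```

This prints true and then false. Because count < minToStart is joined with OR, the loop keeps running while fewer than the minimum have checked in, no matter how much time has passed.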

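3. DataNode fails to start because failed.volumes.tolerated is too large
Cause: the value of dfs.datanode.failed.volumes.tolerated is larger than the actual number of disks.
This parameter is the maximum number of failed disks a DataNode will tolerate, and its default is 0. It must not exceed the number of directories configured in dfs.datanode.data.dir. As far as I can tell from the startup check in FsDatasetImpl, the Hadoop 2.x line used here is stricter still and also rejects a value equal to that number.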

2016-05-11 19:14:15,703 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to hadoop-nn2/10.199.245.50:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid volume failure  config value: 2
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:243)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1335)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1292)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:321)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:862)
        at java.lang.Thread.run(Thread.java:722)

Fix: set dfs.datanode.failed.volumes.tolerated to a reasonable value, or leave it unset.
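
A config copied from production can be sanity-checked before the DataNode is started. The sketch below is a hypothetical pre-flight check, not Hadoop code: it assumes the rule is 0 <= tolerated < number of configured data directories (my reading of the Hadoop 2.x check, which may differ in other versions), and the directory paths are made-up examples.

```java
import java.util.Arrays;

// Hypothetical pre-flight check for illustration; not Hadoop code.
class VolumeToleranceCheck {
    // Counts the non-empty entries of a comma-separated dfs.datanode.data.dir value.
    static int countDataDirs(String dataDirValue) {
        return (int) Arrays.stream(dataDirValue.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .count();
    }

    // Assumed rule: 0 <= tolerated < number of configured directories.
    static boolean isValid(int tolerated, String dataDirValue) {
        return tolerated >= 0 && tolerated < countDataDirs(dataDirValue);
    }

    public static void main(String[] args) {
        String dirs = "/data1/dfs/dn,/data2/dfs/dn"; // example paths, hypothetical
        System.out.println(isValid(2, dirs)); // 2 tolerated with 2 dirs
        System.out.println(isValid(1, dirs)); // 1 tolerated with 2 dirs
        System.out.println(isValid(0, dirs)); // the default
    }
}
```

Under the assumed rule this prints false, true, true: a value of 2 with two data directories is rejected, which matches the "Invalid volume failure config value: 2" error above.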
