I recently set up a cluster for a test environment, and the configuration parameters copied over from production did not fit it. Several exceptions showed up during startup and day-to-day use. Working through them one by one also deepened my understanding of HBase internals.
Suggested approach for analyzing this kind of exception:
1) Read the surrounding log context, focusing on ERROR/FATAL messages.
2) Check the source code.
3) If that still does not explain it, search with Google/Bing.
1. HMaster cannot start because split log fails
Cause: the disks were full (the cloud-host test cluster has very little disk space).
The error that surfaces is a split-log failure, but the real cause is an underlying HDFS problem.
The relevant HBase error log:
2016-05-24 14:18:19,886 ERROR org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler: Caught M_META_SERVER_SHUTDOWN, count=1
java.io.IOException: failed log splitting for hadoop-dn03,60020,1463108975915, will retry
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:84)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://mercury:8020/hbase/WALs/hadoop-dn03,60020,1463108975915-splitting] Task = installed = 5 done = 4 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:289)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:306)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:297)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:77)
        ... 4 more
The DataNode error log:
2016-05-23 13:51:13,703 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
With the disks full, HDFS cannot find enough DataNodes with free space to place block replicas, so the files written during distributed log splitting fail and the HMaster's split task keeps erroring out. Freeing up disk space resolves the issue.
2. Minimum-regionserver startup parameter set too high, so the HMaster cannot start properly
2016-05-11 21:17:10,794 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 3, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2016-05-11 21:17:10,924 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=hadoop-dn03,60020,1462972426716
2016-05-11 21:17:10,945 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 151 ms, expecting minimum of 3, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2016-05-11 21:17:10,968 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"processingtimems":17,"call":"RegionServerStartup(org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStartupRequest)","client":"10.199.245.52:58534","starttimems":1462972630912,"queuetimems":7,"class":"HMaster","responsesize":173,"method":"RegionServerStartup"}
2016-05-11 21:17:12,448 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1654 ms, expecting minimum of 3, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
Looking at the HMaster startup path (the wait loop in org.apache.hadoop.hbase.master.ServerManager), the master only continues with the rest of its initialization once a minimum number of region servers have checked in:
    int minToStart = this.master.getConfiguration().
        getInt(WAIT_ON_REGIONSERVERS_MINTOSTART, defaultMinToStart);

    while (!this.master.isStopped() && count < maxToStart &&
        (lastCountChange + interval > now || timeout > slept || count < minToStart)) {
      // Log some info at every interval time or if there is a change
      if (oldCount != count || lastLogTime + interval < now) {
        lastLogTime = now;
        String msg =
            "Waiting for region servers count to settle; currently" +
            " checked in " + count + ", slept for " + slept + " ms," +
            " expecting minimum of " + minToStart + ", maximum of " + maxToStart +
            ", timeout of " + timeout + " ms, interval of " + interval + " ms.";
        LOG.info(msg);
        status.setStatus(msg);
      }
The fix is to set hbase.master.wait.on.regionservers.mintostart to a value that matches the actual number of region servers in the cluster.
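For example, on a small test cluster you could lower the floor so the master stops waiting as soon as one region server has registered. A minimal hbase-site.xml sketch (the value 1 is only an illustration for a test setup, not a recommendation for production):

    <!-- hbase-site.xml: minimum number of region servers that must check in
         before the HMaster continues initialization.
         1 is an illustrative value for a small test cluster. -->
    <property>
      <name>hbase.master.wait.on.regionservers.mintostart</name>
      <value>1</value>
    </property>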
3. DataNode fails to start because failed.volumes.tolerated is too large
Cause: dfs.datanode.failed.volumes.tolerated was set larger than the number of disks actually present.
This parameter (default 0) is the number of failed volumes a DataNode will tolerate before it stops serving. It must be strictly smaller than the number of directories configured in dfs.datanode.data.dir, otherwise the DataNode rejects the configuration at startup.
2016-05-11 19:14:15,703 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to hadoop-nn2/10.199.245.50:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid volume failure config value: 2
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:243)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1335)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1292)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:321)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:862)
        at java.lang.Thread.run(Thread.java:722)
Set dfs.datanode.failed.volumes.tolerated to a value that fits the actual disk layout, or simply leave it unset (it defaults to 0).
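As a sketch, assuming a DataNode with two data directories (the /data/... paths below are placeholders), only 0 or 1 would be accepted here, while 2 reproduces the error above:

    <!-- hdfs-site.xml: with two data directories, failed.volumes.tolerated
         must be strictly less than 2, so only 0 or 1 is valid.
         The data.dir paths are placeholders for this sketch. -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
    </property>
    <property>
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
    </property>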