Configuration & Administration

Graphs not working on Zenoss 6

  • 1.  Graphs not working on Zenoss 6

    Posted 12 days ago

    We're running Zenoss 6 and we're having issues with the behavior of the performance graphs.

    First issue: the graphs often show a "Request timeout" message. After refreshing the page, they sometimes come back.

    Second issue: even when the graphs are displayed, the data isn't always there; we only see patches of data. Generally after a restart, data gets written for a while and then stops.

    We ruled out network/connectivity issues, since data from our monitoring is actually being written, just not consistently.

    When running a "zencommand run -v10 -d devID --showfullcommand" test, there are some log lines I think might help:

    2018-08-30 20:17:46,699 DEBUG zen.zencommand: _checkZenHub: entry
    2018-08-30 20:17:46,699 DEBUG zen.zencommand: ZenHub health check: ZenHub may be down.
    2018-08-30 20:17:46,699 DEBUG zen.zencommand: _signalZenHubAnswering(False)
    2018-08-30 20:17:46,699 DEBUG zen.zencommand: removing file at /opt/zenoss/var/zenhub_connected
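    For what it's worth, that last line is the collector deleting its ZenHub health flag file; the health check itself seems to be just file existence. A minimal sketch of the idea (the path comes from the log above; this is an illustration of the convention, not Zenoss's actual code):

    ```shell
    # Sketch of the health-flag convention suggested by the log above (not Zenoss source):
    # the collector treats the presence of this file as "ZenHub is answering"
    # and removes it when the health check fails.
    FLAG=/opt/zenoss/var/zenhub_connected

    if [ -f "$FLAG" ]; then
      STATUS="zenhub answering"
    else
      STATUS="zenhub may be down"
    fi
    echo "$STATUS"
    ```

    Watching whether that file keeps disappearing on the collector host is a quick way to see how often the health check is failing.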


    Also there's this:

    2018-08-30 20:17:47,999 DEBUG zen.publisher: connection closed
    2018-08-30 20:17:48,000 DEBUG zen.publisher: response was: {u'type': u'DROPPED', u'value': u'consumer is overwhelmed'}
    2018-08-30 20:17:48,000 DEBUG zen.zencommand: Removing service EventService
    2018-08-30 20:17:48,000 DEBUG zen.zencommand: Removing service Products.ZenHub.services.CommandPerformanceConfig
    2018-08-30 20:17:48,000 DEBUG zen.pbclientfactory: clientConnectionLost [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
    2018-08-30 20:17:48,000 DEBUG zen.collector.scheduler: In shutdown stage during
    2018-08-30 20:17:48,000 DEBUG zen.collector.scheduler: In shutdown stage during


    Researching that, I figured there might be an issue at the DB level. Digging into it, I found these entries in the OpenTSDB logs:

    2018-09-05 19:20:19,789 WARN  [AsyncHBase I/O Worker #1] HBaseClient: Probe Exists(table="3a0og15l1umolwkfw2ttgofe0-tsdb", key=[114, -58, -102, 91, -114, 45, 112, 0, 0, 1, 0, 0, 1, 0, 0, 13, 1, 56, -44, 0, 0, 14, 0, -111, 17, 0, 0, 15, 1, 56, -43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 58, 65, 115, 121, 110, 99, 72, 66, 97, 115, 101, 126, 112, 114, 111, 98, 101, 126, 60, 59, 95, 60], family=null, qualifiers=null, attempt=0, region=RegionInfo(table="3a0og15l1umolwkfw2ttgofe0-tsdb", region_name="3a0og15l1umolwkfw2ttgofe0-tsdb,r\x82,1519866886722.4ecd2349772222841cf2e1c186daa4ca.", stop_key="s\x83")) failed
    org.hbase.async.NonRecoverableException: Too many attempts: Exists(table="3a0og15l1umolwkfw2ttgofe0-tsdb", key=[114, -58, -102, 91, -114, 45, 112, 0, 0, 1, 0, 0, 1, 0, 0, 13, 1, 56, -44, 0, 0, 14, 0, -111, 17, 0, 0, 15, 1, 56, -43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 58, 65, 115, 121, 110, 99, 72, 66, 97, 115, 101, 126, 112, 114, 111, 98, 101, 126, 60, 59, 95, 60], family=null, qualifiers=null, attempt=11, region=RegionInfo(table="3a0og15l1umolwkfw2ttgofe0-tsdb", region_name="3a0og15l1umolwkfw2ttgofe0-tsdb,r\x82,1519866886722.4ecd2349772222841cf2e1c186daa4ca.", stop_key="s\x83"))
    at org.hbase.async.HBaseClient.tooManyAttempts(HBaseClient.java:2056) [asynchbase-1.7.2.jar:na]
    at org.hbase.async.HBaseClient.sendRpcToRegion(HBaseClient.java:1920) [asynchbase-1.7.2.jar:na]
    at org.hbase.async.HBaseClient$1RetryRpc.call(HBaseClient.java:1944) [asynchbase-1.7.2.jar:na]
    at org.hbase.async.HBaseClient$1RetryRpc.call(HBaseClient.java:1927) [asynchbase-1.7.2.jar:na]
    at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278) [async-1.4.0.jar:na]
    at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257) [async-1.4.0.jar:na]
    at com.stumbleupon.async.Deferred.callback(Deferred.java:1005) [async-1.4.0.jar:na]
    at org.hbase.async.HBaseRpc.callback(HBaseRpc.java:712) [asynchbase-1.7.2.jar:na]
    at org.hbase.async.RegionClient.decode(RegionClient.java:1533) [asynchbase-1.7.2.jar:na]
    at org.hbase.async.RegionClient.decode(RegionClient.java:88) [asynchbase-1.7.2.jar:na]
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:485) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.9.4.Final.jar:na]
    at org.hbase.async.RegionClient.handleUpstream(RegionClient.java:1223) [asynchbase-1.7.2.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.handler.timeout.IdleStateAwareChannelHandler.handleUpstream(IdleStateAwareChannelHandler.java:36) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.handler.timeout.IdleStateHandler.messageReceived(IdleStateHandler.java:294) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.9.4.Final.jar:na]
    at org.hbase.async.HBaseClient$RegionClientPipeline.sendUpstream(HBaseClient.java:3121) [asynchbase-1.7.2.jar:na]
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.9.4.Final.jar:na]
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.9.4.Final.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

    That in turn, after some research, led me to dig into HBase a little, where I found this:

    2018-09-05 15:20:57,750 WARN  [localhost,60200,1536016723348_ChoreService_1] regionserver.HeapMemoryManager: heapOccupancyPercent 0.95467097 is above heap occupancy alarm watermark (0.95)
    2018-09-05 15:20:59,293 INFO  [localhost,60200,1536016723348_ChoreService_1] regionserver.HeapMemoryManager: heapOccupancyPercent 0.49049363 is now below the heap occupancy alarm watermark (0.95)
    2018-09-05 15:21:02,456 WARN  [B.defaultRpcServer.handler=3,queue=0,port=60200] ipc.RpcServer: B.defaultRpcServer.handler=3,queue=0,port=60200: caught a ClosedChannelException, this means that the server /0.0.0.0:60200 was processing a request but the client went away. The error message was: null
    2018-09-05 15:21:08,078 INFO  [regionserver/localhost/127.0.0.1:60200-shortCompactions-1536016739433] regionserver.HStore: Completed compaction of 3 (all) file(s) in t of 3a0og15l1umolwkfw2ttgofe0-tsdb,\xE6\xF6,1519866886722.7bc04cd605994233f01b154a797279a3. into 6d084f225bd240e7aeb09818e9eb7bcc(size=126.0 M), total size for store is 126.0 M. This selection was in queue for 0sec, and took 10sec to execute.

    And this:

    2018-09-05 20:58:47,034 WARN  [B.defaultRpcServer.handler=9,queue=0,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1536181078962,"responsesize":7287,"method":"Multi","param":"region= 3a0og15l1umolwkfw2ttgofe0-tsdb,\\x00\\x10,1519866886722.ed54a51af5742d563637c8ef0b3664be., for 119 actions and 1st row key=\\x00\\x16\\x8B[\\x8F\\xFD\\x80\\x00\\x00\\x01\\x00\\x00\\x01\\x00\\x00\\x0D\\x00\\x93\\xB3\\x00\\x00\\x0E\\x00\\x91\\x04\\x00\\x00\\x0F\\x00\\x93\\xB4","processingtimems":15814,"client":"10.17.0.1:40306","queuetimems":3,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,047 WARN  [PriorityRpcServer.handler=5,queue=1,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1536181100974,"responsesize":570,"method":"Get","param":"region= hbase:meta,,1, row=3a0og15l1umolwkfw2ttgofe0-tsdb,r\\xE71[\\x83\\x06\\xC0\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00:AsyncHBase~probe~<;_<,:","processingtimems":24669,"client":"10.17.0.1:40708","queuetimems":8058,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,034 WARN  [B.defaultRpcServer.handler=20,queue=2,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1536181078962,"responsesize":7294,"method":"Multi","param":"region= 3a0og15l1umolwkfw2ttgofe0-tsdb,\\x00\\x10,1519866886722.ed54a51af5742d563637c8ef0b3664be., for 120 actions and 1st row key=\\x00\\xE9\\xCA[\\x90\\x19\\xA0\\x00\\x00\\x01\\x00\\x00\\x01\\x00\\x00\\x0D\\x00\\x14S\\x00\\x00\\x0E\\x00\\x12\\xA9\\x00\\x00\\x0F\\x00\\x14T","processingtimems":12318,"client":"10.17.0.1:40306","queuetimems":2,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,047 WARN  [PriorityRpcServer.handler=11,queue=1,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1536181100974,"responsesize":570,"method":"Get","param":"region= hbase:meta,,1, row=3a0og15l1umolwkfw2ttgofe0-tsdb,\\x18\\x8A\\xB2[\\x8F\\xFD\\x80\\x00\\x00\\x01\\x00\\x00\\x01\\x00\\x00\\x0D\\x01\\x95x\\x00\\x00\\x0E\\x01\\x954\\x00\\x00\\x0F\\x01\\x95y,:","processingtimems":24669,"client":"10.17.0.1:40306","queuetimems":1820,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,062 WARN  [B.defaultRpcServer.handler=29,queue=2,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1536181078961,"responsesize":7290,"method":"Multi","param":"region= 3a0og15l1umolwkfw2ttgofe0-tsdb,\\x00\\x10,1519866886722.ed54a51af5742d563637c8ef0b3664be., for 118 actions and 1st row key=\\x00\\xF9\\x7F[\\x90\\x19\\xA0\\x00\\x00\\x01\\x00\\x00\\x01\\x00\\x00\\x0D\\x00\\x10I\\x00\\x00\\x0E\\x00\\x10J\\x00\\x00\\x0F\\x00\\x10K","processingtimems":13892,"client":"10.17.0.1:40306","queuetimems":4,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,046 WARN  [PriorityRpcServer.handler=13,queue=1,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1536181100974,"responsesize":570,"method":"Get","param":"region= hbase:meta,,1, row=3a0og15l1umolwkfw2ttgofe0-tsdb,\\x8E\\x02\\xF5[\\x83\\x06\\xC0\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00:AsyncHBase~probe~<;_<,:","processingtimems":24669,"client":"10.17.0.1:40708","queuetimems":8618,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,089 WARN  [PriorityRpcServer.handler=15,queue=1,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1536181100974,"responsesize":570,"method":"Get","param":"region= hbase:meta,,1, row=3a0og15l1umolwkfw2ttgofe0-tsdb,\\xD0'\\x17[\\x83\\x06\\xC0\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00:AsyncHBase~probe~<;_<,:","processingtimems":24669,"client":"10.17.0.1:40708","queuetimems":8698,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,088 WARN  [PriorityRpcServer.handler=4,queue=0,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1536181100974,"responsesize":570,"method":"Get","param":"region= hbase:meta,,1, row=3a0og15l1umolwkfw2ttgofe0-tsdb,\\xDF\\xD3\\xE7[\\x83\\x06\\xC0\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00:AsyncHBase~probe~<;_<,:","processingtimems":24669,"client":"10.17.0.1:40708","queuetimems":8238,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,062 WARN  [PriorityRpcServer.handler=14,queue=0,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1536181100974,"responsesize":570,"method":"Get","param":"region= hbase:meta,,1, row=3a0og15l1umolwkfw2ttgofe0-tsdb,\\x18\\xD7\\xC9[\\x83\\x06\\xC0\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00:AsyncHBase~probe~<;_<,:","processingtimems":24669,"client":"10.17.0.1:40708","queuetimems":8618,"class":"HRegionServer"}
    
    2018-09-05 20:58:47,089 WARN  [B.defaultRpcServer.handler=24,queue=0,port=60200] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1536181078961,"responsesize":7293,"method":"Multi","param":"region= 3a0og15l1umolwkfw2ttgofe0-tsdb,\\x00\\x10,1519866886722.ed54a51af5742d563637c8ef0b3664be., for 121 actions and 1st row key=\\x00\\xC8~[\\x90\\x19\\xA0\\x00\\x00\\x01\\x00\\x00\\x01\\x00\\x00\\x0D\\x00/\\x14\\x00\\x00\\x0E\\x00\\x02\\xAA\\x00\\x00\\x0F\\x00\\x0D\\xD3","processingtimems":12321,"client":"10.17.0.1:40306","queuetimems":5,"class":"HRegionServer"}

    I don't fully understand the logs, but I suspect this is a resource issue, perhaps with the way the region servers are set up; maybe a bottleneck somewhere is keeping the DB from functioning properly?
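    If the heapOccupancyPercent warnings really do mean the region server is running out of heap, my understanding is that on a stock HBase install the knob lives in hbase-env.sh; something like the following (values are purely illustrative, and on a Zenoss/Control Center deployment the equivalent JVM settings would presumably be set on the HBase service rather than by editing the file directly):

    ```shell
    # hbase-env.sh -- illustrative values only; size these to the host's actual RAM.
    # HBASE_HEAPSIZE sets the default heap for the HBase daemons
    # (interpreted as MB on older releases; newer ones also accept "4G").
    export HBASE_HEAPSIZE=4096
    # Alternatively, pin the region server's heap explicitly via its JVM options.
    export HBASE_REGIONSERVER_OPTS="-Xms4g -Xmx4g"
    ```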

    I'm sorry for the long post, but I wanted to give as much info as possible; we really need to get this fixed.

    Thanks in advance to anyone willing to give us a hand in this.



    ------------------------------
    Adrian Vergara
    SRE
    Zuora
    ------------------------------


  • 2.  RE: Graphs not working on Zenoss 6

    Posted 3 days ago
    I had to radically increase the memory allocation for my region server. I didn't read all your data really closely, but it looks familiar. What are your region server's current memory usage and allocation?

    I don't think the ZenHub messages are relevant -- ZenHub isn't involved in metric transport, only modeling information and event processing.
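    If you want a quick number, something along these lines will show what the region server JVM is actually holding resident (a rough sketch for a Linux host; it just looks for an HRegionServer process):

    ```shell
    # Rough check of region server memory usage on a Linux host (illustrative).
    # Finds the first HRegionServer JVM and prints its resident set size in KB.
    RS_PID=$(pgrep -f HRegionServer | head -n 1)

    if [ -n "$RS_PID" ]; then
      RS_RSS_KB=$(ps -o rss= -p "$RS_PID")
      echo "HRegionServer pid=$RS_PID rss_kb=$RS_RSS_KB"
    else
      echo "no HRegionServer process found"
    fi
    ```

    Compare that against the heap you've allocated (-Xmx); if occupancy keeps brushing the alarm watermark like in your logs, the heap is too small for the write load.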

    ------------------------------
    Luke Lofgren
    Infrastructure Architect
    Acxiom Corporation (home based associate)
    Waterford PA
    ------------------------------