Configuration & Administration

Zenoss 6.2.1, Zope stops answering on its own, unprovoked

  • 1.  Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 12-18-2018 08:34 AM
    Edited by Jad Baz 12-18-2018 08:35 AM
    Hello,

    I'm having a weird problem and I can't get to the bottom of it.

    Zope stops answering after some time



    The failing healthcheck is (where 9080 is the Zope exposed port):
    curl -A 'Zope answering healthcheck' --retry 3 --max-time 2 -s http://localhost:9080/zport/ruok | grep -q imok
    This simply times out.
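
    To narrow it down, the same request can be repeated by hand with curl's timing variables and a longer timeout (a rough check, same URL and port as the healthcheck above); it shows whether the connection itself hangs or whether Zope accepts the connection and then never answers:

    # Same healthcheck, but with timing details and a longer timeout
    curl -A 'Zope answering healthcheck' -s -o /dev/null --max-time 30 \
         -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
         http://localhost:9080/zport/ruok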

    So we run netstat inside the container:
    [root@testcontroller ~]# serviced service attach zope netstat -tlnp | sort -n
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:11211           0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:11212           0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:15672           0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:44001           0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:5042            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:5043            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:5443            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:5601            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:5672            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:8084            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:8444            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:8789            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:8983            0.0.0.0:*               LISTEN      1/serviced-controll
    tcp        0      0 0.0.0.0:9080            0.0.0.0:*               LISTEN      -
    tcp6       0      0 :::22350                :::*                    LISTEN      1/serviced-controll
    tcp6       0      0 :::443                  :::*                    LISTEN      1/serviced-controll


    Port 9080 is effectively down: it shows as listening, but nothing behind it answers.
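
    Next time it happens, a couple of extra checks might help (rough sketches using the same serviced service attach pattern, assuming ps is available in the container): whether the Zope worker process and its threads are still alive even though the port no longer answers, and whether connections to 9080 are piling up half-closed.

    # Is the Zope worker (a Python process) still alive, and how many threads does it have?
    serviced service attach zope ps -eLf | grep -i python

    # Any connections stuck on 9080 (e.g. piled up in CLOSE_WAIT)?
    serviced service attach zope netstat -tn | grep 9080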

    The Zope logs don't show anything unusual.
    The last events in Kibana (Z2.log):

    December 18th 2018, 03:28:03.000
    127.0.0.1 - Anonymous 18/Dec/2018:01:28:03 +0000 "GET /zport/ruok HTTP/1.1" 200 178 "" "Zope answering healthcheck"
    December 18th 2018, 03:28:00.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:28:00 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:55.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:55 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:50.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:50 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:45.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:45 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:45.000
    127.0.0.1 - Anonymous 18/Dec/2018:01:27:45 +0000 "GET /zport/ruok HTTP/1.1" 200 178 "" "Zope answering healthcheck"
    December 18th 2018, 03:27:40.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:40 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:39.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:39 +0000 "GET /zport/dmd/zenossStatsView/ HTTP/1.1" 200 806 "" "python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-957.1.3.el7.x86_64"
    December 18th 2018, 03:27:39.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:39 +0000 "GET /zport/dmd/zenossStatsView/ HTTP/1.1" 200 806 "" "python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-957.1.3.el7.x86_64"
    December 18th 2018, 03:27:35.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:35 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:30.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:30 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:30.000
    127.0.0.1 - Anonymous 18/Dec/2018:01:27:30 +0000 "GET /zport/ruok HTTP/1.1" 200 178 "" "Zope answering healthcheck"
    December 18th 2018, 03:27:28.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:28 +0000 "GET /robots.txt HTTP/1.1" 200 221 "" "Zenoss ready healthcheck"
    December 18th 2018, 03:27:24.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:24 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    December 18th 2018, 03:27:19.000
    172.17.0.1 - Anonymous 18/Dec/2018:01:27:19 +0000 "GET / HTTP/1.1" 302 190 "" "ZProxy_answering Healthcheck"
    So from the looks of it, there was regular traffic to Zope and then it suddenly stopped.
    The only other log in /opt/zenoss/log is zeneventserver.log, and its last entry is from one day before the crash.

    From the above, I can only guess that this is a performance issue, since there was not a single error anywhere.
    The thing is, system memory, serviced memory, and Zope memory all looked fine. A 3-hour snapshot of Zope's memory leading up to the crash shows nothing unusual either.



    Moreover, MetricShipper is not answering as a result of this. The error that comes up is:
    Unable to connect to consumer ws://localhost:8080/ws/metrics/store
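
    The consumer URL in that error goes through port 8080, so a quick reachability check from the same host helps tell whether the whole 8080 endpoint is down or just the websocket upgrade (a rough check only: a plain GET won't complete the websocket handshake, but it shows whether anything answers on that port and path at all):

    # Does anything answer on 8080 at all? (prints only the HTTP status code)
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 http://localhost:8080/
    # And the consumer path from the MetricShipper error
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 http://localhost:8080/ws/metrics/store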


    The final thing I'll say is that this has happened before, and a simple restart of Zope does the job. However, we can't keep waiting for it to fail in production and hitting restart. What this does indicate is that it is not a configuration issue, a state issue, or some persistent error; if it were, it would persist across restarts.

    So I'm debugging what looks like a performance issue and don't know what else to try.
    The thing is, I've done load testing for a few hours several times this week, and on every occasion Zope stayed up. I've looked at all the logs and there was nothing special happening around that time. It's just so random. A bit like radioactive decay!

    I'm at my wit's end here, honestly.
    Any ideas?

    ------------------------------
    Jad
    ------------------------------


  • 2.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 12-19-2018 10:52 AM
    I think I've figured it out. It was a spike in CPU usage.
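
    A crude way to catch a spike like that is to log per-process CPU from inside the zope container every minute or so (a rough sketch; assumes top is available in the container image, and the log path is arbitrary):

    # Append a timestamped CPU snapshot from inside the zope container every 60s
    while true; do
        date -Is >> /var/log/zope-cpu.log
        serviced service attach zope top -b -n 1 | head -n 20 >> /var/log/zope-cpu.log
        sleep 60
    done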

    ------------------------------
    Jad
    ------------------------------



  • 3.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked
    Best Answer

    Posted 12-26-2018 11:53 AM
    There is also currently a known issue with Zope's caching layer which can cause deadlocks. I believe there's a fix coming for that in the next release, but for the time being you can crontab a restart of your Zope instances (zope, zauth, zenapi, zenreports) every night at midnight to help prevent it from getting to the point where a deadlock can happen; see the sketch below.

    See: https://jira.zenoss.com/browse/ZEN-30762
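
    A crontab along these lines should do it (a rough sketch; adjust the serviced path, which is /usr/bin/serviced on a default install, and the schedule to suit your environment):

    # Nightly restart of the Zope-family services at midnight
    0 0 * * * /usr/bin/serviced service restart zope >/dev/null 2>&1
    0 0 * * * /usr/bin/serviced service restart zauth >/dev/null 2>&1
    0 0 * * * /usr/bin/serviced service restart zenapi >/dev/null 2>&1
    0 0 * * * /usr/bin/serviced service restart zenreports >/dev/null 2>&1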

    ------------------------------
    Ryan Matte
    ------------------------------



  • 4.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 03-27-2019 01:58 PM
    Coming back to this: it is not a CPU or memory usage issue.
    So far, I've experienced it as absolutely random.

    ------------------------------
    Jad
    ------------------------------



  • 5.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 03-29-2019 06:44 AM
    Also see the related thread: Zenoss 6.1.1 graphs show no data, zenhub and MetricShipper failing some health checks


    ------------------------------
    Jad
    ------------------------------



  • 6.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 30 days ago
    I am seeing the same issue with our Zenoss setup. When Zope stops working, MetricShipper stops working. This used to happen every 7 days, and now it seems to just be random.

    ------------------------------
    Adan Mendoza
    VTX1
    Raymondville TX
    ------------------------------



  • 7.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 30 days ago
    This is a known issue where, over time, Zope threads become unresponsive. We've made several changes to address this in recent versions of the code base. For the time being you can just schedule a Zope restart once a night during off-hours, which should prevent the issue from occurring during normal operation. You'll want to restart all of the Zope services (zope, zauth, zenapi, zenreports); see the sketch below.
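
    If you'd rather not restart blindly every night, a small watchdog run from cron can restart the Zope services only when the ruok healthcheck from earlier in this thread stops answering (a rough sketch; the port, path, and service names are the ones used above):

    #!/bin/bash
    # Restart the Zope-family services only if the ruok healthcheck fails
    if ! curl -A 'Zope answering healthcheck' --retry 3 --max-time 5 -s \
            http://localhost:9080/zport/ruok | grep -q imok; then
        for svc in zope zauth zenapi zenreports; do
            /usr/bin/serviced service restart "$svc"
        done
    fi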

    ------------------------------
    Ryan Matte
    ------------------------------



  • 8.  RE: Zenoss 6.2.1, Zope stops answering on its own, unprovoked

    Posted 28 days ago
    I have used cron to restart the UI every day at 1am using the command:

    /usr/bin/serviced service restart "User Interface"
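
    The crontab entry for that (daily at 1am) looks like this:

    0 1 * * * /usr/bin/serviced service restart "User Interface"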

    ------------------------------
    jstanley
    ------------------------------