Zenoss Core 6.2.3 runaway memory

  • 1.  Zenoss Core 6.2.3 runaway memory

    Posted 07-13-2020 03:34 AM
    Hi,

    I'm having problems with memory consumption. Memory usage increases steadily until Zenoss crashes.
    After a reboot the system is fine for about a week, until the memory is used up again.
    In general I have more problems with version 6.2.3, namely memory resources and "localhost heartbeat failure".

    I have no idea how I can solve the problem.

    In this case, I updated Zenoss from 6.2.1 to 6.2.3. My impression is that the problem did not occur with 6.2.1.


    [zenoss@scoutzen01 ~]$ free
                  total        used        free      shared  buff/cache   available
    Mem:       28651400    27626588      223300        1536      801512      628876
    Swap:      16382972    14157108     2225864

    [zenoss@scoutzen01 ~]$ free
                  total        used        free      shared  buff/cache   available
    Mem:       28651400    27768644      222736        1536      660020      486760
    Swap:      16382972    14187568     2195404
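
    To put the `free` figures above into proportion, here is a minimal Python sketch that converts them to percentages. The helper names `parse_free` and `percent` are made up for illustration; the sample text is the first `free` output from this post, and the values are assumed to be in KiB (the default unit of `free`).

```python
def parse_free(text):
    """Parse the output of `free` into {"Mem": {...}, "Swap": {...}}."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0]  # total, used, free, shared, buff/cache, available
    rows = {}
    for fields in lines[1:]:
        label = fields[0].rstrip(":")
        values = [int(v) for v in fields[1:]]
        # The Swap row only has total/used/free, so zip() stops early there.
        rows[label] = dict(zip(header, values))
    return rows


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# First `free` output from this post.
SAMPLE = """\
total used free shared buff/cache available
Mem: 28651400 27626588 223300 1536 801512 628876
Swap: 16382972 14157108 2225864
"""

stats = parse_free(SAMPLE)
mem, swap = stats["Mem"], stats["Swap"]
print("RAM available: %.1f%%" % percent(mem["available"], mem["total"]))
print("Swap in use:   %.1f%%" % percent(swap["used"], swap["total"]))
```

    For the first sample this works out to roughly 2.2% of RAM still available and about 86.4% of swap in use.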

    Thanks,
    Daniel

    ------------------------------
    Daniel Vogel
    IT Infrastructure Architect
    ABC Systems AG
    Schlieren
    ------------------------------


  • 2.  RE: Zenoss Core 6.2.3 runaway memory

    Posted 07-13-2020 01:53 PM
    Daniel,

    Your best move will be to determine what's consuming the memory on your Control Center host.  If you're not already doing so, add the CC master to the /ControlCenter device class and let it self-monitor for a few hours or days.  Once you have some data built up, you can use the Component Graphs option of your CC device to see all the running services in one place:


    Choose the CC-Service component, the Memory Usage graph, and then check "all on same graph."  This will result in a very busy graph, but if any service is showing a large spike in memory consumption, it should appear as a widening track on the graph.

    (Note: my image above only shows 15 minutes of gathered data as I didn't want to wait a full day to get you a "better" graph.  For your investigation, you will likely need more data than I'm demonstrating here.)

    Since each service will list two data points ("Total RSS" and "Cache"), your graph may be difficult to read.  For my lab system, 67 CC-Services * 2 data points = 134 graph points.  If yours is similar, remember that you can hide data points by clicking on them in the graph legend.
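
    If the graph is too busy to read by eye, the same comparison can be done on numbers copied out of it. The following is only a sketch: `rank_by_growth` is a hypothetical helper, the service names are placeholders, and the readings are made-up values in MiB, not data from any real system. It ranks services by how much their "Total RSS" grew between the first and last sample.

```python
def rank_by_growth(samples):
    """Rank services by RSS growth between the first and last reading.

    samples maps a service name to a list of readings, oldest first.
    Returns a list of (service, growth) tuples, largest growth first.
    """
    growth = {
        name: series[-1] - series[0]
        for name, series in samples.items()
        if len(series) >= 2  # a single reading shows no trend
    }
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


# Made-up "Total RSS" readings in MiB, for illustration only.
SAMPLES = {
    "service-a": [512, 530, 525, 540],
    "service-b": [300, 900, 1800, 3200],
    "service-c": [1024, 1020, 1030, 1025],
}

ranking = rank_by_growth(SAMPLES)
for name, delta in ranking:
    print("%-10s grew by %5d MiB" % (name, delta))
```

    With these invented numbers, "service-b" comes out on top, which is the kind of widening track you would be looking for on the graph.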

    Once you have the memory consumption narrowed down to a service or two, let us know your findings.



    ------------------------------
    Michael J. Rogers
    Senior Instructor - Zenoss
    Austin TX
    ------------------------------



  • 3.  RE: Zenoss Core 6.2.3 runaway memory

    Posted 07-14-2020 06:47 AM
    Michael,

    Sorry, I updated to version 6.3.2, not 6.2.3.

    I think the Control Center ZenPack is "Commercial" and not available with Zenoss Core.

    The Control Center GUI itself has performance graphs. I think I could try using those instead?

    Thanks,
    Daniel


    ------------------------------
    Daniel Vogel
    IT Infrastructure Architect
    ABC Systems AG
    Schlieren
    ------------------------------



  • 4.  RE: Zenoss Core 6.2.3 runaway memory

    Posted 07-14-2020 06:56 AM
    ...one more thing.

    In Control Center I also see a missing health check for "MetricShipper".
    It is very difficult to get this service to start successfully.



    Thanks,
    Daniel


    ------------------------------
    Daniel Vogel
    IT Infrastructure Architect
    ABC Systems AG
    Schlieren
    ------------------------------



  • 5.  RE: Zenoss Core 6.2.3 runaway memory

    Posted 07-15-2020 12:35 PM
    Daniel,

    My apologies!  I was in such a rush to respond that I forgot the CC ZenPack wasn't Open Source.

    You absolutely can use the graphs in CC to determine which service/container is chewing up your RAM.  The memory utilization metrics shown in CC are the same ones polled by the CC ZenPack, so the data is fine (if slightly more tedious to inspect).

    If you click on the MetricShipper service name, its overview page in CC should provide you with a link to its logs:


    (You can use that same shortcut button on most services.)

    The "store_answering" healthcheck simply curls http://localhost:8080/ping/status/metrics and checks for a 200, throwing a failure on any other return.  The "store" in question would be the MetricConsumer service, which is the next step in the performance metric pipeline.  If the MetricShipper logs don't give you any good clues, check the MetricConsumer logs (and/or give the service a restart).
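
    As a rough illustration of what that healthcheck amounts to, here is a minimal Python sketch. It is not the actual Control Center implementation (which, as described above, uses curl); the function name is simply borrowed from the healthcheck, and the timeout value and proxy bypass are assumptions of mine.

```python
import urllib.error
import urllib.request

# URL taken from the healthcheck description in this post.
DEFAULT_URL = "http://localhost:8080/ping/status/metrics"


def store_answering(url=DEFAULT_URL, timeout=5.0):
    """Return True only if the endpoint answers with HTTP 200."""
    # Assumption: the check targets a local service, so skip any proxy
    # configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 4xx/5xx responses, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    print("healthy" if store_answering() else "failing")
```

    Run from wherever the healthcheck itself runs (an assumption: inside the MetricShipper container), a "failing" result would point at MetricConsumer not answering rather than at MetricShipper itself.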

    If the logs are less than obvious, please feel free to paste your findings here.




    ------------------------------
    Michael J. Rogers
    Senior Instructor - Zenoss
    Austin TX
    ------------------------------