Zenoss Core 4.2.5 SUP 457.
I am writing a ZenPack that uses PythonCollector 1.7.3 and creates a dsplugin - in fact, three dsplugins.
I have two test devices - very similar - both have the same templates bound that use these dsplugins (one is a device template; two are component templates).
I am seeing very quirky behaviour when one, or sometimes both, of the devices is down. The device goes down, the PythonDataSourcePlugin exceeds the blockingtimeout (which I have raised to 10s), and a datasource is disabled and added to /opt/zenoss/var/zenpython.blocked. I have definitely seen this with one device down and the other device still up. Generally it is only one of the three plugins that is disabled.
I have also seen a slightly more understandable behaviour when both devices went down and all three plugins were disabled. Still not good - the devices may all be behind a single point of failure.
I think the events that get raised are also similarly quirky.
Can anyone else contribute to this testing so we can raise a sensible ticket?
Jane Curry  Web: http://www.skills-1st.co.uk
Just done some testing with the redis ZenPack - https://github.com/jcurry/ZenPacks.community.zplib.Redis .
Again, when a device is uncontactable, the datasource is permanently disabled in /opt/zenoss/var/zenpython.blocked.
What you're describing sounds like the intended behavior of the new "blockingtimeout" option and feature in PythonCollector 1.7.3. That may make it sound like a bad feature, but let me explain exactly what it does and why you might be seeing that behavior.
A few months ago we had several users on PythonCollector 1.7.1 report problems with zenpython not collecting to varying degrees. In some cases it would appear to get completely deadlocked and stop collecting all datasources for all devices, and a /Status/Heartbeat event would eventually be created indicating zenpython was no longer functioning. In other cases it was more subtle, lots of tasks were missed or delayed which resulted in gaps in graphs and delays in detecting problems.
In all of these cases we eventually found the cause to be PythonDataSourcePlugin subclasses in ZenPacks (zenpython plugins) using blocking IO instead of non-blocking IO using Twisted's mechanisms. In the extreme case of a completely deadlocked zenpython the cause was a plugin that made a synchronous MySQL query that had no timeout set, so in some cases where the database would never finish responding to a query, zenpython would remain blocked and unable to perform any collection until it was restarted. In the less severe cases it was more typical things where sometimes making API calls takes many seconds or even many minutes, but even these would accumulate to making zenpython not be able to do any other collection until those blocking API calls completed.
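For concreteness, this is roughly the shape of a collect() that causes the problem (a hypothetical sketch; the sleep stands in for a synchronous MySQL query or API call with no timeout, and is shortened here so it's easy to try):

```python
import time

class BadPlugin(object):
    """Anti-pattern: collect() performs blocking IO on the reactor thread."""

    def collect(self, config):
        # Anything synchronous here -- a MySQL query with no timeout, a
        # blocking HTTP request, even time.sleep() -- freezes zenpython's
        # entire event loop until it returns. Nothing else is collected
        # in the meantime.
        time.sleep(0.2)  # stands in for the slow synchronous call
        return {"values": {}, "events": []}
```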
We released PythonCollector 1.7.2 with the "blockingwarning" option, a new "percentBlocked" datapoint, and detailed task state tracking for how long tasks spent blocked. The idea was to get a better idea of how prevalent the problem was, and which common plugins were affected. It didn't do anything to keep zenpython from getting wedged by a badly behaved plugin. We found some issues with plugins in the MySqlMonitor and AWS ZenPacks using blocking IO. We've since released an update to MySqlMonitor that fixed its plugins, and are working on an update to the AWS ZenPack that fixes its issues. That said, some users were still concerned that a badly behaved plugin could wedge zenpython.
To address this concern, PythonCollector 1.7.3 was released with the "blockingtimeout" option which defaults to 5 seconds (that may seem too short, more on that later.) The blockingtimeout option is a mechanism for zenpython to protect itself from getting wedged by a plugin that uses blocking IO, and it works as follows.
A timer is started each time the collect method of any PythonDataSourcePlugin is called. The timer is cancelled as soon as that collect method returns. If the timer reaches blockingtimeout before the collect method returns, the zenpython process completely restarts itself with the plugin that failed to return in time disabled. This may sound extreme, but for technical reasons was really the only way to recover from being blocked.
To understand why zenpython must restart itself you have to understand that zenpython is a single-threaded process that uses asynchronous IO via the Twisted library to efficiently perform lots of parallel collection. Any time non-asynchronous code is executing, no other collection or processing can be performed until it yields back to the reactor (Twisted's event loop). The catch here is that zenpython is executing arbitrary code provided by ZenPacks. The zenpython code essentially gives up control to that ZenPack code as soon as it calls it, until that code returns. The way that zenpython manages to restart itself when it calls ZenPack code that doesn't return is by running a watchdog thread that keeps an eye on that timer and restarts the process if it expires.
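That timer-plus-watchdog-thread arrangement can be sketched in a few lines (a simplified illustration only, not zenpython's actual implementation; BlockingWatchdog, its polling interval, and the on_timeout hook are all invented for the sketch):

```python
import threading
import time

class BlockingWatchdog(object):
    """Fire on_timeout if a wrapped call fails to return within timeout."""

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout  # e.g. disable the plugin and restart
        self._deadline = None
        self._lock = threading.Lock()
        threading.Thread(target=self._watch, daemon=True).start()

    def call(self, func, *args):
        # Arm the timer, call into (possibly blocking) plugin code, then
        # disarm the timer the moment the call returns.
        with self._lock:
            self._deadline = time.time() + self.timeout
        try:
            return func(*args)
        finally:
            with self._lock:
                self._deadline = None

    def _watch(self):
        # The watchdog runs in its own thread, so it keeps ticking even
        # while the main (reactor) thread is stuck inside plugin code.
        while True:
            time.sleep(0.05)
            with self._lock:
                expired = (self._deadline is not None
                           and time.time() > self._deadline)
            if expired:
                self.on_timeout()
                return
```

The key point is that the watchdog is the one thing guaranteed to keep running while the reactor thread is wedged, which is exactly why a full process restart is the recovery mechanism.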
So another question you might ask, and I think you implied, is why disable the plugin for 2 devices when it only timed out on 1 device? To understand why that is done you have to imagine a scenario where you have a plugin that executes on all Linux servers. Let's say you have 1,000 Linux servers being monitored, and let's say that 50 of them develop some condition that causes the plugin to block on them for 10 seconds. Assuming it takes zenpython only 1 minute to restart when monitoring 1,000 servers, it would take (60 seconds + ~10 seconds) * 50 servers, or nearly an hour, before zenpython would have all of the plugin+device executions disabled. During this time zenpython wouldn't be doing anything else. So we're forced to assume that if a plugin blocks once, it's not written correctly and is using blocking IO. So the plugin is completely disabled for all devices that use it.
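Working that estimate through with the hypothetical numbers above:

```python
restart_time = 60   # seconds for zenpython to restart with 1,000 devices
block_time = 10     # seconds each bad plugin+device execution blocks
bad_devices = 50    # devices on which the plugin happens to block

total_seconds = (restart_time + block_time) * bad_devices
print(total_seconds / 60.0)  # about 58 minutes of restart-disable churn
```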
So now back to why the default blockingtimeout is set to 5, which may seem way too low given everything above. To understand this, you again have to understand why the PythonDataSourcePlugin.collect() method must return a Deferred. After all, it's easy to imagine making an API call to some application that takes more than 5 seconds. When using Twisted's non-blocking networking APIs you immediately get a Deferred object returned when you make a call like getPage("http://example.com/api/really-slow-thing"), even if the response doesn't come back for 10 minutes. When the response comes back, the Deferred's callback (or errback if there's an error) is called. The PythonDataSourcePlugin.collect() must do the same thing. Usually by returning the Deferred object that something like getPage("http://...") returned to you.
If you properly use Twisted's networking libraries to perform non-blocking IO in this way, your collect method will return nearly instantaneously even if the device being monitored is down, or slow, or anything like that. So your API call can take way longer than blockingtimeout, but your collect method still returns immediately.
I'd recommend taking a look at the Weather Underground ZenPack tutorial for an example of writing PythonDataSourcePlugins that use non-blocking IO.
How can I disable this "feature"? Setting the blockingtimeout to a higher threshold does not seem to change anything.
Thanks Chet! One of my ZenPacks isn't up to par and I didn't want to downgrade.
I am noticing that zenpython stops monitoring a device because it says the datasource is not idle. What could cause this? (I am calling code with deferToThread in collect().)
DataSourcePlugin skipped because it was not idle
Need to be very careful with zenpython and deferToThread.
I ended up finding a very unique and rare case where the code I had written would hang forever. This would cause the thread to stay open. After enough of these errors, the threadpool would fill up and all the rest of the monitoring would then "stop" (or severely slow down).
Each of the datasources that were executed would sit queued waiting for the next available thread. Because of this, zenpython would still see them as active and would not run them again until they finished.
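One defensive habit that helps here (a sketch under my own assumptions, not the actual fix from the post above): give every blocking call you hand to deferToThread a hard timeout at the IO layer, so a wedged call eventually raises and frees its threadpool slot instead of hanging forever. fetch_status and its toy line protocol are invented for the example:

```python
import socket

def fetch_status(host, port):
    # create_connection's timeout is set on the socket itself, so it
    # bounds the connect and every subsequent send/recv: the call can
    # raise socket.timeout, but it can never hang forever.
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(b"STATUS\n")
        return conn.recv(4096)

# In the plugin, the bounded call is then safe to run in a thread:
#     from twisted.internet.threads import deferToThread
#     def collect(self, config):
#         return deferToThread(fetch_status, config.manageIp, 12345)
```

With a hard IO timeout, the worst case is a socket.timeout exception delivered to your errback, not a permanently occupied threadpool slot and a "not idle" datasource.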