
MIP-424 handle node's rabbitmq being down or going down during algorithm execution

Kostas FILIPPOPOLITIS requested to merge dev/MIP-424_handle_rabbitmq_down into master

Created by: apmariglis

When the broker (RabbitMQ) is down, a call to apply_async(..) (which queues a task, like delay(..)) raises an OperationalError or a ConnectionResetError. The same happens when calling the get(..) method on an AsyncResult (which returns the actual result of the task).

In node_tasks_handler_celery.py these exceptions are caught and re-raised as a (custom) ClosedBrokerConnectionError, whose message states which task could not be executed because of the closed broker connection, and on which node.

The ClosedBrokerConnectionError propagates to the AlgorithmExecutor, which terminates the execution of the current algorithm: it catches the ClosedBrokerConnectionError and raises an AlgorithmExecutionException. That exception is in turn caught by the webapi exception handler module (error_handlers.py), which returns an ALGORITHM_EXUCUTION_ERROR along with a special HTTPStatusCode signaling that the algorithm execution failed. The response carries no internal information about which node caused the problem or any other detail, only that the algorithm execution failed.

As for the NodeRegistry, a node whose broker is down is simply not returned as an active node in the system until its broker comes back up.
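For reviewers, the exception chain described above can be sketched roughly as follows. This is a minimal illustration and not the code in this merge request: the helper names call_broker and run_algorithm, the constructor arguments, and the message texts are all assumptions made for the example. Only the exception class names and the two caught exception types come from the description. OperationalError is imported from kombu when available, and a local stand-in is used otherwise so the sketch runs without a Celery installation.

```python
from typing import Any, Callable, Iterable, List

try:
    # Exception type raised through Celery/kombu when the broker is unreachable.
    from kombu.exceptions import OperationalError
except ImportError:
    class OperationalError(Exception):
        """Stand-in, used only when kombu is not installed."""


class ClosedBrokerConnectionError(Exception):
    """Internal error: records which task failed and on which node."""

    def __init__(self, node_id: str, task_name: str) -> None:
        self.node_id = node_id
        self.task_name = task_name
        super().__init__(
            f"Connection to the broker of node '{node_id}' was closed "
            f"while executing task '{task_name}'."
        )


class AlgorithmExecutionException(Exception):
    """Generic error surfaced to the web API layer."""


def call_broker(node_id: str, task_name: str,
                broker_call: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    # Hypothetical helper: wraps a broker-facing call such as
    # task.apply_async(...) or async_result.get(...).
    try:
        return broker_call(*args, **kwargs)
    except (OperationalError, ConnectionResetError) as exc:
        raise ClosedBrokerConnectionError(node_id, task_name) from exc


def run_algorithm(steps: Iterable[Callable[[], Any]]) -> List[Any]:
    # Hypothetical stand-in for the executor: the first broker failure
    # aborts the algorithm, and node details are kept out of the message.
    try:
        return [step() for step in steps]
    except ClosedBrokerConnectionError as exc:
        raise AlgorithmExecutionException("The algorithm execution failed.") from exc
```

In this sketch the node and task names stay available to internal logging through the chained exception (`__cause__`), while the message that would reach the API client remains generic.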
