Block Producer - Failover Approach with a BP Standby

If 2 producers from 2 different pools create a block for the same slot, then a slot battle occurs and the one with the smaller VRF value wins.
If 2 producers from the same pool, with the same VRF key, produce the block… then who wins the battle if the VRF is the same?

One gets orphaned by losing a height battle, I believe. That is, one fork is orphaned because the other is extended with the next valid block and picked up by a majority of peers. It’s a dice roll based on network propagation, if I understand this correctly. (However, I probably don’t :joy:)

Not really. To be able to run such a scenario you need pools with enough active stake to generate a large number of blocks. And I do not believe that it would threaten performance; the height-based orphaning is not that much overhead, I think. Still, it is unnecessary. And we are the ones who support Cardano. We want it to run smoothly, and not choose an architecture just because it is easy for us as individual pools. Hosting and architecture decisions need to be made in a way that supports the network.

It's the same discussion regarding hosting providers. The cheapest providers, like Contabo, are used by many pools. It's better to go for providers which are not used that heavily. Again, a conflict between individual goals (low cost) vs. overall network health (physical decentralization).

I am thinking (hypothetically) that SPOs running 2 core nodes each would spam the network with a lot of duplicate blocks which would ultimately be discarded. Not ideal.

Ah, yes of course.

I agree, although this still seems like a significant area of improvement for SPOs. Small pools are the ones who can least afford to miss a block, but they’re also least likely to have a reliable fail-over solution implemented. So I’m glad you shared your approach! It would be nice to see something like this tested, refined and adopted more broadly.

The spam would come from the relatively infrequent orphaned block… something that happens on the network already, more or less by design of the consensus protocol. I agree that increasing the frequency of it is ugly.

The only difference, operationally, between a relay and a producer is the keys. Just keep one spare running without the keys to make sure the db is always synced. During the failover, just inject the keys. You do need some kind of heartbeat between them to make sure that only ONE runs with the keys, but this should be very easy to script up.
This is exactly what I did when moving the node to a different VM.

  1. Remove keys on the current. Reboot
  2. Inject the keys on the new one. Reboot

Takes like 20 secs altogether (without the DNS switch). Just automate the above. There are many ways to do it. The easiest would be to have an environment variable like RUNAS_PRODUCER=true. Then get your startup script to query that variable at start-up and use the keys if true. Then your heartbeat just needs to toggle the variable on both sides. As I said, many ways to do it…
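
For illustration, a minimal sketch of such a startup wrapper, assuming the RUNAS_PRODUCER variable is set by the heartbeat/failover tooling; the /opt/cardano paths and the port are made-up placeholders, not anything from an actual setup:

#!/bin/bash
# Sketch of a startup wrapper: run as a block producer only when
# RUNAS_PRODUCER=true. All paths and the port are placeholders.

NODE_HOME=/opt/cardano   # hypothetical install directory

KEYS=()
if [[ "${RUNAS_PRODUCER}" == "true" ]]; then
  # Core node: start with the hot keys and the operational certificate.
  KEYS=(--shelley-kes-key "${NODE_HOME}/keys/kes.skey"
        --shelley-vrf-key "${NODE_HOME}/keys/vrf.skey"
        --shelley-operational-certificate "${NODE_HOME}/keys/node.cert")
fi

# Without the key arguments the same binary simply runs as a relay.
exec cardano-node run \
  --topology "${NODE_HOME}/topology.json" \
  --database-path "${NODE_HOME}/db" \
  --socket-path "${NODE_HOME}/db/node.socket" \
  --config "${NODE_HOME}/config.json" \
  --port 3001 \
  "${KEYS[@]}"
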
It is much easier to manage the whole thing with containers. For instance, in my case, I spawn a completely fresh node image on any non-zero exit. It checks the environment to decide whether to spawn as a relay or as a producer. It is bloody impossible to kill my node short of blowing up the data centre.

You cannot run 2 BPs at the same time because it defeats the point of the certificate. If 2 nodes are using the same certificate, it means one of your BPs is effectively "stolen". It is like 2 people using the same passport. I am not even going into the mess it is going to create when chosen as the leader. Obviously the blocks will be invalidated, as this is the core of blockchain security. Think about it… If I could run many nodes with the same certificate, I could just agree with 1000 wealthy operators to run our nodes as a single identity with the same certificate. We could, in theory, have 95% of the stake held by a single producer lol. Produce all blocks. Take all the rewards. Makes no sense.


100% agree! What I shared in this thread is my approach to exactly this:

To avoid a scenario where both nodes believe that the other node is not working, I used an external microservice which makes sure that only one node is defined as active at any given time.

Your idea of using an environment variable to execute the switch is good and fits perfectly into the approach. Maybe I'll share a working example on that basis later, as a reference for the activation and standby scripts.

Please go into it! At the very least, this scenario could happen unintentionally if a failover system malfunctions (e.g., a split horizon condition) so it would be quite helpful to understand the implications.

Sorry, I thought I explained in short that it is not possible by definition.
It just defeats the core security principle of the PoS consensus.
If that was possible, you’d have 51% attacks and all kind of stake attacks.

The point of a certificate, any certificate, is to give an entity a cryptographically protected identity. We run our producer with a certificate to ensure that there is only one producer with a specific ID. Otherwise we could just give our producer a ticker and be done with it. In this scenario, the consensus would not hold. It holds because it is extremely difficult for a single entity to accumulate over 51% of the coins. Without the certificate, or if we could share the certificate and run many nodes with the same one (which is essentially running without one), we could just group all our wealth together under a single entity.
So this answers why we cannot do that.
As for how this protection is implemented by the Cardano protocol, it doesn't really matter. Your producer ID is registered on the blockchain. Duplicate identities would be visible right away. It is akin to double spending. First, your blocks will be invalidated. My assumption is that the duplicate producers will be invalidated too, as they will be treated as compromised. I haven't tried it myself, so I cannot tell you exactly what is going to happen. If I had designed the protocol, I'd invalidate the duplicate IDs for good, so that you have to regenerate the certificate from scratch and re-register it. Not sure what Cardano does.

So let’s try to learn what Cardano does!

As with Ethereum and Bitcoin, double-spending attackers (<51%) aren’t punished by the protocol, they are simply ignored, making the attack pointless.

There’s no financial incentive to amass resources behind a single Cardano pool because rewards diminish when stake grows beyond the saturation point. I can’t think of what incentive a successful pool would have to get another pool to mint blocks on its behalf, or what incentive they would have to run nodes using another pool’s certificates (identity).

Why would both blocks be invalidated?

If one pool propagates two otherwise valid blocks, I believe this will result in a "height battle". One of those blocks will have its hash referenced by the next valid block, and the other is simply orphaned and discarded. No big deal, no need to punish that pool, and no need to assume foul play.

I’m not arguing that pools should run concurrent BPs. It would add some overhead to the network, although probably not that much. (Unless everybody did it…) But I just can’t find any clear technical explanation for why they shouldn’t do so, purely based on self-interest, or what the consequences are for doing it by mistake.

There are good reasons to understand this more deeply, particularly for anyone interested in developing a failover solution.

Case in point:

That’s another opportunity for consensus failure. What happens when the network is partitioned and the microservice isn’t reachable, or it otherwise fails?

If you think that’s unlikely, consider that the conditions that would prevent nodes from seeing each other are the same conditions that might prevent them from reaching the microservice to ask who the active BP should be. This is why failover solutions (and consensus protocols!) are nontrivial to implement in the first place, and why it’s useful to understand the consequences of a split-horizon situation, should it occur: you need to decide what the default behavior is should a node become blind to its peers.

If your node can’t reach its source of truth, should it assume it’s a BP or assume it’s a relay? Is it better to miss out on creating the block, or is it better to create the block with the understanding that it might be orphaned as a duplicate?

So far, I’ve been able to identify just one aspect of pool performance that influences rewards, with regard to network participation, and that’s making damned sure you don’t miss any blocks:

This leads me to believe that it’s better for your failover solution to err on the side of duplicating a block rather than missing it entirely.

Finally we get to discussing the failover approach itself, which was my intention when creating the thread :wink:

My approach is simply to not change anything if a node goes "blind". That means if BP1 is active and the microservice is not reachable any more, it stays active. If BP2, on the other hand, is able to connect to the microservice, the microservice will make it active and will tell BP1 the next time it is able to connect again (assuming that if BP2 is able to connect but BP1 isn't, BP1 has some networking or hardware issue).

To sum it up:
Both BPs not reaching the microservice, or the microservice is down → nodes continue as they are
One BP not reaching the microservice for >5 minutes (trying every 10s) → triggers a switchover

The switchover always first defines the new active node before telling the old active to become standby (a 5-second time difference, or whenever the old active is next able to connect). Based on the time required to restart a node, there should not be an overlap where both are active, but this could be further tuned to wait until the new active is really up as a BP before telling the old active to restart as a standby.
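
To make that flow concrete, here is a rough sketch of what the node-side polling loop could look like; the service URL, node name, role file and the activate-* scripts are hypothetical placeholders, and the actual 5-minute decision lives in the microservice, which treats these polls as heartbeats:

#!/bin/bash
# Sketch of a node-side polling loop (placeholders only, not the real code).
# The microservice tracks these polls as heartbeats and decides which node
# should be active; this script only applies whatever the service says.

SERVICE_URL="https://example.org/bp-role"   # hypothetical microservice endpoint
NODE_NAME="bp1"                             # identity of this node
POLL_INTERVAL=10                            # poll every 10 seconds
ROLE_FILE="$HOME/.bp-role"                  # remember the last known role

role=$(cat "$ROLE_FILE" 2>/dev/null || echo standby)

while true; do
  # Ask the microservice which node should currently be active.
  # On any failure, keep the current role: if blind, change nothing.
  active=$(curl -sf --max-time 5 "${SERVICE_URL}?node=${NODE_NAME}")

  if [[ -n "$active" ]]; then
    if [[ "$active" == "$NODE_NAME" && "$role" != "active" ]]; then
      role=active; echo "$role" > "$ROLE_FILE"
      ./activate-producer.sh    # placeholder: restart the node with keys injected
    elif [[ "$active" != "$NODE_NAME" && "$role" == "active" ]]; then
      role=standby; echo "$role" > "$ROLE_FILE"
      ./activate-standby.sh     # placeholder: restart the node without keys
    fi
  fi

  sleep "$POLL_INTERVAL"
done
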

:joy:

Not trying to threadjack, my point here is that the implications are actually very central to what you’re doing, depending on your goals.

It sounds like you’re aiming for “eventual failover”, allowing for some period of complete downtime, as opposed to ensuring you have one active BP at all times. Is that correct? If your BP fails, you’re OK with a potential five-minute (plus restart time) window where there is no BP running?

That sounds fine, or at least it’s better than having to do a failover manually. I think it’s important to recognize that if you get very unlucky you might miss a block in the time it takes to fail over, and that will impact your rewards. That’s probably unacceptable for bigger pools who are frequently assigned blocks.

What happens if the BP and the microservice are unreachable for an extended period of time?

No no, you are bringing the thread back to where I wanted it to be. Thank You!

Exactly! I'm trying to automate the manual failover, and I'm accepting minimal downtime. I think this is not avoidable in this scenario, as I need to wait for some time to be sure the active node is not just temporarily down, in order to avoid unnecessary failovers.

Agreed. There is still a risk. In a fully saturated pool, this risk is 1/120 (≈0.83%) per assigned block in the epoch, so it is a calculable risk which is far smaller than with a manual switchover.
Anyway, a fully saturated pool may prefer a live-migration-based approach or a network-based active/active setup like failoverd, if the network circumstances allow it. This is not the case for my architecture, because the 2 BPs are at completely different hosting providers.

If the microservice is not available to either BP, both will stay unchanged.
If only one BP is not able to connect to the microservice, my assumption is that this server is also not in a valid state, so the other BP will take over.

So there is a theoretical scenario with 2 Active Nodes when:

  • The current active BP loses its connection to the microservice
  • The current active BP is still healthy and connected to the Cardano network

In this case the standby would be promoted to active while the old active stays active until it is able to connect to the microservice again. I think the risk of this happening is very low. Why should the service be reachable for one and not for the other? It could happen because of firewall limitations which only apply to one of the servers, but that should not be something which happens without your noticing.

OK, got it. What benefit does the microservice provide vs. the nodes just polling each other directly? The microservice seems like a proxy/cache for node health, whereas the nodes could poll each other directly. And what’s the risk of a bug hiding somewhere in that 350 lines of JavaScript, resulting in exactly the wrong outcome? Does that risk (plus maintenance burden) negate the benefit of having a separate health-polling service?

Scenario 1 (design works as intended):

nodeB (relay) asks microsvc if nodeA (BP) is healthy
microsvc hasn’t heard from nodeA lately (or notices nodeA’s tip is stale), tells nodeB to become BP
nodeB becomes the active BP

(Fine, but what value is microsvc providing?)

Scenario 2 (:spider:):

nodeB (relay) asks microsvc if nodeA (BP) is healthy
microsvc has a weird bug, and incorrectly tells nodeB to become BP
nodeB becomes the active BP at the same time as nodeA

(Oh shit!)

Scenario 3:

nodeB hasn’t heard from nodeA lately (or notices nodeA’s tip is stale), and becomes the active BP

… Scenario 3 seems like a more robust solution, with the added benefit of not having to maintain a separate service.

This is why I harped so long on the dual-active BP topic. If the consequence of this system misbehaving is just an orphaned block, then that failure mode isn’t a big deal.

What am I missing?

Hey, are you assuming I’m creating buggy software? :wink:
Of course, there is always a risk of not considering something or having bugs in the code. So a custom solution is always a bigger risk compared to using some standard software.

The reason why I did not choose Scenario 3 is the following:
If nodeA is not able to talk to nodeB, then nodeB is also not able to talk to nodeA. So if both servers are actually alive but cannot talk to each other, they would both become active.
The risk of this happening is quite low, as it would only happen in a network-separation scenario between the nodes while both nodes are still able to connect to Cardano. For any other kind of issue, the approach would work as intended. And as you said, a dual-active state for some temporary time would not be a big problem; it just should not run this way for a long time.

But it did not feel right to me, and I wanted an approach with a 3rd party which only allows one server to be defined as active, even if the nodes are not able to talk to each other directly.

One other aspect which was relevant to me is that I did not want to allow any incoming connections. I'm just sending requests out, not opening any port on the BPs. But this of course would also not be a problem if the firewall makes sure those requests are only allowed from the other BP, maybe through a VPN (but then a VPN outage would lead to the above-mentioned network isolation scenario).

So I don’t think you are missing anything. Maybe I’m trying to make it more perfect than needed :wink:

Completely hypothetically! :grin:

The odds are at least as small that the microsvc will be able to talk to both nodes but the nodes not to each other, right? (So, what’s its value?)

Well, to minimize any attack surface, you could just open port 12798 between the nodes, and have them poll each other’s /metrics/ handler for cardano_node_metrics_slotNum_int to do roughly the same health check. You could do it in a couple lines of bash via cron, removing most of the complexity, and with approximately the same likelihood of system failure. (Hmm… I think I just convinced myself to go implement that with my own pool!)

Great. Yes, as you said, the practical difference seems to be negligible. Still, I will stick with my approach for the mentioned reasons. But it would be great if you share your scripts as well once you have them up. I think that only a small percentage of pools run the BP in such a scenario, and by sharing our approaches we can help make the network even more reliable!

Btw: are you aware of the failoverd approach? Maybe this would work for your setup as well, without any custom coding.

OK, here’s a very quick-and-dirty version in bash, meant to be run via cron every minute. It assumes that port 12798 is accessible on the peer. Not production-ready but you get the idea:

#!/bin/bash

remote=your-bp.your-pool.io
max_lag_seconds=300
max_lag_slots=100
time_fn=".$(basename $0).heartbeat"

# Test mode, so echo only. Replace with real scripts.
fail_cmd="echo 'failing over!'"
recover_cmd="echo 'recovering from failure!'"

warn() { echo "$@" >&2; }
fatal() { echo "$@" >&2; exit 1; }

# Determine if this machine has already taken over for its peer.
if [[ -z "$(curl -s localhost:12798/metrics | grep KESPeriods)" ]]; then
  status=relay
else
  status=core
fi

slot_local=$(curl -s localhost:12798/metrics | grep slotInEpoch \
    | awk '{print $2}')

test ! -z "$slot_local" || fatal "Couldn't get local tip."

slot_remote=$(curl -s ${remote}:12798/metrics | grep slotInEpoch \
    | awk '{print $2}')

if [[ -z "$slot_remote" ]]; then
  warn "Couldn't get remote tip."
else

  # Keep track of the last time the other node responded.
  date '+%s' > $time_fn

  if [[ $(expr ${slot_local} - ${slot_remote}) -gt $max_lag_slots ]]; then
    # Other node is too far behind, take over as core.
    exec $fail_cmd
  else
    if [[ $status = "core" ]]; then
      # Other node has come back online, demote self to relay.
      exec $recover_cmd
    fi
  fi

fi

if [[ -f $time_fn ]]; then

  if [[ $(expr $(date '+%s') - $(cat $time_fn)) -gt $max_lag_seconds ]]; then
    # Other node has been unreachable for too long; become core.
    exec $fail_cmd
  fi
  # Otherwise the peer was seen recently: keep the current role, and let the
  # reachable-peer check above handle demotion once it responds again.

fi

# If we've made it this far, neither failover nor recovery has exec'd, so
# we're either in active standby (as relay) or active failover mode (as core).
echo "OK (state: $status)"

Hadn’t seen that yet. Looks like a Mac OS X thing? Also looks like it does reachability only, so def not as snazzy as your idea.