#39018 [BC-Insight] Rate Limiting Under-Specification and Consequences
Submitted on Jan 20th 2025 at 13:52:20 UTC by @csludo for Attackathon | Ethereum Protocol
Report ID: #39018
Report Type: Blockchain/DLT
Report severity: Insight
Target: https://github.com/ethereum/consensus-specs
Impacts:
Unintended chain split affecting less than 25% of the network (Network partition)
Description
Brief/Intro
The rate limiting mechanism in current implementations is under-specified, leading to inconsistent behavior and unintended consequences across clients. This lack of clarity results in network instability, degraded performance, and forks, as demonstrated below. For example, an honest Lighthouse client will disconnect from an honest Prysm client because the two clients chose different rate-limiting parameters and use different rate-limiting error codes.
Vulnerability Details
Rate limiting is a critical component of a decentralized network, ensuring that nodes are not overwhelmed by misbehaving peers, malicious actors, or excessive traffic. Such a component needs to be well defined and consistently implemented across clients to avoid incompatible behavior.
Rate limiting, as implemented across various clients, relies on generic algorithms such as the leaky bucket model. However, the specification provides insufficient detail about key parameters and their expected behavior (see the sketch after this list), resulting in:
Ambiguity in Configuration: Different clients implement different rates and limits, leading to inconsistency in how rate limiting is applied.
Lack of Standardization: Critical aspects, such as whether limits should be peer-specific or topic-specific, are not clearly defined.
Undefined Errors: There is no guidance on how clients should handle peers that temporarily exceed limits, leading to excessive disconnections. Additionally, peer score adjustments can misinterpret rate limiting errors as generic RPC errors, further exacerbating the problem. Standardizing error codes for rate limiting would help mitigate this issue.
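To make these open parameters concrete, here is a minimal token-bucket limiter in Go, loosely in the style the clients use. Everything in it (the names bucket, refillRate, capacity, the unit cost of a request, and the keying of limits) is our own illustration rather than any client's code; these are precisely the knobs the specification leaves to each implementation.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a minimal token bucket. Every knob below is left open by the
// spec, so each client picks its own values: refill rate, burst capacity,
// and what a "unit" of cost means for a given request.
type bucket struct {
	tokens     float64
	capacity   float64 // burst size -- unspecified
	refillRate float64 // tokens per second -- unspecified
	last       time.Time
}

// allow charges `cost` units and reports whether the request fits.
// What to do on failure (disconnect, delay, or error) is also unspecified.
func (b *bucket) allow(cost float64, now time.Time) bool {
	b.tokens += now.Sub(b.last).Seconds() * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens < cost {
		return false
	}
	b.tokens -= cost
	return true
}

func main() {
	// Keying is unspecified too: per peer, per topic, or per (peer, topic)?
	limits := map[string]*bucket{
		"beacon_blocks_by_range": {tokens: 10, capacity: 10, refillRate: 5, last: time.Now()},
	}
	ok := limits["beacon_blocks_by_range"].allow(1, time.Now())
	fmt.Println("request allowed:", ok)
}
```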
This is especially problematic because these issues disproportionately affect well-connected peers with low latency. These peers are among the most important for maintaining network stability and efficient communication.
For example, Lighthouse clients frequently face disconnections when interacting with Prysm nodes due to Prysm's restrictive rate-limiting configuration. The rpc-limiter-topic in Prysm only allows 5 request units per second (with a capacity of 10 units). When Lighthouse attempts to sync, its requests often exceed this limit, resulting in disconnections. Note that this happens even though Lighthouse rate-limits its own outgoing requests. This leads to an inability to maintain connectivity and highlights the need for standardized rate-limiting practices across clients. The PoC below demonstrates the impact of this under-specification and the resulting network fork.
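To see why a self-limited requester still trips this limit, consider a back-of-the-envelope simulation. It reuses the Prysm numbers quoted above (refill 5 units/s, capacity 10); the requester's outbound self-limit of 10 requests/s is a hypothetical value, chosen only to show that any self-limit above the responder's refill rate eventually drains the bucket.

```go
package main

import "fmt"

func main() {
	const (
		refillRate  = 5.0  // responder refill, units/s (Prysm default per this report)
		capacity    = 10.0 // responder burst capacity
		requestRate = 10.0 // requester self-limit, req/s -- hypothetical, but > refillRate
		step        = 0.1  // simulation tick, seconds
	)
	tokens := capacity
	sent, rejected := 0, 0
	for t := 0.0; t < 5.0; t += step {
		tokens += refillRate * step
		if tokens > capacity {
			tokens = capacity
		}
		for i := 0; i < int(requestRate*step); i++ {
			sent++
			if tokens >= 1 {
				tokens -= 1
			} else {
				rejected++ // responder rate-limits an honest, self-limited peer
			}
		}
	}
	fmt.Printf("sent %d, rejected %d\n", sent, rejected)
}
```

Once the initial burst capacity is spent, the responder can sustain only its refill rate, so a substantial fraction of the requester's traffic is rejected even though both peers are honest and both are rate limiting.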
Impact Details
The under-specification of rate limiting has significant consequences:
Inconsistent Interoperability: Clients implement varying thresholds, leading to unpredictable behavior when nodes communicate across implementations.
Unintended Disconnections: Well-connected peers with low latency may face frequent disconnections, as they inadvertently exceed restrictive limits.
Network Partitioning: Excessive disconnections can isolate nodes, particularly new joiners, from the network. This increases the risk of forks and reduces overall network stability.
Degraded Performance: Disconnecting from well-connected peers negatively impacts propagation times and the efficiency of message delivery.
These issues are exacerbated when multiple clients with differing rate-limiting behaviors operate on the same network, further increasing the likelihood of partitions and degraded consensus.
References
To highlight our point, we reference code from different clients showing how differently rate limiting is handled. The versions referenced are the latest stable versions at the time of writing.
Prysm
Prysm does both inbound and outbound rate limiting.
Inbound: https://github.com/prysmaticlabs/prysm/blob/v5.2.0/beacon-chain/sync/rate_limiter.go#L65
For inbound rate limiting, it aggregates some topics together and also keeps a separate limit for all RPC requests from one peer: https://github.com/prysmaticlabs/prysm/blob/v5.2.0/beacon-chain/sync/rate_limiter.go#L37
The inbound values differ from the Lighthouse values. Here are some default values: https://github.com/prysmaticlabs/prysm/blob/v5.2.0/cmd/beacon-chain/flags/base.go#L184-L193
Outbound rate limiting: https://github.com/prysmaticlabs/prysm/blob/v5.2.0/beacon-chain/sync/initial-sync/blocks_fetcher.go#L136
The Prysm error message: https://github.com/prysmaticlabs/prysm/blob/v5.2.0/beacon-chain/p2p/types/rpc_errors.go#L12
Lighthouse
Lighthouse does both inbound and outbound RPC rate limiting: https://github.com/sigp/lighthouse/blob/v6.0.1/beacon_node/lighthouse_network/src/rpc/mod.rs#L171
Lighthouse uses different values compared to Prysm: https://github.com/sigp/lighthouse/blob/v6.0.1/beacon_node/lighthouse_network/src/rpc/config.rs#L102
Lighthouse sends a different type of error message: https://github.com/sigp/lighthouse/blob/v6.0.1/beacon_node/lighthouse_network/src/rpc/methods.rs#L603
Lighthouse also separates out the case of requests being too large: https://github.com/sigp/lighthouse/blob/v6.0.1/beacon_node/lighthouse_network/src/rpc/mod.rs#L410
Lodestar
Lodestar does inbound rate limiting per peer and protocolID: https://github.com/ChainSafe/lodestar/blob/v1.25.0/packages/reqresp/src/rate_limiter/ReqRespRateLimiter.ts#L57
It takes its configuration values from Lighthouse: https://github.com/ChainSafe/lodestar/blob/v1.25.0/packages/beacon-node/src/network/reqresp/rateLimit.ts#L7
However, the referenced link and these configuration values are outdated. Additionally, Lodestar implements a total rate limiter across all connections, which is currently unused: https://github.com/ChainSafe/lodestar/blob/v1.25.0/packages/reqresp/src/rate_limiter/ReqRespRateLimiter.ts#L15
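The references above show that clients differ not only in their limits but also in how they signal rate limiting. The following Go sketch illustrates the resulting failure mode in the abstract: the error strings, the error code, and the penalty value are invented for illustration and do not correspond to any client's actual constants.

```go
package main

import "fmt"

// response is a simplified req/resp error reply. The spec defines generic
// error codes but no dedicated, standardized code for "rate limited".
type response struct {
	code    int
	message string
}

// knownRateLimitMsgs is what this hypothetical requester expects a
// rate-limit error to look like; another client's wording will not match.
var knownRateLimitMsgs = map[string]bool{"rate limited by this client": true}

// handleError mimics a requester that penalizes peers for failed RPCs.
func handleError(resp response, score *float64) {
	if knownRateLimitMsgs[resp.message] {
		// Recognized throttling: back off, no penalty.
		fmt.Println("backing off, peer kept")
		return
	}
	// Unrecognized rate-limit wording is scored like any other RPC failure.
	*score -= 1.0 // illustrative penalty value
	fmt.Printf("penalized peer, score now %.1f\n", *score)
}

func main() {
	score := 0.0
	// The responder is honest but uses its own rate-limit error string.
	handleError(response{code: 2, message: "rate limited"}, &score)
}
```

Because the wording and code of the error are client-specific, an honest responder's throttling is indistinguishable from misbehavior on the requester's side; repeated penalties drive the peer score down until the connection is dropped.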
Proof of Concept
To emphasize the consequences of the under-specification, we provide a PoC in which rate limiting between two clients leaves two honest nodes unable to communicate properly after a network event.
To replicate this issue, you can simulate a local network and introduce a temporary network split for a few minutes. This will force nodes to resynchronize, triggering the rate limiting behavior. We recommend using unmodified, stable versions of nodes for this test. The most efficient way to perform this test is by using attacknet (https://github.com/crytic/attacknet).
Set up attacknet following the repository's instructions, then create two config files, test-suites/poc_config.yaml and network-configs/poc_net_config.yaml, with the content below, and run the simulation using ./attacknet start poc_config.
During the simulation, you can observe the network with Dora (deployed as part of this setup). You will notice that the fork remains unresolved because the nodes fail to communicate properly due to rate limiting. To confirm that rate limiting is the root cause, repeat the same simulation without rate limits. The logs also show the rate limiting by Prysm and the subsequent peer score decrease by Lighthouse, which interprets it as an RPC error. All of this prevents the clients from maintaining proper communication.
Note that this happens even though Lighthouse limits its outgoing requests because, as argued above, the configuration parameters are not standardized.
test-suites/poc_config.yaml:

network-configs/poc_net_config.yaml: