34093 - [BC - Critical] lib-net can be used to force oom reap of shardu...

lib-net: can be used to force oom reap of shardus instances

Submitted on Aug 5th 2024 at 18:58:30 UTC by @riproprip for Boost | Shardeum: Core

Report ID: #34093

Report type: Blockchain/DLT

Report severity: Critical

Target: https://github.com/shardeum/shardeum/tree/dev

Impacts:

  • Network not being able to confirm new transactions (total network shutdown)

  • RPC API crash affecting projects with greater than or equal to 25% of the market capitalization on top of the respective layer

Description

Brief/Intro

Disclosure: I wasn't sure which severity level / impact you would want to classify this as.

The bug can be used to crash shardus instances.

Used intelligently, it can sometimes crash all the instances (without them getting restarted). With a more elaborate setup, it can also be used to crash other processes on the OS.

I would find that to be severe ...

Vulnerability Details

https://github.com/shardeum/lib-net/blob/2832f1d4c92a3efb455239f146567f21fd80e4cb/shardus_net/src/shardus_net_listener.rs#L95 allows attacker-controlled allocations on the system.

On some OSes that alone already triggers the crash.

On other OSes the allocations themselves usually get delayed. But once data is actually read in at L106 of the same file, the allocator has to pull in more and more pages until the OOM reaper gets triggered.

What the OOM reaper does, and when exactly it gets triggered, is again highly dependent on the physical configuration of the box and on the OS. It does, however, always end with a process on the system being killed.
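The pattern described above, a length taken from the wire and used directly as an allocation size, and one possible guard against it, can be sketched as follows. This is a minimal illustration and not lib-net's actual code: the 4-byte big-endian length prefix, the `MAX_MSG_LEN` cap and the `read_bounded_message` helper are all assumptions made for the example.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical upper bound for a single message (assumption, not a lib-net constant).
const MAX_MSG_LEN: usize = 16 * 1024 * 1024;

/// Reads one length-prefixed message and rejects oversized lengths
/// BEFORE allocating, so the peer cannot dictate the allocation size.
fn read_bounded_message<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    // Assumed framing: 4-byte big-endian length, then the payload.
    let mut len_bytes = [0u8; 4];
    reader.read_exact(&mut len_bytes)?;
    let len = u32::from_be_bytes(len_bytes) as usize;

    // Without this check, `vec![0u8; len]` would request up to ~4 GiB
    // on behalf of whoever sent the prefix.
    if len > MAX_MSG_LEN {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("message length {len} exceeds limit {MAX_MSG_LEN}"),
        ));
    }

    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    // A well-formed frame: length 5 followed by 5 payload bytes.
    let mut frame = 5u32.to_be_bytes().to_vec();
    frame.extend_from_slice(b"hello");
    let mut ok = Cursor::new(frame);
    assert_eq!(read_bounded_message(&mut ok).unwrap(), b"hello".to_vec());

    // A hostile frame: only a prefix claiming u32::MAX bytes.
    let mut evil = Cursor::new(u32::MAX.to_be_bytes().to_vec());
    let err = read_bounded_message(&mut evil).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::InvalidData);
    println!("oversized frame rejected: {err}");
}
```

With a cap like this the oversized request surfaces as an ordinary error that the caller can handle, instead of an allocation the kernel has to back with real pages. What a sensible limit is depends on the protocol.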

Different tested configs:

  • On a box with 1 GB of physical RAM I have seen crashes after sending 2512 KB of data. Those crashes, however, couldn't reliably reap the restarting process that "shardus start" apparently runs in the background, so that process starts a new shardus instance. With more tinkering it is likely possible to reap the restart process as well.

  • On a box with 8 GB of RAM I have seen crashes after sending much closer to 8 GB of data. Sending data in parallel to multiple instances (in theory) reduces the total number of bytes that need to be sent. (Again, this depends on the allocation algorithms used by the OS.) Notably, those crashes reliably reap the "restart" (I think you call it pm?) process on my Linode box, leaving no shardus instance running.

  • To widen the impact: in theory one could allocate close to the maximum amount of RAM on the system and wait until any other process on the system allocates RAM. That process would then be somewhat likely to get reaped on modern Linux systems. (Not a Mac guy ...)

If somebody brings up the amount of traffic as a reason to downgrade this bug: I saw references to Gzip and Brotli. Both are extremely efficient at compressing repeating patterns. For reference, compression rates with the older Gzip:

  • 1 GB is 1 MB in traffic

  • 14 GB is 7 MB in traffic

  • 64 GB is ~64 MB in traffic

Proof of concept

Simple POC

I have two POCs. The simpler one shows that JavaScript using lib-net can't recover from the allocation problems. The error does not bubble up to JavaScript to be handled. The process just dies.

Please note that the server output of memory allocation of 4294967295 bytes failed does not represent the actual amount of data sent. It is just the requested allocation that the kernel finally has to fulfill after receiving ~2512 KB of data.
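As a quick check on that figure: 4294967295 is exactly `u32::MAX`, which is what a 4-byte length prefix with every bit set decodes to. That the prefix is four bytes wide is an assumption here, and the helper names below are hypothetical. The wording of the message also matches what Rust's default allocation-error handler prints before aborting the process, which would be consistent with nothing reaching JavaScript.

```rust
/// Decodes an assumed 4-byte big-endian length prefix.
fn requested_len(prefix: [u8; 4]) -> u32 {
    u32::from_be_bytes(prefix)
}

/// Converts a byte count to GiB for display.
fn as_gib(bytes: u32) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}

fn main() {
    let requested = requested_len([0xFF; 4]);
    assert_eq!(requested, u32::MAX);
    // Four bytes on the wire are enough to request just under 4 GiB.
    println!("{} bytes = {:.2} GiB", requested, as_gib(requested));
}
```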

Before you make me document the more complicated one doing the "oom reaper thing to shardeum and the instances" please keep in mind that your process.on('SIGINT|SIGTERM') handlers all just die.

You can, however, use the net_attack code against your shardus start N instances. Depending on your configuration, different things will happen, as explained above. I am happy to go into detail on how to maximize impact.

server

save as test.js

run

attacker

run

save as src/main.rs

run

output

server

attacker

dmesg
