Hacker News

CVE-2019-5736: runc container breakout (seclists.org)

240 points | afshinmeh posted 2 months ago | 101 Comments
NathanKP said 2 months ago:

Amazon employee here: we have released a security bulletin covering how to update to the latest patched Docker on Amazon Linux, Amazon ECS, Amazon EKS, AWS Fargate, AWS IoT Greengrass, AWS Batch, AWS Elastic Beanstalk, AWS Cloud9, AWS SageMaker, AWS RoboMaker, and AWS Deep Learning AMI.

Please check out the bulletin and update if you are using one of these services.


tristanz said 2 months ago:

As far as I understand, EKS doesn't support PodSecurityPolicy yet so any user that can launch a pod can trivially root the host via host mounts already. This surprisingly isn't clearly documented.

NathanKP said 2 months ago:

ECS doesn't have a top level resource called "PodSecurityPolicy" but we do provide task level configuration options for all the major settings that you would normally put in your pod security policy, including adding and dropping capabilities, privileged or unprivileged mode, docker security options for controlling SELinux or AppArmor, ulimits, sysctl settings, among others. You can find all these configuration options and more documented here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...

It is definitely possible to prevent a task running in ECS from getting root access to the host. If there is something missing that you feel we need to add to ECS to better enable this, definitely reach out and let me know!

tristanz said 2 months ago:

I'm referring to EKS here not ECS. EKS doesn't yet enable the PodSecurityPolicy admission controller, so any user that can launch a pod via EKS can root the EKS cluster regardless of RBAC rules. The main ask here is to just find a way to enable PodSecurityPolicy admission controller so that secure multi-user EKS clusters are possible like ECS.

sethvargo said 2 months ago:

Hey all - Seth from Google here. Please let us know if you have any questions regarding GKE or questions about the upgrade process. I'm also happy to escalate any feedback to our internal product and engineering teams.

Here's a link to our security posting with more information and upgrade procedures: https://cloud.google.com/kubernetes-engine/docs/security-bul...

sethvargo said 2 months ago:

One thing I want to emphasize: you are only affected if you're using Ubuntu base images for your node pools. If you're using COS, you are unaffected.

numbsafari said 2 months ago:

It would be great if folks could run COS and leverage tools like Falco. I realize there is a trade-off between having a totally locked down OS and being able to flexibly use such tools.

However, Google and Sysdig announced a partnership around Falco and GCSCC integration. It would make sense that such a tool would be able to be run on COS.

Perhaps I'm guilty of wanting to have my cake and eat it, too. But this seems like an area where GKE and COS are somewhat limited.

markstemm said 2 months ago:

Hi, Falco developer here. We do have support for running falco with an ebpf program taking the place of the kernel module. You can learn more about ebpf support at https://github.com/draios/sysdig/wiki/eBPF, and you should be able to run falco with ebpf by setting an environment variable SYSDIG_EBPF_PROBE="".

numbsafari said 2 months ago:

This is awesome news! I see it’s still beta, which is probably why the Falco docs still say GKE users must run Ubuntu images. Adding this to my tracking list. Thanks.

random_copy said 2 months ago:

Falco can read events via an ebpf program (instead of the falco-probe kernel module).


So, falco will work on GKE and COS

jhealy said 2 months ago:

I'm interested in why COS is unaffected - is it due to the read-only root filesystem?

tssuser said 2 months ago:


wicket said 2 months ago:

> However, it is blocked through correct use of user namespaces (where the host root is not mapped into the container's user namespace).

In other words, this won't affect anyone who understands the implications of running a process as root. Unfortunately, the sad truth is that most people I've come across who have "lots of experience" with implementing Docker containers, do not even understand the basics of how they work, let alone the implications of root access. I've interviewed candidates who claim to know Docker but can't even tell me how Docker differs from traditional virtualisation or how it achieves its isolation. The best explanation that most of them come up with is, "Docker containers are more lightweight".

This sort of vulnerability should have been a non-issue but it has gained attention due to the sheer amount of incorrectly configured containers in the wild. This was an accident waiting to happen, and I doubt we've heard the last of this sort of thing.

geofft said 2 months ago:

I think it's subtler than that. It is mostly safe to run a contained process as "root" because in theory the ways that root access can be exercised are highly sandboxed by the use of various namespaces, as well as things like capability restrictions (you generally don't have CAP_SYS_ADMIN or a few others), limited syscall attack surface (you generally have a syscall allowlist via seccomp-bpf), etc. Yes, it's wrong to not understand that the runc process runs as root. But I think it's only very slightly less wrong to claim that a process inside the container has root access in the way that, say, ssh root@host-system has root access. It mostly does not, and this vulnerability is notable precisely because it's one of the rare ways to exercise that root privilege outside the container.

We looked at this at $work and got into a serious rabbit hole about how exactly Linux capabilities work. I think if I started asking interviewees to explain permitted vs. effective capability sets and how file and process capabilities differ, I'd never hire anyone. (And I think to figure out yourself how to "correctly configure" a container, you need to have at least some understanding of that.)
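
The permitted vs. effective distinction mentioned above can be inspected from the `CapPrm` and `CapEff` lines of `/proc/<pid>/status`. The following is a minimal sketch, not a full capability decoder: the helper names are hypothetical, only a handful of capability bits are listed (numbered as in `linux/capability.h`), and the sample masks in the usage note are illustrative values rather than output from any particular container.

```python
# Hypothetical helpers for illustration; only a few capability bits are named.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}


def parse_cap_sets(status_text):
    """Return a dict such as {'CapPrm': int, 'CapEff': int} parsed from
    the text of /proc/<pid>/status (values there are hex bitmasks)."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets


def has_cap(mask, bit):
    """True if capability number `bit` is set in the bitmask."""
    return bool((mask >> bit) & 1)


def named_caps(mask):
    """List the names (from CAP_NAMES only) of capabilities set in mask."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if has_cap(mask, bit)]
```

For example, feeding it the contents of `/proc/self/status` and comparing `named_caps(sets["CapPrm"])` with `named_caps(sets["CapEff"])` shows which capabilities a process may raise versus which ones are currently active. A process can hold a capability in its permitted set while its effective set is empty.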

wicket said 2 months ago:

> I think it's subtler than that. It is mostly safe to run a process as "root" because in theory the ways that root access can be exercised are highly sandboxed by the use of various namespaces, as well as things like capability restrictions (you generally don't have CAP_SYS_ADMIN or a few others), limited syscall attack surface (you generally have a syscall allowlist via seccomp-bpf), etc. Yes, it's wrong to not understand that the runc process runs as root. But I think it's only very slightly less wrong to claim that runc has root access in the way that, say, ssh root@host-system has root access. It mostly does not, and this vulnerability is notable in that it's one of the few ways to exercise that root privilege.

> We looked at this at $work and got into a serious rabbit hole about how exactly Linux capabilities work. I think if I started asking interviewees to explain permitted vs. effective capability sets and how file and process capabilities differ, I'd never hire anyone. (And I think to figure out yourself how to "correctly configure" a container, you need to have at least some understanding of that.)

I think you've hit the nail on the head. I only ask those interview questions because I believe it's important to find out just how much a candidate understands. I have to admit I've let some of these things go, otherwise I'd never hire anyone either. I think in the end what it comes down to is that Docker is an ambitious project that is somewhat flawed from a security perspective. There have been numerous namespace vulnerabilities to date and I expect there will be plenty more found in the future.

nicoburns said 2 months ago:

I believe there was work on non-root container host processes going on at some point? Did that ever get to a usable state?

geofft said 2 months ago:

Rootless containers using user namespaces was merged into runc in March 2017: https://github.com/opencontainers/runc/pull/774

A week ago Docker gained support for running dockerd as non-root: https://github.com/moby/moby/pull/38050

And there is this project for running Kubernetes as non-root: https://github.com/rootless-containers/usernetes

cyphar said 2 months ago:

Yes, it's in a usable state. Docker just merged a PR that allows you to run it rootless[1]. Props to Akihiro for bringing this one over the line, I didn't think it'd be possible three years ago.

[1]: https://github.com/moby/moby/commit/ec87479b7e2bf6f1b5bcc657...

cyphar said 2 months ago:

> I think it's subtler than that. It is mostly safe to run a contained process as "root" because in theory the ways that root access can be exercised are highly sandboxed by the use of various namespaces, as well as things like capability restrictions (you generally don't have CAP_SYS_ADMIN or a few others), limited syscall attack surface (you generally have a syscall allowlist via seccomp-bpf), etc.

I disagree. It is definitely "safer" than running as root on the host -- but that shouldn't be the baseline. Honestly, I would argue that not using user namespaces (or non-root inside the container) is basically negligence at this point. Yes, it's annoying to do it with Docker, but other runtimes have solved this problem.

LXC actually explicitly states that privileged (non-userns) containers are fundamentally unsafe and I agree. When you look at the wide array of ns_capable and other userns checks that protect against all sorts of attacks, you really start to not trust anything that doesn't use user namespaces. Kernel developers assume that container runtimes are using user namespaces if they are trying to secure something.

Additionally, yes capabilities help. But you still have traditional Unix DAC issues (which is what is leveraged here).

> and this vulnerability is notable precisely because it's one of the rare ways to exercise that root privilege outside the container.

I would argue there's several very foundational security problems (which I'm trying to fix) that are made significantly worse by running a container process as root. Please don't do it.

geofft said 2 months ago:

Can you use user namespaces with a released version of Docker or Kubernetes? I think the answer is no? (Or can you do it with Kubernetes + some other container runtime?)

We run Kubernetes + Docker with a policy of no root inside containers (we map you to your normal UID inside the container), but most of the rabbit hole we got into was trying to figure out the implications of setuid binaries inside the container. It seems like on a normal system, setuid binaries inside a container do in fact get host root unless you tell Docker to drop those capabilities, and also fscaps aren't really usable in a container image, I think.

cyphar said 2 months ago:

User namespaces have been supported in Docker since 1.10. I don't think that it's necessarily "supported" in Kubernetes -- there was a KEP to add support last year but it's still a while away.

As for setuid and fscaps, they both work in containers (container images can contain them but some filesystems don't support xattrs such as AUFS). And yes, without user namespaces, they escalate to host root. You can use no_new_privs which blocks things like setuid or fscap but it also can cause problems (though I think you can enable no_new_privs in Kubernetes).
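
To make the setuid and no_new_privs points concrete, here is a minimal sketch under stated assumptions: the function names are hypothetical, the scan only looks at the setuid mode bit on regular files (it does not inspect file capabilities stored in xattrs), and the no_new_privs check relies on the `NoNewPrivs` field that newer kernels expose in `/proc/<pid>/status`.

```python
import os
import stat


def find_setuid_files(root):
    """Yield paths of regular files under `root` that have the setuid bit set."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(mode) and mode & stat.S_ISUID:
                yield path


def no_new_privs_enabled(status_text):
    """Return True/False from the NoNewPrivs line of /proc/<pid>/status text,
    or None if the field is absent (for example on older kernels)."""
    for line in status_text.splitlines():
        if line.startswith("NoNewPrivs:"):
            return line.split(":", 1)[1].strip() == "1"
    return None
```

Running `find_setuid_files` over an unpacked image root filesystem gives a rough list of binaries that could escalate to UID 0 inside the container. When `NoNewPrivs` is 1, executing such a binary does not grant the extra privileges.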

ofrzeta said 2 months ago:

OpenShift (Kubernetes distro) is running all pods with random UIDs, and running a pod/container as root requires special privileges that only a cluster admin can grant. Also OpenShift is not vulnerable to this attack due to the use of SELinux which is mandatory for OpenShift installations.

cpuguy83 said 2 months ago:

The thing is, you would have to disallow running as root in the system itself, including suid binaries and other things (e.g. docker exec -u root).

User namespaces force this. It's not just about not running as root, it's making sure the container cannot map to the real UID 0.

tyingq said 2 months ago:

Docker runs privileged by default, doesn't it? Seems unfair to put all the blame on the end users of it.

SAI_Peregrinus said 2 months ago:

I've done pretty much nothing with Docker (we use it at work to ensure consistent builds, all I do involving it is run the shell script that uses it to build the code.)

I'd say Docker is a rather spiffed-up version of BSD chroot jails, with a repository and more than just filesystem isolation. Jails for all the things.

Probably just as wrong as the "lightweight virtualization" but I found it interesting that people claiming to have experience with Docker would immediately compare it with virtualization instead of jails. Not enough BSD?

gr2020 said 2 months ago:

Looks like Docker 18.09.2 was released a few minutes ago to address this: https://github.com/docker/docker-ce/releases

CaliforniaKarl said 2 months ago:

Red Hat’s page on the vulnerability: https://access.redhat.com/security/vulnerabilities/runcescap...

RH CVE page, with the vulnerability’s metrics and the list of RH packages affected (plus links to the errata pages that have details on fixed builds): https://access.redhat.com/security/cve/cve-2019-5736

achillean said 2 months ago:

There are nearly 4,000 exposed Docker daemons: https://www.shodan.io/report/ol761bRb

spydum said 2 months ago:

If your dockerd is exposed without something like mutual TLS auth, RCE is a feature! Some little container breakout is irrelevant (I’m sure you know, but others might be confused)

nineteen999 said 2 months ago:

So inexperienced developers are just as likely to reach for Docker as experienced ones?

I wonder how many of those 4,000 docker daemons are running/managing containers of dubious origin.

achillean said 2 months ago:

Around 10% of them are running cryptominers so it looks like there are already people out there compromising these public Docker instances.

miguelmota said 2 months ago:

For better isolation check out KataContainers: https://github.com/kata-containers/runtime

It's a drop-in replacement for runc. With KataContainers it runs docker containers in a lightweight VM so you get all the security benefits of a VM. The downside is slightly slower container start up times and might not work in nested virtualized environments.

bpye said 2 months ago:

gVisor is also pretty neat, they say KVM support is experimental though: https://github.com/google/gvisor

gVisor is used behind Go 1.11 on App Engine so Google must be fairly confident that it's a sufficient security boundary though I'm fairly sure they don't use the public KVM isolation so YMMV.

miguelmota said 2 months ago:

gVisor is a kernel implemented in userspace. The one downside of gVisor is that not all syscalls are implemented and they're relying on the community to implement them. This is what was holding me back from adopting it for a project.

yjftsjthsd-h said 2 months ago:

Did AWS's firecracker ever get to the point of being drop-in compatible? They were also doing containers-in-VMs.

justanother- said 2 months ago:

Alternative idea: throw away docker and katacontainers and move to freebsd, where jails were introduced on 14 Mar 2000 (no, seriously, superior technology exists for 19 years - stable, time proven, working).

Some more info: https://www.freebsd.org/doc/handbook/jails.html

And for quick start: https://github.com/iocage/iocage

ajross said 2 months ago:

Jails are virtually identical technology to Linux containers from a security point of view. They've had holes before and they likely will again, and a breakout like this (seems like the root cause here is a writable file descriptor to the host binary) can absolutely compromise the host system.

The upthread recommendation was using hardware VM technology, which is a fundamentally different isolation model from what software can provide and (at least in theory) makes that kind of exploit impossible. And while there are tradeoffs with everything, for you to throw that argument out due to personal platform loyalty is really, really bad advice.

int_19h said 2 months ago:

My understanding is that jails were designed as a security boundary from the get go, unlike containers. Wouldn't that result in code that's less likely to be exploitable?

ajross said 2 months ago:

FWIW, "containers" aren't a thing. Namespaces, cgroups et al. certainly were designed with security in mind, as was docker/runc.

Look, this isn't about whether jails are secure containers or not. I'm sure they're great. It's that responding to "if you want more isolation, try hardware virtualization" with "FreeBSD is just better because 19 years!" is not really engaging with the argument as framed.

justanother- said 2 months ago:

Sure, but there were 19 years of time proofing them. Each product has vulnerabilities which get weeded out when time passes. And for kata and docker, in context of what they are used for, they are bleeding edge.

(from a technical perspective, you would be running jails for years too - so much about platform loyalty)

geofft said 2 months ago:

Vulnerabilities don't get weeded out by time like radioisotopes decaying. Vulnerabilities get weeded out by attention, and attention happens when people use a system in production to protect a high-value target.

Jails haven't been used to protect as many high-value targets as Linux containers have. This is not a comment on the technical quality of jails. It may well be a comment on the world's anti-FreeBSD prejudice. But either way it's still true, and that means the 19 years of existence didn't magically harden the product.

SteveNuts said 2 months ago:

> Jails haven't been used to protect as many high-value targets as Linux containers have

This is not true in my experience at all. It may be true that it hasn't been in use at startups until Docker came out, but a few large, established companies I've worked at absolutely used Jails or Zones to protect their most valuable IP. And have been for a long time.

geofft said 2 months ago:

What was the attack surface of the jails/zones? I don't think the distinction here is startup vs. large company but internal vs. external. We used jails at my last company as a last line of defense (and, full disclosure, I wrote about 100 lines of code to use unshare(1) etc. when that machine was our last FreeBSD box remaining in our Linux conversion), but it was on a non-internet-accessible server where the jailed network connection was routed only to a single other (much larger) business that we had an established relationship with. If attacker code were executing inside the jail, there was already a serious breach.

The distinction here is that people are running containers in the cloud and also often running untrusted code (e.g. vendor software, random exciting open-source things) inside containers, and collocating those with high-value targets in other containers. And large, established companies are doing that now just as much as startups are.

barbecue_sauce said 2 months ago:

Is there a centralized jail image repository with images of jails running popular open source software applications that I can search from the command line and spin up locally or in a cluster with a few commands? Can I easily replicate and distribute an image of a jail? Because that is what Docker offers.

peterwwillis said 2 months ago:

The major use case for Docker is really as a massively simplified package manager and an entrypoint for distributed applications. Other features, like quicker runtime than VMs and system isolation, are just icing on the cake.

It took a massive marketing campaign to get people to use Docker and realize it made their life easier, so something like iocage would need the same push. (Also, nobody wants to start adopting additional OSes unless absolutely necessary)

halbritt said 2 months ago:

Immutability is pretty cool. Also the sibling to that where you're running the same immutable artifact in all of your environments.

Any kind of isolation is just icing on the cake.

jeswin said 2 months ago:

There have been vulnerabilities in jails previously. Also, Linux gets far more attention from exploit researchers because of wider adoption - so the number of incidents isn't a good metric. Kata has hardware isolation, so will be safer.

If I have misunderstood jails and it's immune to kernel exploits please do correct me.

justanother- said 2 months ago:

Sorry, I am not going into endless debates. It is a waste of my time. Check the documentation, read about it; the technology has been here for 19 years. I don't care what anyone uses. I have just stated what the most reliable technology for compartmentalization is. Docker and kata have, in IT time, been around since yesterday and there are lots of dragons still hiding. Same goes for integration with ZFS (part of freebsd since 6 April 2007).

v_lisivka said 2 months ago:

Linux VServer project started in 2003: http://linux-vserver.org/ChangeLog-1.2 . We used it in production more than 10 years ago at multi-terabyte site (Bazaarvoice).

nisa said 2 months ago:

...or Solaris Zones / illumos Zones (2005) - you can even run Docker on them.

howiroll said 2 months ago:

19 years without vulnerability because no one uses it seriously. Seriously. Deal with it.

justanother- said 2 months ago:

Taken from some other thread: "Most "unix" admins only know linux and will advocate for it vigorously because it is so much better than.. "what do you use again? Fedora? Ah, FreeBSD, something with F, I knew it!""

duncaen said 2 months ago:

Maybe they just want exploit mitigation techniques like ASLR.

CaliforniaKarl said 2 months ago:

Debian’s security tracker, showing the affected versions, and (when available) the fixed versions: https://security-tracker.debian.org/tracker/CVE-2019-5736

And Ubuntu’s: https://people.canonical.com/~ubuntu-security/cve/2019/CVE-2...

Personally, I like these vs. RHEL, since all the info is on page.

dfc said 2 months ago:

In addition to the information all being on one page you can also:

    git clone https://salsa.debian.org/security-tracker-team/security-tracker.git
    git clone https://git.launchpad.net/ubuntu-cve-tracker

There are a lot of interesting things you can do with the data.

wodny said 2 months ago:

The vulnerability description seems to be lacking an explanation why the /proc/$PID/exe symlink is so special and why using the #!/proc/self/exe hashbang will work while using #!/usr/sbin/runc probably won't. Am I right that the proc filesystem in proc_exe_link() fills the file_operations struct in a way that causes open() not to go through a dereferencing procedure using the filesystem but just open the file used to run the executable?

wodny said 2 months ago:

So I will answer myself. Experiments suggest it is like that: https://www.reddit.com/r/linux/comments/apmptq/cve20195736_r...
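
The behaviour described above (the `/proc/<pid>/exe` link leading to the file itself rather than to a path that gets looked up again) can be observed without touching the exploit. This is a read-only sketch with a hypothetical helper name: it compares the file reached through the link with whatever currently sits at the path the link reports. It assumes a Linux `/proc` and does not reproduce any part of the attack.

```python
import os


def exe_link_info(pid="self"):
    """Compare the file behind /proc/<pid>/exe with the path it reports."""
    link = f"/proc/{pid}/exe"
    target = os.readlink(link)  # a text label such as "/usr/bin/python3"
    st = os.stat(link)          # stat of the file the kernel resolves the link to
    try:
        by_path = os.stat(target)
        same = (by_path.st_dev, by_path.st_ino) == (st.st_dev, st.st_ino)
    except OSError:
        # The reported path no longer names a file we can see (deleted,
        # replaced, or only valid in another mount namespace).
        same = False
    return {
        "target": target,
        "inode": (st.st_dev, st.st_ino),
        "path_still_matches": same,
    }
```

If the executable is replaced on disk while the process keeps running, `path_still_matches` would be expected to turn false while the link still leads to the original file. That matches the idea that the path string is only a label.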

darren0 said 2 months ago:

The best fix is to upgrade to 18.09.2. For those that can't do that immediately, backported versions of runc for Docker releases going back to 1.12.6 are available from Rancher at https://github.com/rancher/runc-cve. But please only do that as a temporary workaround until you can properly upgrade to 18.09.2.

Please patch if you don't 100% trust all users on your host.

peterwwillis said 2 months ago:

For systems where you already enforce security policies, updating policy is faster than upgrading software, due to fewer build processes, quality control issues, and potential side-effects.

If you are using SELinux, verify your containers are running as container_t. If not, verify you are using user namespaces that don't map host root into the container user's namespace. These should mitigate the issue.

(as far as trust goes, just don't trust any local users. there's too many ways to privesc on Linux, and SELinux is the only thing that stops most of them)
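
A minimal sketch of the container_t check suggested above, assuming the process label has been read from `/proc/<pid>/attr/current` (for example, the PID of a container's init process as seen from the host). The helper names are hypothetical, and the expected type is a parameter because policies may use a different type name.

```python
def selinux_type(context):
    """Return the type field of an SELinux context 'user:role:type:level',
    or None if the string does not look like an SELinux context."""
    parts = context.strip("\x00 \t\n").split(":")
    if len(parts) < 3:
        return None
    return parts[2]


def is_confined_container(context, expected_type="container_t"):
    """True if the context's type matches the expected container type."""
    return selinux_type(context) == expected_type
```

For instance, `is_confined_container("system_u:system_r:container_t:s0:c123,c456")` is true, while an unconfined process label is not.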

geofft said 2 months ago:

I believe that if you're in an environment where users don't have (and can't gain) root inside the container, you're also fine, and if that's your theoretical policy, getting to the point where you 100% enforce that might also be easier than patching.

justincormack said 2 months ago:

We also released Docker 18.06.2 with the fix, as a lot of Kubernetes users are on this release.

olemartinorg said 2 months ago:

The new package in Ubuntu Trusty seems to be broken. Not that Trusty is supported for much longer. See https://github.com/docker/for-linux/issues/591

yujie1984 said 2 months ago:

Mesosphere employee here. We have released the product advisory on this CVE. Please check out the advisory and update your software.


geggam said 2 months ago:

Next year this exploit will still be in thousands if not millions of containers all over

There is a distinct lack of knowledge on how to manage a system in the container ecosystem

ec109685 said 2 months ago:

What do you mean _in_ containers?

deathanatos said 2 months ago:

Not that this shouldn't be patched and all, but this seems like it is being treated with more urgency than is required.

If I am understanding the CVE correctly, you need to be able to launch privileged containers with an attacker-controlled image where the container user is root and not namespaced (i.e., the same root as the outside root user). How is this not "on the wrong side of an airtight hatch[1]"?

Am I missing something here? If you can start privileged containers, why not just execute evil.exe directly?

[1]: https://blogs.msdn.microsoft.com/oldnewthing/20060508-22/?p=...

tinco said 2 months ago:

I run a privileged container, of which I am not the author. I also know many other people run this container in privileged mode. No one is paying that person for it, and if they want to or they get compromised at some point we all might get rooted when we update the image.

I think my OS is on a read only filesystem though, and maybe I've got it namespaced correctly as well, but still it's pretty dangerous.

tssuser said 2 months ago:

Privileged containers in docker have a different meaning [1]. A lot of work has gone into trying to harden the default docker container options against container escape, even when the process is running as root. This includes dropping some capabilities, blocking syscalls with seccomp, shadowing sensitive procfs and sysfs paths, hiding most devices, and some LSM hardening [2]. Even with all that it is far more effective to just run as non-root, but hopefully that gives some context for why vulnerabilities like this are treated as high severity.

[1] https://docs.docker.com/engine/reference/commandline/run/#fu...

[2] https://docs.docker.com/engine/security/non-events/

cpuguy83 said 2 months ago:

"Privilged" means root in the container maps to root on the host, not literal "--privileged"

tyingq said 2 months ago:

Is this something that non-privileged containers mitigate? Curious what the big barriers are to this. I know they exist, but aren't used widely...I assume because some functionality doesn't work.

iwalton3 said 2 months ago:

Yes. The lxc commit[1] states that this issue only affects privileged containers. No CVE has been issued for lxc because they consider privileged containers to be insecure.

In my experience unprivileged containers work for most tasks, but there is breakage in some areas. Usually the issues are simple to resolve, like disabling OOM adjustments in systemd or changing the idmap range in winbind to be within the namespace allotment.

[1] https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb8...

brauner said 2 months ago:

I've also written a smallish blogpost about this CVE. I'm a LX{C,D} maintainer and I've worked with Aleksa the runC maintainer together on a fix for this CVE: https://brauner.github.io/2019/02/12/privileged-containers.h...

arno1 said 2 months ago:

Thank you @brauner for writing this blogpost!

IIUC, using Docker's userns-remap would protect against this CVE by making the containers run unprivileged (container's id 0 != host's id 0) and should generally be the industry's best practice.

cpuguy83 said 2 months ago:

In terms of mitigation, simply running as non-root is not enough as it is very difficult to prevent escalation to root (e.g. suid binaries).

What is needed is user namespaces set up such that UIDs in the container are not mapped to any in-use UIDs on the host.
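
Whether a container's UID 0 maps onto UID 0 outside can be read from `/proc/<pid>/uid_map`, where each line holds three numbers: the first ID inside the namespace, the first ID outside, and the range length. A minimal sketch with hypothetical helper names follows. Note that the "outside" column is relative to the namespace of the process reading the file, so it only means "host" when read from the host's initial user namespace.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def maps_to_outside_root(entries):
    """True if any mapped range covers ID 0 on the outside of the namespace."""
    return any(
        outside <= 0 < outside + length
        for _inside, outside, length in entries
    )
```

Without user namespaces the map is typically the identity mapping (`0 0 4294967295`), which covers outside UID 0. A remapped container shows something like `0 100000 65536`, which does not.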

koolba said 2 months ago:

Is this issue specific to containers running as root?

cyphar said 2 months ago:

Yes. You need to be able to run a container as root (or rather, as a user which has write access to the host runc binary -- which is usually root). User namespaces protect you for this reason.

morpheuskafka said 2 months ago:

Yet to see an Ubuntu Security Notice released, I'm presuming an update to the docker.io package will be released?

alexmurray said 2 months ago:

docker.io and runc are in universe, so they are not officially supported by the Ubuntu Security team and won't get an Ubuntu Security Notice

pizlonator said 2 months ago:

Yikes that's a big patch! Just on a meta-level, security vulnerabilities fixed with big patches are usually the least fun.

Also, I would bet that freshly written C code has about 1 RCE bug every 100 LoC. This patch has 236 LoCs so probably about 2.36 RCE's.

loeg said 2 months ago:

1 RCE bug per 100 LoC feels like a very high bet. How did you arrive at that number? I've heard 1 bug per 100 LoC before, but RCEs are a very specific subset of bugs.

pizlonator said 2 months ago:

Yeah, I was joking. But not completely.

Even if you leave a codebase alone in the sense that you only fix security bugs, you’ll end up with a slow trickle that never quite ends. There are the RCEs that get reported plus a bunch that don’t. So, if you:

- project into the future, assuming that if there still has been a trickle of bugs being found then more bugs will also still be found in the future.

- take into account that there are some number of unreported bugs. Maybe for every reported one there is one that isn’t. Dunno the ratio there.

Put all that together and it’s not hard to imagine a 1RCE/100LoC rate.

But still I’m kinda joking. But only slightly. Maybe if I had a way to bet money on this and it was a testable bet (it’s not because of the unreported RCEs) then I’d throw some cash down.

cyphar said 2 months ago:

The version of the patch pushed to master is significantly simpler[1], and the context in which the code runs is so trivial that standard C vulnerabilities are unlikely to happen (meaning that we are doing IO to a memfd and if any errors happen we abort the process -- and immediately after the C code is set up we either execve or we boot the Go runtime which scrubs over all memory anyway).

My first attempts at this patch used Go code but it wasn't possible to protect against all cases. Doing it in C was the only way to do it.

[1]: https://github.com/opencontainers/runc/commit/0a8e4117e7f715...

megous said 2 months ago:

By your guess each new Linux release would have about 5000 new RCEs. So that's 25000-30000 new RCEs over the last year alone.

pjmlp said 2 months ago:

In 2018, 68% of Linux CVEs were caused by C's memory corruption features.

Source, Google talk at Linux Kernel Summit 2018.

geofft said 2 months ago:

See also Fish in a Barrel's Twitter account: https://twitter.com/LazyFishBarrel The vast majority of security updates these days are memory unsafety.

(Fish in a Barrel, LLC is a nonexistent security research "company" consisting of people setting up fuzzers on the weekend and then proceeding to shoot fish in a barrel.)

wahern said 2 months ago:

And how many of those were related to the difficulty of managing IPC across user/kernel boundaries, DMA, mbuf-based networking code, eBPF JIT'ing, etc? So-called memory safe languages don't help with any of that.

I'll take your memory safe languages and up the ante with microkernels, which actually help immensely in all those cases. But Linux isn't going to go that route, either.

pjmlp said 2 months ago:

Plenty of them related to lack of bounds checking handling strings and arrays.

Google has been pushing for the Kernel Self Protection Project for quite a while now, which Android and ChromeOS make best use of, also a reason why the NDK is so constrained on Android.

Security in the kernel also had quite a few talks at Linux Conf 2019, just recently in New Zealand.

loeg said 2 months ago:

68% of N, where N « 30,000.

pizlonator said 2 months ago:

I wasn’t being completely serious.

But taking the joke further, are you counting each release’s lines of code towards the RCEs or only new/modified code?

If you’re counting all vise then you’re double counting RCEs. I wouldn’t double count.

megous said 2 months ago:

New lines only. It's approx half a mil. new lines of code in each recent release.
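
The back-of-the-envelope numbers in this subthread can be reproduced in a few lines. This is only a sketch of the arithmetic as stated by the commenters: the 1-per-100-lines rate is pizlonator's admittedly joking bet, the half-million new lines per release is megous's figure, and the five to six releases per year is an assumption inferred from the 25000-30000 range quoted above.

```python
LINES_PER_RCE = 100  # the (joking) bet: one RCE per 100 lines of new C code


def expected_rces(lines_of_code, lines_per_rce=LINES_PER_RCE):
    """Expected number of RCE bugs under the bet's assumed rate."""
    return lines_of_code / lines_per_rce


patch_estimate = expected_rces(236)          # the runc patch discussed above
per_release = expected_rces(500_000)         # approx. new lines per kernel release
per_year = tuple(per_release * n for n in (5, 6))  # assumed 5 to 6 releases a year

print(patch_estimate, per_release, per_year)
```

The gap between these figures and the number of kernel CVEs actually reported is the basis for the "off by orders of magnitude" replies.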

ebeip90 said 2 months ago:

Probably only off by an order of magnitude. Check out Dmitry’s syzkaller slides.


Edit: I missed the “RCE” context. Most of these are just privescs or memory disclosures.

pizlonator said 2 months ago:

Wasn’t even serious about this, but now that y’all are playing along, I’ll just dig in for fun.

It’s interesting that Linux kernel bug stats contradict my bold bet. But I’m imagining rando C code here, not necessarily open source, not necessarily in the kernel, not necessarily tested and reviewed the same way. This code is at least open source but I dunno to what extent this newly added code path in runc gets the kind of shaking out that makes kernel code solid.

Runc aside, I expect most C code to have a higher rate of every kind of bug than the kernel.

TheDong said 2 months ago:

Large swathes of C code can't have an RCE by definition.

Any C code which is not available on the network (e.g. C code running on your refrigerator) by definition cannot have a remote code execution vulnerability.

Lots of software, such as the 'top' utility, makes no networking related calls in the codebase, so any instances of bugs would be buffer overflows or crashes, but not remotely exploitable by the usual meaning.

I think that you vastly under-estimate how difficult it is to accidentally write a remotely exploitable bug.

Sure, buffer overflows and undefined behavior happen all the time in C code. Those bugs might be 1 per 100 lines even in the average C code.

Of those, hardly any will be network exploitable. Relatively little code will be handling data sourced from the network.

pizlonator said 2 months ago:

That's sort of literally true except that it's hard to predict how code will be used in the future.

For example, I bet that some dude writing an image decoder in the 90's was thinking "it's cool, I don't have to worry about security" because he just knew that his code wasn't going to be remotely exploitable.

Anyway, my original comment was supposed to be as funny as your handle. I guess the humor ended up being just in how seriously folks took it.

The part I'm not joking about is that folks always underestimate the amount of security bugs that will be found in a piece of code in the future, either because the code ends up used in a way that wasn't predicted, or because some really great bug was just waiting for the right kind of genius to uncover it.

loeg said 2 months ago:

Yes. For RCEs, it's hard to accurately compute, but probably off by at least two orders of magnitude (charitably).