Hunting down a cgroup "bug"

I have upcoming exams and, as you can guess, I tried to maximize my productivity during this time by figuring out a “bug” I encountered while trying to get DragonflyDB working on my WSL machine! Totally productive use of time, I must say.

Whenever I tried to run the compiled executable, I was greeted with an error:

E20231128 07:53:34.607069 80613 dfly_main.cc:702] Failed in deducing any cgroup limits with paths /sys/fs/cgroup// and /sys/fs/cgroup//. Exiting.

Turns out that it could not find the specific files associated with the memory and cpu controllers for the root cgroup. Even though it’s still a bit unclear to me why, it seems some controller files are not available in the root cgroup, perhaps because they don’t make sense there.

Nonetheless, I could just cgexec the task into a child cgroup. But I noticed the root cgroup didn’t have the cpu controller enabled for its child cgroups. So let’s try enabling it for the subtree, and…
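To see this on your own machine, you can list what a cgroup directory actually exposes. This is a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; memory.max and cpu.max are the usual v2 limit files, but whether DragonflyDB looks for exactly these is my assumption, and describe_cgroup is a helper name I made up.

```python
from pathlib import Path

# Assumption: these are the limit files a v2-aware program would look for.
LIMIT_FILES = ("memory.max", "cpu.max")


def read_tokens(path):
    """Return the whitespace-separated tokens of a cgroup file, or [] if missing."""
    try:
        return Path(path).read_text().split()
    except OSError:
        return []


def describe_cgroup(cgroup_dir):
    """Summarize which controllers and limit files a cgroup directory exposes."""
    base = Path(cgroup_dir)
    return {
        "controllers": read_tokens(base / "cgroup.controllers"),
        "subtree_control": read_tokens(base / "cgroup.subtree_control"),
        "limit_files": [name for name in LIMIT_FILES if (base / name).exists()],
    }


if __name__ == "__main__":
    print(describe_cgroup("/sys/fs/cgroup"))
```

Running it against the root cgroup and then against a child cgroup makes the difference between the two easy to spot.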

root@CortexAuth:~# echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
-bash: echo: write error: Invalid argument

Okay? That’s interesting. I was quite puzzled, since +cpu seems to be a valid argument here per the documentation ¹. Let’s see what’s up inside the kernel, perhaps? Since we’re not interacting with a regular file but with a virtual file system, some procedure must handle this write. A bit of grep-ing leads to cgroup_subtree_control_write in kernel/cgroup/cgroup.c.

In that code, ss refers to the subsystem object and ssid is its numeric id. The parsing code has nothing suspicious in it, but we should still have a look at the execution flow.
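To make the parsing step concrete, here is a toy userspace model of how I understand the tokens get turned into enable and disable bitmasks. The SUBSYSTEMS table and the function name are hypothetical; in the kernel the ids come from the build configuration, and a bad token results in -EINVAL rather than an exception.

```python
# Hypothetical controller table; the list index plays the role of ssid.
SUBSYSTEMS = ["cpuset", "cpu", "io", "memory", "pids"]


def parse_subtree_control(buf):
    """Parse a write such as "+cpu -memory" into (enable, disable) bitmasks.

    Raises ValueError for a malformed token or unknown controller,
    standing in for the kernel returning -EINVAL.
    """
    enable = disable = 0
    for token in buf.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or name not in SUBSYSTEMS:
            raise ValueError(f"invalid token: {token!r}")
        bit = 1 << SUBSYSTEMS.index(name)  # 1 << ssid
        if sign == "+":
            enable |= bit
            disable &= ~bit
        else:
            disable |= bit
            enable &= ~bit
    return enable, disable


if __name__ == "__main__":
    print(parse_subtree_control("+cpu"))
```

Under this model "+cpu" parses cleanly, which matches the observation that the parser is not where the error comes from.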

But it’s WSL!!

I was rather insistent on doing this in WSL. That was a bit problematic, since kgdb would be a pain to work with inside WSL, if it worked at all. So I stuck with printk statements for a while. During the debugging I first ran into ftrace², but its function graphs were clunky to filter further, and the granularity of the information was limited to functions. I do think I could have used it in conjunction with the mechanism I mention next, though I didn’t.

What actually worked amazingly well were kprobes³. They work pretty much the same way as typical debugger tools: place a breakpoint instruction and handle the hardware’s response to executing it. Under the hood this uses the notifier call chain⁴ mechanism.
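One low-effort way to place such a probe is the kprobe event interface in tracefs, which is not necessarily what I used during this hunt. The sketch below only builds and writes the definition line for a return probe; the group and event names are made up, the tracefs mount point is assumed to be /sys/kernel/tracing, and the probed symbol is just an example that has to exist in your kernel.

```python
import os


def kretprobe_definition(symbol, event, group="cgdebug"):
    """Build a kprobe_events line for a return probe that records the return value."""
    return f"r:{group}/{event} {symbol} ret=$retval:s32"


def register_probe(definition, tracefs="/sys/kernel/tracing"):
    """Append the definition to kprobe_events (needs root on a real tracefs)."""
    with open(os.path.join(tracefs, "kprobe_events"), "a") as f:
        f.write(definition + "\n")


if __name__ == "__main__":
    # Example symbol only; check /proc/kallsyms for what your kernel offers.
    print(kretprobe_definition("cpu_cgroup_can_attach", "can_attach_ret"))
```

After registering, the event still has to be enabled through its enable file under events/ in tracefs before hits show up in the trace buffer.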

Searching the root cause

Well, I should probably give a tl;dr at this point. I wasted hours kprobing different points and searching for them in the vmlinux object dumps, found the root cause of the “bug”, and later realized that this is already a known issue. Nonetheless, it points out the complexities of the cgroup implementation and why there is a bit of indecisiveness about its design.

To avoid being redundant, readers should read through this blog post by terenceli and this post in the Linux Insides series by 0xax. Additionally, the implementation details specific to cgroup v1⁵ mostly carry over to cgroup v2 too, because of compatibility requirements.

The previously mentioned cgroup_subtree_control_write procedure’s parsing did what was intended and the execution flow continued. I was still printk-ing, but narrowed it down to the cgroup_apply_control call at cgroup.c:3429, which further calls cgroup_apply_control_enable and cgroup_update_dfl_csses

The error originated from cgroup_update_dfl_csses, which is one hog of a procedure

OOF! Before tackling what’s going on here, it’s important to understand the structures behind the workings of cgroups. We will build upon the cgroups v1 documentation⁵ and terenceli’s blog. The following diagram is from terenceli’s blog

Because of complex relations between cgroups, tasks, css_sets and csses, I had to jot them down before tackling it further. I will drop them here for reference:

1. task_struct references just the css_set (cgroup subsystem state set). Why not the cgroup? Mainly performance: usually only the css matters.
2. Different cgroups, each from a different hierarchy, can play a role in parameterizing the subsystems for a task. The nuance is that a single task can now belong to multiple cgroups in different hierarchies (further clarified later). Since the subsystems enabled in each hierarchy are disjoint, we can be sure there is no conflict about which css is to be used.
3. css_set contains one css for each of the subsystems registered in the kernel. An entry may reference the css of an ancestor cgroup if the subsystem is disabled further down the hierarchy. cgroup.e_csets and css_set.e_cset_node are introduced to handle different querying requirements.
4. css_sets can be reused when a matching one is found. This is presumably to reduce memory usage.
5. css_set.mg_*_node are used to hold the csets that are part of an ongoing migration from one cset to another.
6. css_set.subsys is immutable. This is fair enough, since a child cgroup’s tasks can have css_sets where some ->subsys[ssid] points to the same object as this subsys[ssid]. Ah, the usual consistency problems!
7. As we have seen, a css_set can belong to tasks from multiple cgroups (see 4). Conversely, a cgroup may have several tasks, each using a different cset. Hence, we observe an N:M relationship between cgroup and css_set. Each cgrp_cset_link references a cgroup and a css_set, recording one relationship between them.
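The relationships above are easier to hold in your head as a toy model. This is only a sketch in Python, not the kernel layout: Css, CssSet and CssSetTable are illustrative names, object identity stands in for pointer equality, and cgroups are plain strings with one entry per hierarchy.

```python
from dataclasses import dataclass


@dataclass(eq=False)          # identity comparison, like kernel pointers
class Css:
    subsystem: str
    cgroup: str


@dataclass(eq=False)
class CssSet:
    subsys: tuple             # one Css per subsystem
    cgroups: tuple            # one cgroup per hierarchy


class CssSetTable:
    def __init__(self):
        self._by_key = {}
        self.links = set()    # (cgroup, css_set) pairs, like cgrp_cset_link

    def find_css_set(self, csses, cgroups):
        """Reuse a matching css_set, or create and link a new one."""
        key = (tuple(id(c) for c in csses), tuple(cgroups))
        cset = self._by_key.get(key)
        if cset is None:
            cset = self._by_key[key] = CssSet(tuple(csses), tuple(cgroups))
            self.links.update((cg, cset) for cg in cgroups)
        return cset

    def csets_of(self, cgroup):
        return {cset for cg, cset in self.links if cg == cgroup}

    def cgroups_of(self, cset):
        return {cg for cg, s in self.links if s is cset}


if __name__ == "__main__":
    cpu_a = Css("cpu", "h1:/a")
    mem_x, mem_y = Css("memory", "h2:/x"), Css("memory", "h2:/y")
    table = CssSetTable()
    s1 = table.find_css_set([cpu_a, mem_x], ("h1:/a", "h2:/x"))
    s2 = table.find_css_set([cpu_a, mem_y], ("h1:/a", "h2:/y"))
    print(len(table.csets_of("h1:/a")), sorted(table.cgroups_of(s1)))
```

In the usage at the bottom, cgroup h1:/a ends up linked to two css_sets while each css_set is linked to two cgroups, which is the N:M relationship from the last point.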

A few things to clear up before we move ahead:

Even with cgroup v2, subsystems that don’t support v2 can still be bound to a v1 hierarchy, and so can any subsystem that is not currently bound to v2. This is why we still have multiple hierarchies and the nuances that come with them. Further, the v2 implementation introduces a default hierarchy, which is the generic unified v2 hierarchy⁶. Moreover, css_set has a field dfl_cgrp: if a css_set ever gets linked to a cgroup in the default hierarchy, that cgroup is assigned to it (in v2).
We may wonder when two csets are considered to be the “same” (see 4). linux-msft-wsl-5.15.y/kernel/cgroup/cgroup.c#L1003 pretty much lays it out: check that the subsystems are the same, that the domain cgroups match, and that the cgroups they are linked to match (with the exception that, on the destination hierarchy, the candidate css_set should be associated with the destination cgroup). I still don’t know why the ordered comparison works, though.

After this info dump, we can finally try tackling the execution flow (which turns out not all that important later…). So referring back to the cgroup_update_dfl_csses, we see that

All css_sets in the subtree are traversed and recorded as migration sources. Each one gets css_set.mg_src_cgrp and css_set.mg_dst_cgrp filled in to record the source and destination cgroups of the migration, and the mgctx has a preloaded_src_csets field that links together the mg_src_preload_node list heads.

After a write synchronization with threadgroups, cgroup_migrate_prepare_dst is called

This procedure was a bit of a weird one, because of an odd dependency in how objects are manipulated here. At a glance, we see that the csets loaded in the migration context are traversed and the associated destination sets are fetched from find_css_set. I ignored find_css_set at first, since I could not see how it decides whether a new cset is to be created or not. That was a bit of a mistake, since later on it left me more clueless. I will do the same in this blog post for now.

So, dst_cset and src_cset are compared, and if they turn out to be the same, the source is removed from the migration context. Otherwise, dst_cset is added to the migration context’s destination cset list. Finally, we check whether any subsystem’s css differs between the two and set the corresponding bit in the migration context’s ss_mask to indicate a subsystem change.
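The bookkeeping in this step can be sketched in a few lines. This is a toy model under my own assumptions: a css_set is just a dict from subsystem name to a css label, find_dst stands in for find_css_set, and ss_mask is a set of names instead of a bitmask.

```python
def prepare_dst(src_csets, find_dst):
    """Toy version of the destination-preparation step.

    src_csets: list of dicts mapping subsystem name -> css label.
    find_dst:  callable returning the destination css_set for a source one.
    Returns (remaining sources, destinations, set of changed subsystems).
    """
    remaining, destinations, ss_mask = [], [], set()
    for src in src_csets:
        dst = find_dst(src)
        if dst == src:
            continue                      # nothing to migrate for this css_set
        remaining.append(src)
        if dst not in destinations:
            destinations.append(dst)
        # Record every subsystem whose css differs between source and destination.
        ss_mask |= {ss for ss in src if src[ss] != dst.get(ss)}
    return remaining, destinations, ss_mask


if __name__ == "__main__":
    src = {"cpu": "root.cpu", "memory": "root.memory"}
    print(prepare_dst([src], lambda s: {**s, "cpu": "child.cpu"}))
```

With only the cpu css changing, the resulting mask contains just cpu, which is why only that controller gets consulted later.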

Then, cgroup_migrate_add_task is called: all source csets in the migration context are checked for tasks, and those tasks are added to a taskset

There is nothing noteworthy here. We finally see the actual migration occurring in the call to cgroup_migrate_execute

This portion of the code runs the subsystem checks to be sure the operation will not fail when the actual migration is performed. mgctx->ss_mask, which holds the bits for the subsystem changes, shows up here too. Since I only added +cpu to the subtree_control file, ss->can_attach must be referencing the cpu controller. For some reason it fails, and we finally see the cause of the invalid argument error,
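As far as I can tell, the check that trips here boils down to the following, at least when realtime group scheduling is compiled in. The sketch is a userspace model, not kernel code: Task, TaskGroup and cpu_can_attach are names I picked for illustration.

```python
import errno
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    realtime: bool        # True for SCHED_FIFO or SCHED_RR tasks


@dataclass
class TaskGroup:
    rt_runtime_ns: int    # realtime budget of the destination group


def cpu_can_attach(group, tasks):
    """Return 0 if every task may join the group, else -EINVAL."""
    for task in tasks:
        # A realtime task could never run in a group with no realtime budget.
        if task.realtime and group.rt_runtime_ns == 0:
            return -errno.EINVAL
    return 0


if __name__ == "__main__":
    fresh_group = TaskGroup(rt_runtime_ns=0)
    print(cpu_can_attach(fresh_group, [Task("some-rt-thread", True)]))
```

A single realtime task headed for a group with zero realtime budget is enough to reject the whole migration.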

My obvious guess was that something was wrong with RT scheduling, so I compiled the kernel with it turned off, and sure enough the problem was gone. Looking around further led me to this documentation, which mentions that by default new groups get a runtime of 0. I kinda didn’t realize that I had just stumbled on an almost-answer, and I kept pushing on through the source code

Turns out, the realtime runtime allocated to this task_group is zero. At this point I probed the functions that create task_groups and noticed they were created down in the call to cgroup_apply_control_enable. Now I had to step back a bit and look at the css allocations

Very straightforward: the cgroups in the hierarchy are checked for whether they already have a css associated with them, and otherwise css_create is called

There’s more to the procedure, but it does not matter for this discussion. ss->css_alloc is called, which rightfully allocates a css associated with the cpu controller. Following the execution flow, we encounter this,

Aha! So the task groups are created with a realtime runtime of zero!

To find out how find_css_set works, we need to note the control propagation, where changes to the subtree_control of a cgroup are propagated to its descendants
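A rough picture of that propagation, as I understand it: every cgroup in the subtree has its own subtree_control clamped to whatever its parent delegates to it. The Cgroup class and propagate_control below are a simplified model with sets instead of bitmasks, and they leave out the implicit-controller handling.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.subtree_control = set()
        if parent:
            parent.children.append(self)

    def control(self, root_controllers):
        """Controllers this cgroup may use: what its parent delegates to it."""
        return self.parent.subtree_control if self.parent else root_controllers


def propagate_control(cgrp, root_controllers):
    """Pre-order walk clamping subtree_control to the delegated controllers."""
    cgrp.subtree_control &= cgrp.control(root_controllers)
    for child in cgrp.children:
        propagate_control(child, root_controllers)


if __name__ == "__main__":
    root = Cgroup("/")
    child = Cgroup("/a", root)
    root.subtree_control = {"memory"}
    child.subtree_control = {"cpu", "memory"}
    propagate_control(root, {"cpu", "memory"})
    print(child.subtree_control)
```

Because the walk is pre-order, a change near the top has already been applied to a parent by the time its children are clamped.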

In the find_css_set procedure, a call is made to find_existing_css_set, which takes the old css_set, the cgroup it is going to be transferred to, and a template

A look into find_existing_css_set shows some calls to the cgroup_e_css_by_mask procedure

This procedure returns the effective css for the cgroup, based on the ss subsystem object. It depends on the flag updates propagated earlier, which sounds like a slightly bizarre way to handle this situation. But at least there were comments throughout the source code to help me figure this out!

And with that, we have the root cause figured out: the csets of the newly allocated cgroup all seem to have a task_group with a runtime of zero. At this point I finally tried googling it and, as it turns out, it was already mentioned in the docs, which was a kinda dumb oversight on my part. rtkit-daemon was started by systemd in a non-root cgroup during startup, with one of its threads as a realtime task. That tripped up the migration of the task to the new css_set with the newly allocated css for the cpu subsystem
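The idea of an effective css can be modelled as a walk towards the root until some cgroup actually has the subsystem enabled. This is a simplified sketch: CgroupNode and effective_css are illustrative names, and the returned tuple just labels which cgroup’s css would be used.

```python
class CgroupNode:
    def __init__(self, name, parent=None, enabled=()):
        self.name, self.parent = name, parent
        self.enabled = set(enabled)   # subsystems that have their own css here


def effective_css(cgrp, subsystem):
    """Return (owner cgroup name, subsystem) for the nearest ancestor-or-self
    that has the subsystem enabled, or None if none does."""
    while cgrp is not None:
        if subsystem in cgrp.enabled:
            return (cgrp.name, subsystem)
        cgrp = cgrp.parent
    return None


if __name__ == "__main__":
    root = CgroupNode("/", enabled={"cpu", "memory"})
    child = CgroupNode("/a", root, enabled={"memory"})
    print(effective_css(child, "cpu"), effective_css(child, "memory"))
```

In the usage above, the child has no cpu css of its own, so its effective cpu css is the root’s, while memory resolves to the child itself.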
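If you want to check for such threads before touching subtree_control, one option is to scan /proc for anything running under a realtime policy. This is a minimal sketch, assuming a Linux /proc where field 41 of each stat file is the scheduling policy; the function names are my own.

```python
import os

SCHED_FIFO, SCHED_RR = 1, 2   # Linux policy numbers for the realtime classes


def parse_policy(stat_line):
    """Extract the scheduling policy (field 41) from a /proc/.../stat line."""
    # comm (field 2) may contain spaces, so split after its closing parenthesis.
    rest = stat_line[stat_line.rfind(")") + 2:].split()
    return int(rest[41 - 3])  # rest[0] is field 3 (state)


def realtime_threads(proc="/proc"):
    """Yield (pid, tid, comm) for every thread under SCHED_FIFO or SCHED_RR."""
    try:
        pids = [p for p in os.listdir(proc) if p.isdigit()]
    except OSError:
        return                            # no procfs here
    for pid in pids:
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue                      # process exited meanwhile
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    line = f.read()
            except OSError:
                continue                  # thread exited meanwhile
            if parse_policy(line) in (SCHED_FIFO, SCHED_RR):
                comm = line[line.find("(") + 1:line.rfind(")")]
                yield int(pid), int(tid), comm


if __name__ == "__main__":
    for pid, tid, comm in realtime_threads():
        print(pid, tid, comm)
```

Any thread this prints that lives outside the root cgroup is a candidate for blocking the +cpu write.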

And… that’s it!

Well, of course not. I tried to find discussions about the status of support for realtime processes. Tejun Heo seems to be the main person managing cgroup v2, and I could not find any further discussions about it. Fixing it would probably call for a lot of heuristics, since the realtime scheduling knobs act as an absolute slice of total CPU time and hence cannot be composed easily in a regular manner⁷.

Nonetheless, I did find something funny that Linux nerds left out in the wilds of the internet,

Time to study for exams ig. Laters!

2023-11-28

https://thiazikara.github.io/post/cgroup_rt/ Deepak Sharma