Hunting down a cgroup "bug"
I have upcoming exams and, as you can guess, I maximized my productivity by chasing down a “bug” I hit while trying to get DragonflyDB working on my WSL machine! Totally productive use of time, I must say.
Whenever I tried to run the compiled executable, I was greeted with an error:
E20231128 07:53:34.607069 80613 dfly_main.cc:702] Failed in deducing any cgroup limits with paths /sys/fs/cgroup// and /sys/fs/cgroup//. Exiting.
Turns out it could not find the files associated with the memory and cpu controllers in the root cgroup. It’s still a bit unclear to me why, but some controller files are simply not available in the root cgroup, perhaps because they don’t make sense there.
Nonetheless, I could just cgexec the task into a child cgroup. But I noticed the root cgroup didn’t enable the cpu controller for its child cgroups. So let’s try enabling it for the subtree, and…
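A quick way to see which controllers a cgroup has but does not pass on to its children is to compare its cgroup.controllers and cgroup.subtree_control files. This is a minimal sketch, assuming a cgroup v2 mount at /sys/fs/cgroup; the helper names are made up for illustration.

```python
from pathlib import Path


def read_controllers(cgroup_dir, filename):
    """Return the set of controller names listed in a cgroup v2 interface file."""
    path = Path(cgroup_dir) / filename
    try:
        return set(path.read_text().split())
    except OSError:
        # File missing or unreadable (for example, not a cgroup v2 mount).
        return set()


def not_enabled_for_children(cgroup_dir="/sys/fs/cgroup"):
    """Controllers available in this cgroup but absent from its subtree_control."""
    available = read_controllers(cgroup_dir, "cgroup.controllers")
    enabled = read_controllers(cgroup_dir, "cgroup.subtree_control")
    return available - enabled


if __name__ == "__main__":
    print("not enabled for children:", sorted(not_enabled_for_children()))
```

If cpu shows up in that output for the root cgroup, child cgroups will not get the cpu controller files until it is added to the root's cgroup.subtree_control.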
root@CortexAuth:~# echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
-bash: echo: write error: Invalid argument
Okay? That’s interesting. I was quite puzzled here, since +cpu seems to be a valid argument to write here per the documentation [1]. Let’s see what’s up inside the kernel, perhaps? Since we’re not interacting with an actual regular file but with a virtual file system, some procedure must handle this write. A bit of grep-ing leads to the cgroup_subtree_control_write procedure.
In its parsing code, ss refers to the subsystem object and ssid is the numeric id for it. The parsing code has nothing suspicious in it. But we should still have a look into the execution flow.
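The parsing can be modelled in userspace roughly like this. It is a simplified sketch, not the kernel's code: the SUBSYSTEMS list is a hypothetical stand-in for the kernel's subsystem table, and the bit positions are just list indexes.

```python
import errno

# Hypothetical subset of controllers; the index plays the role of ssid.
SUBSYSTEMS = ["cpuset", "cpu", "io", "memory", "pids"]


def parse_subtree_control(buf, subsystems=SUBSYSTEMS):
    """Return (enable_mask, disable_mask); raise OSError(EINVAL) on a bad token."""
    enable = disable = 0
    for tok in buf.strip().split(" "):
        if tok == "":
            continue  # tolerate repeated spaces
        for ssid, name in enumerate(subsystems):
            if tok[1:] != name:
                continue
            if tok[0] == "+":
                enable |= 1 << ssid
                disable &= ~(1 << ssid)
            elif tok[0] == "-":
                disable |= 1 << ssid
                enable &= ~(1 << ssid)
            else:
                raise OSError(errno.EINVAL, "token needs a + or - prefix")
            break
        else:
            # No subsystem matched the name after the prefix character.
            raise OSError(errno.EINVAL, "unknown controller: " + tok)
    return enable, disable


if __name__ == "__main__":
    print(parse_subtree_control("+cpu"))
```

In this model "+cpu" parses cleanly into an enable mask, which is consistent with the error coming from somewhere later in the write path rather than from the parser.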
But it’s WSL!!
I was a bit persistent about doing this in WSL. That was a bit problematic, since kgdb is gonna be a pain to work with inside WSL, if it works at all. So I stuck with printk statements for a while. While debugging I first came across ftrace [2], but its function graphs were clunky to filter further, and the granularity of the information was limited to functions. I do think I could’ve used it in conjunction with the mechanism I mention next, tho I didn’t do that.
What actually worked amazingly well were kprobes [3]. They work pretty much the same way as typical debugger tools: place a breakpoint instruction and handle the hardware’s response to executing it. Under the hood this uses the notifier call chain [4] mechanism.
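The post does not say which kprobes interface was used. One option that needs no kernel module is the tracefs kprobe_events file, where a probe is defined by writing a single line. The sketch below only builds such lines; it assumes tracefs is mounted at /sys/kernel/tracing and that the kernel was built with kprobe event support, and the probe name is arbitrary.

```python
def kprobe_event(symbol, name, on_return=False):
    """Build one definition line for the tracefs kprobe_events file."""
    if on_return:
        # 'r' defines a return probe; $retval fetches the function's return value.
        return "r:{} {} ret=$retval".format(name, symbol)
    # 'p' defines a probe at the function's entry.
    return "p:{} {}".format(name, symbol)


def register(line, tracefs="/sys/kernel/tracing"):
    """Append the definition to kprobe_events (needs root; not called here)."""
    with open(tracefs + "/kprobe_events", "a") as f:
        f.write(line + "\n")


if __name__ == "__main__":
    print(kprobe_event("cgroup_migrate_execute", "mig_ret", on_return=True))
```

A return probe like this would log the function's return value to the trace buffer each time it fires, provided the symbol is probeable on the running kernel.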
Searching for the root cause
Well, I should probably give a tl;dr at this point. I wasted hours kprobing different points, searching for them through the vmlinux object dumps, found the root cause of the “bug”, and later realized that this is already a known issue. Nonetheless, it points out the complexities of the cgroup implementation and why there’s a bit of indecisiveness about its design.
To avoid being redundant, readers should read through this blog post by terenceli and this blog in the Linux Insides series by 0xax. Additionally, the implementation details specific to cgroup v1 [5] mostly carry over to cgroup v2 too, because of compatibility requirements.
The previously mentioned cgroup_subtree_control_write procedure’s parsing did what was intended and the execution flow continued. I was still printk-ing, but narrowed it down to the cgroup_apply_control call at cgroup.c:3429, which further calls cgroup_apply_control_enable and cgroup_update_dfl_csses.
The error originated from cgroup_update_dfl_csses, which is one hog of a procedure.
OOF! Before tackling what’s up here, it’s important to understand the structures central to how cgroups work. We will build upon the cgroups v1 documentation [5] and terenceli’s blog. The following diagram is from terenceli’s blog.
[Figure: diagram from terenceli’s blog]
Because of the complex relations between cgroups, tasks, css_sets and csses, I had to jot them down before tackling it further. I will drop them here for reference:
1. task_struct references just the css_set (cgroup subsystem state set). Why not the cgroup? Mainly performance: usually just the css matters.
2. Different cgroups, each from a different hierarchy, can play a role in parameterizing the subsystems for tasks. The nuance is that a single task can now belong to multiple cgroups in different hierarchies (further clarified later). Since the subsystems enabled in each hierarchy are disjoint, we can be sure there is no conflict over which css is to be used.
3. css_set contains a css for each of the subsystems registered in the kernel. A css may reference one of the ancestor cgroups’ css if the subsystem is disabled further down the hierarchy. cgroup.e_csets and css_set.e_cset_node are introduced to tackle different querying requirements.
4. css_sets can be reused in case a matching one is found. This is potentially to reduce memory usage.
5. css_set.mg_*_node are used to hold the csets that are part of an ongoing migration from one cset to another.
6. css_set.subsys is immutable. This is fair enough, since a child cgroup’s tasks can have css_sets in which some ->subsys[ssid] points to the same object as this subsys[ssid]. Ah, the usual consistency problems!
7. As we have seen, a css_set can belong to tasks from multiple cgroups (see 4). Conversely, a cgroup may have several tasks, each using a different cset. Hence, we observe an N:M relationship between cgroup and css_set. Each cgrp_cset_link references a cgroup and a css_set, entailing a relationship between them.
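The relationships in these notes can be pictured with a small userspace model. It is only a sketch: the classes mirror the kernel's names, but the cgroup paths, css labels and task names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Cgroup:
    path: str
    cset_links: list = field(default_factory=list)  # links to css_sets used here


@dataclass(eq=False)
class CssSet:
    subsys: tuple                                   # one css per subsystem id
    cgrp_links: list = field(default_factory=list)  # links to the cgroups it spans


@dataclass(eq=False)
class CgrpCsetLink:
    cgrp: Cgroup
    cset: CssSet


@dataclass(eq=False)
class Task:
    comm: str
    cset: CssSet  # a task points at its css_set, not at a cgroup


def link(cgrp, cset):
    """Record that tasks using `cset` live in `cgrp` (one N:M edge)."""
    edge = CgrpCsetLink(cgrp, cset)
    cgrp.cset_links.append(edge)
    cset.cgrp_links.append(edge)
    return edge


def cgroups_of(task):
    return [edge.cgrp.path for edge in task.cset.cgrp_links]


def csets_of(cgrp):
    return [edge.cset for edge in cgrp.cset_links]


# One cgroup in the v2 hierarchy, two cgroups in a hypothetical v1 hierarchy.
app = Cgroup("v2:/app")
net_root = Cgroup("v1-net:/")
net_tagged = Cgroup("v1-net:/tagged")

cset_a = CssSet(subsys=("cpu@/app", "net@/"))
cset_b = CssSet(subsys=("cpu@/app", "net@/tagged"))
link(app, cset_a)
link(net_root, cset_a)
link(app, cset_b)
link(net_tagged, cset_b)

t1 = Task("worker-1", cset_a)
t2 = Task("worker-2", cset_b)
```

Here cset_a spans two cgroups (one per hierarchy) while the cgroup app is used by two different css_sets, which is the N:M shape that cgrp_cset_link exists to record.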
A few things to clear up before we move ahead:
- Even in cgroup v2, subsystems that don’t support v2, or any subsystem not bound to v2, can still be bound to v1 hierarchies. This is why we still have the multiple hierarchies and their nuances. Further, a default hierarchy is introduced in the v2 implementation, which is the generic unified v2 hierarchy [6]. Moreover, css_set has a field dfl_cgrp: if the css_set ever gets linked to a cgroup in the default hierarchy, that cgroup is assigned to it (in v2).
- We may wonder now when two csets are considered to be the “same” (see 4). linux-msft-wsl-5.15.y/kernel/cgroup/cgroup.c#L1003 pretty much lays it out: check that the subsystems are the same, that the domain cgroups match, and that the cgroups they are linked to match (with the exception of when both css_sets are on the destination hierarchy; in that case, the candidate css_set should be associated with the destination cgroup). I still don’t know why the ordered comparison works tho.
After this info dump, we can finally try tackling the execution flow (which turns out to be not all that important later…). Referring back to cgroup_update_dfl_csses, we see that all css_sets linked to the cgroups in the subtree are traversed and recorded as migration sources. Each css_set’s mg_src_cgrp and mg_dst_cgrp fields record the source and destination cgroups of the migration, and the mgctx has a preloaded_src_csets field to link together the mg_src_preload_node list heads.
After a write synchronization with threadgroups, cgroup_migrate_prepare_dst is called.
This procedure was a bit of a weird one, because of an odd dependency in how objects are manipulated here.
At a glance, we see that the csets loaded in the migration context are traversed and the associated destination sets are fetched from find_css_set. I ignored it at first, since I could not see how it is decided whether a new cset is to be created or not, which was a bit of a mistake, since it left me a bit more clueless later on. I will do the same in this blog post for now.
So, dst_cset and src_cset are compared: if both turn out to be the same, the source is removed from the migration context; otherwise dst_cset is added to the migration context’s destination cset node list. Finally, we check whether there is a subsystem change and set the corresponding bits in the migration context’s ss_mask to indicate it.
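The comparison and the ss_mask bookkeeping can be sketched as follows. This is a simplified model, not the kernel's code: a css_set is reduced to a tuple of css labels, tuple equality stands in for the kernel's pointer comparison, and the two-entry subsystem list is hypothetical.

```python
SUBSYS = ["cpu", "memory"]  # hypothetical subsystem table; index = ssid


def prepare_dst(pairs, nr_subsys=len(SUBSYS)):
    """pairs: list of (src_cset, dst_cset), each a tuple with one css per ssid.

    Returns (pairs that still need migrating, ss_mask of changed subsystems).
    """
    ss_mask = 0
    remaining = []
    for src, dst in pairs:
        if src == dst:
            # Source and destination are the same set: nothing to migrate.
            continue
        remaining.append((src, dst))
        for ssid in range(nr_subsys):
            if src[ssid] != dst[ssid]:
                ss_mask |= 1 << ssid  # this subsystem's css changes
    return remaining, ss_mask


# Enabling cpu for children: tasks in /a move from the root's cpu css to /a's own.
pairs = [
    (("cpu@/", "mem@/a"), ("cpu@/a", "mem@/a")),
    (("cpu@/", "mem@/"), ("cpu@/", "mem@/")),
]
remaining, ss_mask = prepare_dst(pairs)
```

With only cpu being enabled, just the cpu bit ends up in ss_mask, so only that controller gets asked whether the move is acceptable.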
Then, cgroup_migrate_add_task is called, and all cset nodes in the migration context’s source nodes are checked for tasks, which are added to a taskset.
There is nothing noteworthy here. We finally see the actual migration occurring in the call to cgroup_migrate_execute.
The code portion performs subsystem checks to be sure that the operation will not fail when the actual migration is performed. mgctx->ss_mask, which held the bits for the subsystem changes, is seen here too. Since I added only +cpu to the subtree_control file, it is obvious that ss->can_attach is referencing the cpu controller. That call fails for some reason, and this is where the invalid argument error comes from.
My obvious guess was that something was wrong with RT scheduling, so I compiled the kernel with it turned off, and sure enough the problem was gone. Looking around further led me to this documentation, which mentions that by default new groups get a runtime of 0. I kinda didn’t realize that I had just stumbled on an almost-answer, and I kept pushing on through the source code.
Turns out, the realtime runtime allotted to this task_group is zero. At this point I probed the functions that create task_groups and noticed they were created down in the call to cgroup_apply_control_enable. Now I had to step back a bit and look at the css allocations.
Very straightforward: the cgroups in the hierarchy are checked for whether they have a css associated with them, and otherwise css_create is called.
There’s more to the procedure, but it does not matter for this discussion. ss->css_alloc is called, which rightfully allocates a css associated with the cpu controller. Following the execution flow further, we reach the creation of the task group.
Aha! So the task groups are created with an rt_runtime of zero!
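Put together, the failing check behaves roughly like the model below. It is a sketch under stated assumptions: the class and function names are illustrative rather than the kernel's, and the rule encoded here (a realtime task cannot be attached to a group with zero realtime runtime) is my reading of the behaviour described above.

```python
import errno
from dataclasses import dataclass


@dataclass
class TaskGroup:
    rt_runtime_ns: int = 0  # new groups start with no realtime runtime


@dataclass
class Task:
    comm: str
    realtime: bool


def rt_can_attach(tg, task):
    """A realtime task cannot join a group in which it could never run."""
    return not (task.realtime and tg.rt_runtime_ns == 0)


def can_attach(tg, tasks):
    """Model of the pre-migration check: 0 on success, -EINVAL on refusal."""
    for task in tasks:
        if not rt_can_attach(tg, task):
            return -errno.EINVAL
    return 0


if __name__ == "__main__":
    fresh = TaskGroup()
    print(can_attach(fresh, [Task("bash", False)]))
    print(can_attach(fresh, [Task("some-rt-thread", True)]))
```

A single realtime thread anywhere in the set of tasks being migrated is enough to fail the whole write with EINVAL.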
To see how find_css_set works, we need to note the control propagation, where changes to the subtree_control of a cgroup are propagated down the hierarchy.
In the find_css_set procedure, a call is made to find_existing_css_set, which takes the old css_set, the cgroup it is gonna be transferred to, and a template.
A look into find_existing_css_set shows some calls to the cgroup_e_css_by_mask procedure.
This procedure returns the effective css for the cgroup based on the ss subsystem object. It depends on the earlier propagation of the control flags, which sounds like a bit of a bizarre way to handle this situation. But at least there were comments throughout the source code to help me figure this out!
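The idea of an effective css can be sketched as a walk up the tree until a cgroup is found where the controller is actually enabled. This is a simplified model with made-up names; the real procedure works on bitmasks and handles cases such as threaded cgroups that are ignored here.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subtree_control = set()  # controllers enabled for children
        self.css = {}                 # controller name -> css (a label here)


def ss_enabled(cgrp, ss, root_controllers):
    """A controller is active in a cgroup if its parent enabled it for children."""
    if cgrp.parent is None:
        return ss in root_controllers
    return ss in cgrp.parent.subtree_control


def effective_css(cgrp, ss, root_controllers):
    """Walk towards the root until a cgroup with `ss` enabled is found."""
    while cgrp is not None and not ss_enabled(cgrp, ss, root_controllers):
        cgrp = cgrp.parent
    return cgrp.css.get(ss) if cgrp is not None else None


ROOT_CONTROLLERS = {"cpu", "memory"}
root = Cgroup("/")
root.css = {"cpu": "cpu@/", "memory": "mem@/"}
root.subtree_control = {"memory"}
child = Cgroup("/a", parent=root)
child.css = {"memory": "mem@/a"}

before = effective_css(child, "cpu", ROOT_CONTROLLERS)  # falls back to the root's css

# After "+cpu" is propagated and a css is created for the child:
root.subtree_control.add("cpu")
child.css["cpu"] = "cpu@/a"
after = effective_css(child, "cpu", ROOT_CONTROLLERS)
```

The same lookup gives a different answer before and after the flags are updated, which is why the destination css_set computed during migration differs from the source one.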
And with that, we have the root cause figured out: the csets of the newly allocated cgroups all seem to have a task group with a runtime of zero. At this point I finally tried googling about it and, as it turns out, it was already mentioned in the docs. Which was a kinda dumb oversight on my part. rtkit-daemon was started by systemd in a non-root cgroup during startup, with one of its threads as a realtime task. That tripped up the migration of the task to the new css_set with the newly allocated csses for the cpu subsystem.
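To check whether some realtime thread is lurking in a cgroup on your own machine, something like the following can help. It is a minimal sketch, assuming a Linux /proc layout; threads that vanish mid-scan or cannot be queried are skipped, and the function names are made up for illustration.

```python
import os

# Fall back to the Linux values if the constants are missing on this platform.
RT_POLICIES = {getattr(os, "SCHED_FIFO", 1), getattr(os, "SCHED_RR", 2)}
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0x40000000)


def is_realtime(policy):
    """True for SCHED_FIFO / SCHED_RR, ignoring the reset-on-fork flag."""
    return (policy & ~RESET_ON_FORK) in RT_POLICIES


def realtime_threads(proc="/proc"):
    """Return a list of (tid, comm, cgroup) for threads with a realtime policy."""
    found = []
    try:
        pids = [p for p in os.listdir(proc) if p.isdigit()]
    except OSError:
        return found
    for pid in pids:
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while scanning
        for tid in tids:
            try:
                policy = os.sched_getscheduler(int(tid))
                with open(os.path.join(task_dir, tid, "comm")) as f:
                    comm = f.read().strip()
                with open(os.path.join(task_dir, tid, "cgroup")) as f:
                    cgroup = f.read().strip()
            except (OSError, AttributeError, ValueError):
                continue  # thread gone, not permitted, or API unavailable
            if is_realtime(policy):
                found.append((int(tid), comm, cgroup))
    return found


if __name__ == "__main__":
    for tid, comm, cgroup in realtime_threads():
        print(tid, comm, cgroup)
```

Any entry whose cgroup is not the root is a candidate for the kind of thread that can make a write of +cpu to cgroup.subtree_control fail.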
And… that’s it!
Well, of course not. I tried to find discussions about the status of support for realtime processes. Tejun Heo seems to be the main person managing cgroup v2, and I could not find any further discussions about it. Fixing it would probably call for a lot of heuristics, since the realtime scheduling knobs act as an absolute slice of total CPU time and hence cannot be composed easily in a regular manner [7].
Nonetheless, I did find something funny that Linux nerds left out in the wild of the internet.
Time to study for exams ig. Laters!