Diffstat (limited to 'guides/sysadmin/machines/amdahl30/killing-time.org')
| -rw-r--r-- | guides/sysadmin/machines/amdahl30/killing-time.org | 301 |
1 files changed, 301 insertions, 0 deletions
diff --git a/guides/sysadmin/machines/amdahl30/killing-time.org b/guides/sysadmin/machines/amdahl30/killing-time.org new file mode 100644 index 0000000..f781582 --- /dev/null +++ b/guides/sysadmin/machines/amdahl30/killing-time.org @@ -0,0 +1,301 @@ +* Failure +On November 19 2024, LDLC's off-brand SSD died on me. RIP. +Re-installed Tumbleweed on the replacement (Kingston SA400S3) on +November 28. Since then… +** Performance loss +Getting uncannily reproducible frame drops (60 ↘ 40±10, movement +visibly choppy) in Hades Ⅱ when moving toward effects/particles-heavy +areas. No idea WTF, those areas ran fine before. + +- "High" graphics setting at native 1920×1080 resolution. + - Tried "Low" graphics, lowered resolution, disabled vsync: symptoms + persist. +- Not forcing any "compatibility tool" version, assuming this yields + "Proton Experimental". + - Tried a couple of old Proton versions: symptoms persist. +- Reinstalled game & nuked everything under + - =~/.cache/mesa_shader_cache*= + - =~/.cache/radv_builtin_shaders*= + - =~/.config/unity3d= + - =~/.local/share/Steam= + - =~/.local/share/vulkan/= + - =~/.steam*= + in case "stale shaders" were to blame or something. +- Tumbleweed/Plasma/Wayland session. + - Tried X11: symptoms persist. +- Reducing noise with =balooctl6 suspend=, =swapoff -a= (RAM nowhere + near exhausted). + +Well then. +*** CPU frequency scaling? +Started by noticing that the Plasma "Power Management" tray widget +says "Power Profile" is "Not available". Not 100% sure whether that +was the case with the old installation; maybe I had had something +configured or installed to enable this? 
+ +Internet says "install and enable power-profiles-daemon", except +that's on: + +#+begin_example +$ systemctl status power-profiles-daemon.service +● power-profiles-daemon.service - Power Profiles daemon + Loaded: loaded (/usr/lib/systemd/system/power-profiles-daemon.service; disabled; preset: disabled) + Active: active (running) since Sun 2024-12-01 11:46:32 CET; 45min ago + Invocation: b2545a02bc9642b7aeb5f370e8b50e7c + Main PID: 2289 (power-profiles-) + Tasks: 4 (limit: 18320) + CPU: 52ms + CGroup: /system.slice/power-profiles-daemon.service + └─2289 /usr/libexec/power-profiles-daemon +#+end_example + +But: + +#+begin_example +$ powerprofilesctl +,* balanced: + PlatformDriver: placeholder + + power-saver: + PlatformDriver: placeholder +#+end_example + +Internet says I am missing the right scaling driver, and seems very +keen on enabling =amd_pstate=, which I do not seem to have available: + +#+begin_example +$ cpupower frequency-info +analyzing CPU 5: + driver: acpi-cpufreq + CPUs which run at the same hardware frequency: 5 + CPUs which need to have their frequency coordinated by software: 5 + maximum transition latency: Cannot determine or is not supported. + hardware limits: 1.40 GHz - 3.70 GHz + available frequency steps: 3.70 GHz, 1.70 GHz, 1.40 GHz + available cpufreq governors: ondemand performance schedutil + current policy: frequency should be within 1.40 GHz and 3.70 GHz. + The governor "schedutil" may decide which speed to use + within this range. + current CPU frequency: Unable to call hardware + current CPU frequency: 3.30 GHz (asserted by call to kernel) + boost state support: + Supported: yes + Active: no + +$ zcat /proc/config.gz | grep -i pstate +CONFIG_X86_INTEL_PSTATE=y +CONFIG_X86_AMD_PSTATE=y +CONFIG_X86_AMD_PSTATE_DEFAULT_MODE=3 +# CONFIG_X86_AMD_PSTATE_UT is not set +#+end_example + +=/proc/config.gz= suggests the kernel configuration supports it, but +=cpupower= does not seem to know about it. 
=dmesg= offers: + +#+begin_example +$ sudo dmesg -H +[…] amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled +#+end_example + +Though: + +#+begin_example +$ lscpu | grep -i cppc +Flags: […] cppc […] +#+end_example + +So ACPI problem? Lots of posts mentioning =amd_= parameters on the +kernel command-line but AFAIU those are stale with newer kernels (6.11 +here) which automatically (attempt to) load the =amd_pstate= driver. + +Went through the UEFI menu and found nothing related to ACPI or +[[https://forum.level1techs.com/t/amd-p-state-driver/197885/24][X2APIC]]. Skeptical about UEFI settings anyway, since I did not change them +between the old and new installations. + +/Some time later/ + +Probably not ACPI, =dmesg= is chock-full of ACPI noise. OTOH, using +some diagnosis methods from [[https://bugzilla.kernel.org/show_bug.cgi?id=218171][this kernel bug report]]: + +#+begin_example +$ find /sys/devices -name '*cppc*' +🦗 +#+end_example + +(=acpidump ; acpixtract ; iasl ; grep -i cpc *.dsl= also yields 🦗, +but =iasl= complains about "unresolved" "control methods", so 🤷) + +/Some time later/ + +[[https://wiki.archlinux.org/title/CPU_frequency_scaling#amd_pstate][ArchWiki]] does say "Change /Enable CPPC/ […] from /Auto/ to /Enabled/". +My UEFI menu tucks that under /Overclocking → Advanced CPU +Configuration → AMD CBS → CPPC CTRL/. 
That change *does* convince +Linux to enable =amd_pstate=; going over the previous tests in reverse +order: + +#+begin_example +$ [… acpidump && acpixtract && iasl … ] && grep -i cpc *.dsl +ssdt1.dsl: Name (_CPC, Package (0x17) // _CPC: Continuous Performance Control +[… repeats 12 times …] + +$ find /sys/devices -name '*cppc*' -o -name '*pstate*' | tr -s '[:digit:]' N | sort -u +/sys/devices/system/cpu/amd_pstate +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_highest_perf +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_hw_prefcore +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_lowest_nonlinear_freq +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_max_freq +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_prefcore_ranking +/sys/devices/system/cpu/cpuN/acpi_cppc + +$ sudo dmesg -H +[… ominous silence about amd_pstate …] + +$ cpupower frequency-info +analyzing CPU 1: + driver: amd-pstate-epp + CPUs which run at the same hardware frequency: 1 + CPUs which need to have their frequency coordinated by software: 1 + maximum transition latency: Cannot determine or is not supported. + hardware limits: 400 MHz - 4.31 GHz + available cpufreq governors: performance powersave + current policy: frequency should be within 2.38 GHz and 4.31 GHz. + The governor "powersave" may decide which speed to use + within this range. + current CPU frequency: Unable to call hardware + current CPU frequency: 3.57 GHz (asserted by call to kernel) + boost state support: + Supported: yes + Active: yes + AMD PSTATE Highest Performance: 255. Maximum Frequency: 4.31 GHz. + AMD PSTATE Nominal Performance: 219. Nominal Frequency: 3.70 GHz. + AMD PSTATE Lowest Non-linear Performance: 141. Lowest Non-linear Frequency: 2.38 GHz. + AMD PSTATE Lowest Performance: 24. Lowest Frequency: 400 MHz. 
+ +$ powerprofilesctl + performance: + CpuDriver: amd_pstate + Degraded: no + +,* balanced: + CpuDriver: amd_pstate + PlatformDriver: placeholder + + power-saver: + CpuDriver: amd_pstate + PlatformDriver: placeholder +#+end_example + +And lo, the 🍃↔🚀 slider appears in the Power Management tray widget. + +Nervous about entering the "Overclocking" UEFI zone tho, and concerned +about these "Maximum frequencies". + +/And does it even help with the game?/ + +🥁 + +No. No it does not; no discernible difference in FPS nor vibes. + +Will assume this new baseline cannot hurt - OT1H "overclocking" is +scary, OTOH Linux now has a finer handle on the CPU and hopefully will +not overwork it to death? +*** Sᴇᴠᴇʀᴀʟ Wᴇᴇᴋꜱ Lᴀᴛᴇʀ +- [[https://www.gamingonlinux.com/forum/topic/5475/page=1/][ridge reports]] "bad frame pacing on ADMGPU", + - when vsync is turned off: a non-factor in my testing, + - lots of useful information in that thread tho and + interesting-sounding pointers, + - [[https://www.gamingonlinux.com/forum/topic/5475/page=2/#r42519][Shmerl]] says: + - games can cause stutter by underloading the GPU, causing it to + drop out of "high performance mode", + - (=amdgpu_top= and =radeontop= do confirm that lag spikes + correlate with GPU usage drop) + - see [[https://gitlab.freedesktop.org/drm/amd/-/issues/1500][drm/amd#1500]]: + - /lots/ of sysfs noodling there; unfortunately, none of the + suggested settings for =power_dpm_force_performance_level= & + =pp_power_profile_mode= change the symptoms. + +- In [[https://gitlab.freedesktop.org/drm/amd/-/issues/3618#note_2689087][this drm/amd#3618 thread]], @agd5f suggests "6.11 stable kernels" + include a fix for the issue at hand there and a further rework "was + submitted to 6.13"; @mattipulkkinen reports happy results with + 6.13-rc2 (FTR, symptoms persist here with 6.12.8). 
+ +- Piggybacked onto [[https://gitlab.freedesktop.org/mesa/mesa/-/issues/11300][mesa/mesa#11300]]: + - common: Hades Ⅱ, iGPU, recent kernel & Mesa, Proton Experimental, + - differences: Fedora, GNOME, X11, + - noteworthy: good performance on Windows, + - suggestion by @Venemo: downgrade & bisect Mesa; + - tempting, though scared of bricking graphical sessions and/or + ending up with a frankensystem (installing binaries under a + prefix is probably easy, but then keeping track of config tweaks + and cache artifacts sounds fraught). + +- In [[https://gitlab.freedesktop.org/upower/power-profiles-daemon/-/issues/164][upower/power-profiles-daemon#164]], @Nyan reports problematic iGPU + capping; not convinced this is applicable though, given the reported + symptoms (video playback is fine here). + +- Seen reports of Variable Refresh Rate causing problems: + - searched high and low to understand why VRR appears nowhere in + Plasma settings, despite the start menu turning up "Display + Configuration" when searching for "VRR", + - mystery solved by ~kscreen-doctor -o~: =Vrr: incapable= 🤷 + +- [[https://www.techpowerup.com/forums/threads/what-fixed-stuttering-and-random-framerate-spikes-in-games-for-me.327264/][aska33j proclaims]] that /disabling CPPC/ "fixed stuttering and random + framerate spikes in games for [them]" so… roundtrip to UEFI, + disabling that. The =amd_pstate= warning is back; the "Power + Profile" slider is no longer accessible in the systray widget; no + discernible effect in-game anyway. + +- Looking at Steam forums, [[https://steamcommunity.com/app/1145350/discussions/1/596260472619121965/][some folks]] do report FPS drops /shortly + after the update/: + #+begin_quote + it started fine after the major update, now suddenly im stuck with 40~50 fps with micro sutters + — December 6 2024 + #+end_quote + +- After AMD drivers & Mesa, figured I could look at vkd3d's issue + tracker. 
[[https://github.com/doitsujin/dxvk/issues/4436][doitsujin/dxvk#4436]] and + [[ValveSoftware/steam-for-linux#11446]] looked somewhat promising: + reports of lag on "KDE Tumbleweed Wayland", reported not long before + my symptoms began (November 2024); alas, ~LD_PRELOAD=~ does not + help. + - + #+begin_quote + Alternatively, remove the offending line in =/usr/share/drirc.d/00-radv-defaults.conf= + #+end_quote + + /discovers [[https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/00-radv-defaults.conf][=/usr/share/drirc.d/=]]/ + + Computers were a mistake. + +- Peeked at [[https://github.com/HansKristian-Work/vkd3d-proton/blob/master/.github/ISSUE_TEMPLATE/bug_report.md][vkd3d-proton's issue template]] and idly ran with + ~PROTON_LOG=1~. Over the course of 30 seconds or so, the log file + gets flooded with 3MB's worth of =trace:unwind:dump_unwind_info= 🤨 +*** This is insane +Selected subset of moving parts; "testability" considering ease of +clean reverts: + +| Part | Testability | +|--------------+-------------------------------------------------------------------------------------| +| Linux kernel | 🫣 [[https://en.opensuse.org/SDB:InstallNewerKernel][some distro documentation]]; afraid of side-effects | +| AMD drivers | 🤷 no clue; maybe inextricable from kernel? 
| +| Mesa | 😬 easy to recompile; hard to control transient state in cache & config folders | +| Steam | 🫥 under Steam's control | +| Wine | 🫥 under Steam's control | +| Proton | 👌 as long as I stick to versions under Steam's control; have not considered GE yet | +| vkd3d-proton | 🫥 under Steam's control | +| Hades Ⅱ | 🫥 under Steam's control | + +That's looking at software packages as individual blackboxes; +config-wise, worth noting: + +| Part | Testability | +|------------+-------------------| +| AMD pstate | 😬 UEFI roundtrip | +| sysfs | OK | + +Let's throw in: + +| Part | Testability | +|---------------+-----------------------------------| +| Mobo firmware | 🔥 reports of nuked boot settings | + |
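The by-hand checks from the CPU frequency scaling notes (which scaling driver is active, whether =acpi_cppc= shows up in sysfs) could be bundled into one function to rerun after each UEFI roundtrip. A minimal sketch, assuming the usual sysfs layout under =/sys/devices/system/cpu=; =cpufreq_summary= and its optional root argument are made up for illustration and are not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helper: summarise the cpufreq driver in use and whether
# the kernel exposes CPPC data, mirroring the manual checks in the notes.
# The optional first argument (an assumption of this sketch) overrides
# the sysfs root so the function can be pointed at a copy or fake tree.
cpufreq_summary() {
    root="${1:-/sys/devices/system/cpu}"

    # Collect the distinct scaling drivers across all cpufreq policies,
    # e.g. "acpi-cpufreq" or "amd-pstate-epp"; empty if none are readable.
    drivers=$(cat "$root"/cpufreq/policy*/scaling_driver 2>/dev/null \
        | sort -u | paste -sd ' ' -)
    echo "driver: ${drivers:-unknown}"

    # Per-CPU acpi_cppc directories are what the find(1) check looked for.
    if ls -d "$root"/cpu[0-9]*/acpi_cppc >/dev/null 2>&1; then
        echo "cppc: present"
    else
        echo "cppc: absent"
    fi
}

cpufreq_summary "$@"
```

Run without arguments it reads the live sysfs; passing a directory as first argument makes it inspect that tree instead, which is handy for comparing saved snapshots from before and after a firmware change.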
