From c72fba7717067a5887f41974b055a5ebf40eb1cf Mon Sep 17 00:00:00 2001 From: Kévin Le Gouguec Date: Sat, 18 Jan 2025 21:41:30 +0100 Subject: Split off "hardware" notes Into bona-fide hardware notes, and notes specific to desktop maintenance that bear increasingly little relation to "hard"ware. --- guides/sysadmin/machines/amdahl30/README.org | 1 + guides/sysadmin/machines/amdahl30/assembly.org | 43 +++ guides/sysadmin/machines/amdahl30/maintenance.org | 397 ++++++++++++++++++++++ 3 files changed, 441 insertions(+) create mode 100644 guides/sysadmin/machines/amdahl30/README.org create mode 100644 guides/sysadmin/machines/amdahl30/assembly.org create mode 100644 guides/sysadmin/machines/amdahl30/maintenance.org (limited to 'guides/sysadmin/machines') diff --git a/guides/sysadmin/machines/amdahl30/README.org b/guides/sysadmin/machines/amdahl30/README.org new file mode 100644 index 0000000..07cdc45 --- /dev/null +++ b/guides/sysadmin/machines/amdahl30/README.org @@ -0,0 +1 @@ +An [[https://www.ldlc.com/fiche/PB00227011.html][LDLC PC Zenifier-SSD]] running Tumbleweed since April 2021. diff --git a/guides/sysadmin/machines/amdahl30/assembly.org b/guides/sysadmin/machines/amdahl30/assembly.org new file mode 100644 index 0000000..873caa3 --- /dev/null +++ b/guides/sysadmin/machines/amdahl30/assembly.org @@ -0,0 +1,43 @@ +Lots of impedance mismatches between "documentation" and actual +hardware: +- CPU cooler (fan) has spring screws; diagrams show retention clips. + Had to dig into the [[https://www.amd.com/en/support/kb/faq/cpu-7][AMD knowledge base]] to find that some + motherboards come with "speculative" clips, which must be unscrewed + and removed in order to install the spring-screw cooler. +- Diagrams say to add thermal paste, but the fan already comes with a + pre-applied layer. +- Documentation shows RAM clips for both ends of the sticks; the + motherboard seems to only have clips for one end. +- =SYS_FAN1= header has 4 pins; front fan plug has 3 holes. 
The + Internet says it's fine ([[https://old.reddit.com/r/buildapc/comments/4139k8/3_pin_sys_fan2_vs_4_pin_sys_fan1/][[1]​]], [[https://forums.tomshardware.com/threads/sys_fan1-and-sys_fan2.3195778/][[2]​]]). +- Motherboard has [[https://www.msi.com/Motherboard/B550M-A-PRO/Specification]["8 mounting holes"]] but covers only 6 of the case's + standoffs; none of the diagrams in the case's manual match the + format of the motherboard. +- The diagram for inserting the power supply unit leaves a lot to the + imagination. +- The [[https://www.snia.org/forums/cmsi/knowledge/formfactors#U2][SSD dimension nomenclature]] is weird as hell. The SSD's user + manual seems to imply that I have a 2.5″ model, but my measuring + tape says the drive is 2.75″×3.875″ (diagonal 4.625″). +- The link to the LDLC guide for mounting the SSD is dead; the page is + [[https://web.archive.org/web/20170901191800/http://www.ldlc.com/guides/AL00000817/comment-installer-un-ssd-dans-un-pc/][archived]], and merely contains a link to a [[https://www.youtube.com/watch?v=t1dHVb6VuWU][video]]. No matter though, + since it does not describe how to mount the drive on a 2.5″ bay. +- The case user manual says to use specific screws for the SSD drive; + the SSD comes with its own set of screws. Are they meant for the + 3.5″ adapter? 🤷 + +For novices, some steps range from "not very reassuring" to "downright +hostile": +- The amount of force needed to connect the CPU fan's first two + diagonal screws is terrifying. +- The fan's case is asymmetric: one side has a small bump featuring + the maker's brand. If one does not pay attention when mounting the fan, + there is a 50% chance that this bump will get in the way of a RAM + stick. +- No instruction on [[https://www.youtube.com/watch?v=XAWNzd-gc3Q&t=74s][how to force that I/O shield in]]. +- No instruction on how to snap the motherboard into the I/O shield. +- Holy =$DEITY= that power supply unit has a *lot* of cables. 
And of + course I enthusiastically passed most of the small-headed ones + through the designated case hole, and had to pass them back out + because there was no room left to pass the 20-pin ATX connector. +- Power supply user manual was taped to the bubble wrap, so part of + the "warnings" section got torn off. diff --git a/guides/sysadmin/machines/amdahl30/maintenance.org b/guides/sysadmin/machines/amdahl30/maintenance.org new file mode 100644 index 0000000..538ee38 --- /dev/null +++ b/guides/sysadmin/machines/amdahl30/maintenance.org @@ -0,0 +1,397 @@ +* Front panel +The case's manual has a terse illustration with two arrows to pull the +front panel "away and up" from the rest of the case. + +Here too, the amount of force required to do that is terrifying. +Notice how [[https://www.youtube.com/watch?v=nUD0HyzVpLg][our friend here]] cuts abruptly at 8:17; that's because the +levels of violence required to tear that panel off are too graphic for +YouTube. +* Front fan +Remember that fan from earlier, the one with only 3 holes for the +motherboard's 4 pins? Turns out + +1. that last "optional" pin is supposed to allow speed control; + without it, the fan always spins at full speed; +2. the fan itself (ZA1225ASL) is [[https://www.youtube.com/watch?v=pd6gDY7LPlU][complete and utter crap]]: it cannot be + disassembled, so no cleaning off the dirt, no greasing. + +So the thing is loud, it always spins at full speed, and if one day it +decides to become even louder than usual, you're SOL. +* Motherboard +** Firmware updates +Quoth ~fwupdmgr get-devices~: + +#+begin_example +WARNING: UEFI capsule updates not available or enabled in firmware setup +See https://github.com/fwupd/fwupd/wiki/PluginFlag:capsules-unsupported for more information. +#+end_example + +Quoth the wiki: + +#+begin_quote +Most typically entering the firmware setup screen and enabling capsule +updates will cause this warning to disappear, and also make firmware +updates possible. 
The relevant option may be poorly labelled, for +example "allow Windows UEFI updates". +#+end_quote + +Not seeing any such option in the boot menu. + +#+begin_quote +It is possible, but unlikely, that flashing the latest vendor BIOS, +using either Windows or a LiveCD, will add support for [the thing that +correlates with capsule updates being enabled]. +#+end_quote + +Well then. [[https://www.msi.com/Motherboard/B550M-A-PRO/support#bios][Vendor says]] "put this on a stick; reboot; ask the menu to +flash from the stick". Putting some feelers out first: + +#+begin_quote +If you execute a UEFI update, this update might delete the existing +UEFI boot entries + +— [[https://wiki.archlinux.org/title/GRUB#Installation][ArchWiki]], 2024 +#+end_quote + +#+begin_quote +Like others in this forum, I too suffered from a reformatted EFI +partition following a BIOS update on my desktop pc. I had no idea +that the MSI BIOS team doesn’t care about Linux installs, so to my +surprise, following the update, my system booted straight to windows. + +[…] + +Ultimately, I completely wiped and recreated the EFI partition with +gparted (fat32), changed the structure to GPT with gdisk, and then +mounted that partition in the /mnt/efi location, and then proceeded to +generate a new fstab with genfstab. After arch-chroot’ing into my +endeavoros install, I ran bootctl install (which complained about boot +loader not setting esp information) and then reinstall-kernels. I +updated the loader.conf with the correct default boot ID, and set the +recommended options. That got me back into my system after quite a +bit of trial and error. + +— [[https://forum.endeavouros.com/t/endeavoros-efi-partition-wiped-by-msi-bios-update/54740][EndeavorOS forums]], May 2024 +#+end_quote + +#+begin_quote +when updating the bios, it cleared all my settings. Apparently, this +includes clearing the list of boot loaders, which it set back to the +default of just Windows. 
Sadly this bios does not provide the tools +to add boot entries as, apparently, some do. To fix it, I managed to +boot to a Linux live USB and add the missing entry using the efiboomgr +command line tool. + +— [[https://forum-en.msi.com/index.php?threads/updating-to-bios-7a32v1q1-wont-see-linux-uefi-boot.388109/][MSI AMD forums]], August 2023 +#+end_quote + +Welp. + +OT1H, I could dedicate a couple of week-ends learning the joys and +wonders of efibootmgr, gdisk & friends. OTOH I sort of like keeping +my desktop station… not bricked? + +Pity, because otherwise I've had smooth and incident-free firmware +updates on other stations with ~fwupdmgr~ 🤷 +* SSD +** Failure +On November 19 2024, LDLC's off-brand SSD died on me. RIP. +Re-installed Tumbleweed on the replacement (Kingston SA400S3) on +November 28. Since then… +*** Performance loss +Getting uncannily reproducible frame drops (60 ↘ 40±10, movement +visibly choppy) in Hades Ⅱ when moving toward effects/particles-heavy +areas. No idea WTF, those areas ran fine before. + +- "High" graphics setting at native 1920×1080 resolution. + - Tried "Low" graphics, lowered resolution, disabled vsync: symptoms + persist. +- Not forcing any "compatibility tool" version, assuming this yields + "Proton Experimental". + - Tried a couple of old Proton versions: symptoms persist. +- Reinstalled game & nuked everything under + - =~/.cache/mesa_shader_cache*= + - =~/.cache/radv_builtin_shaders*= + - =~/.config/unity3d= + - =~/.local/share/Steam= + - =~/.local/share/vulkan/= + - =~/.steam*= + in case "stale shaders" were to blame or something. +- Tumbleweed/Plasma/Wayland session. + - Tried X11: symptoms persist. +- Reducing noise with =balooctl6 suspend=, =swapoff -a= (RAM nowhere + near exhausted). + +Well then. +**** CPU frequency scaling? +Started by noticing that the Plasma "Power Management" tray widget +says "Power Profile" is "Not available". 
Not 100% sure whether that +was the case with the old installation; maybe I had had something +configured or installed to enable this? + +Internet says "install and enable power-profiles-daemon", except +that's on: + +#+begin_example +$ systemctl status power-profiles-daemon.service +● power-profiles-daemon.service - Power Profiles daemon + Loaded: loaded (/usr/lib/systemd/system/power-profiles-daemon.service; disabled; preset: disabled) + Active: active (running) since Sun 2024-12-01 11:46:32 CET; 45min ago + Invocation: b2545a02bc9642b7aeb5f370e8b50e7c + Main PID: 2289 (power-profiles-) + Tasks: 4 (limit: 18320) + CPU: 52ms + CGroup: /system.slice/power-profiles-daemon.service + └─2289 /usr/libexec/power-profiles-daemon +#+end_example + +But: + +#+begin_example +$ powerprofilesctl +,* balanced: + PlatformDriver: placeholder + + power-saver: + PlatformDriver: placeholder +#+end_example + +Internet says I am missing the right scaling driver, and seems very +keen on enabling =amd_pstate=, which I do not seem to have available: + +#+begin_example +$ cpupower frequency-info +analyzing CPU 5: + driver: acpi-cpufreq + CPUs which run at the same hardware frequency: 5 + CPUs which need to have their frequency coordinated by software: 5 + maximum transition latency: Cannot determine or is not supported. + hardware limits: 1.40 GHz - 3.70 GHz + available frequency steps: 3.70 GHz, 1.70 GHz, 1.40 GHz + available cpufreq governors: ondemand performance schedutil + current policy: frequency should be within 1.40 GHz and 3.70 GHz. + The governor "schedutil" may decide which speed to use + within this range. 
+ current CPU frequency: Unable to call hardware + current CPU frequency: 3.30 GHz (asserted by call to kernel) + boost state support: + Supported: yes + Active: no + +$ zcat /proc/config.gz | grep -i pstate +CONFIG_X86_INTEL_PSTATE=y +CONFIG_X86_AMD_PSTATE=y +CONFIG_X86_AMD_PSTATE_DEFAULT_MODE=3 +# CONFIG_X86_AMD_PSTATE_UT is not set +#+end_example + +=/proc/config.gz= suggests the kernel configuration supports it, but +=cpupower= does not seem to know about it. =dmesg= offers: + +#+begin_example +$ sudo dmesg -H +[…] amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled +#+end_example + +Though: + +#+begin_example +$ lscpu | grep -i cppc +Flags: […] cppc […] +#+end_example + +So ACPI problem? Lots of posts mentioning =amd_= parameters on the +kernel command-line but AFAIU those are stale with newer kernels (6.11 +here) which automatically (attempt to) load the =amd_pstate= driver. + +Went through the UEFI menu and found nothing related to ACPI or +[[https://forum.level1techs.com/t/amd-p-state-driver/197885/24][X2APIC]]. Skeptical about UEFI settings anyway, since I did not change them +between the old and new installations. + +/Some time later/ + +Probably not ACPI, =dmesg= is chock-full of ACPI noise. OTOH, using +some diagnosis methods from [[https://bugzilla.kernel.org/show_bug.cgi?id=218171][this kernel bug report]]: + +#+begin_example +$ find /sys/devices -name '*cppc*' +🦗 +#+end_example + +(=acpidump ; acpixtract ; iasl ; grep -i cpc *.dsl= also yields 🦗, +but =iasl= complains about "unresolved" "control methods", so 🤷) + +/Some time later/ + +[[https://wiki.archlinux.org/title/CPU_frequency_scaling#amd_pstate][ArchWiki]] does say "Change /Enable CPPC/ […] from /Auto/ to /Enabled/". +My UEFI menu tucks that under /Overclocking → Advanced CPU +Configuration → AMD CBS → CPPC CTRL/. 
That change *does* convince +Linux to enable =amd_pstate=; going over the previous tests in reverse +order: + +#+begin_example +$ [… acpidump && acpixtract && iasl … ] && grep -i cpc *.dsl +ssdt1.dsl: Name (_CPC, Package (0x17) // _CPC: Continuous Performance Control +[… repeats 12 times …] + +$ find /sys/devices -name '*cppc*' -o -name '*pstate*' | tr -s '[:digit:]' N | sort -u +/sys/devices/system/cpu/amd_pstate +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_highest_perf +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_hw_prefcore +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_lowest_nonlinear_freq +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_max_freq +/sys/devices/system/cpu/cpufreq/policyN/amd_pstate_prefcore_ranking +/sys/devices/system/cpu/cpuN/acpi_cppc + +$ sudo dmesg -H +[… ominous silence about amd_pstate …] + +$ cpupower frequency-info +analyzing CPU 1: + driver: amd-pstate-epp + CPUs which run at the same hardware frequency: 1 + CPUs which need to have their frequency coordinated by software: 1 + maximum transition latency: Cannot determine or is not supported. + hardware limits: 400 MHz - 4.31 GHz + available cpufreq governors: performance powersave + current policy: frequency should be within 2.38 GHz and 4.31 GHz. + The governor "powersave" may decide which speed to use + within this range. + current CPU frequency: Unable to call hardware + current CPU frequency: 3.57 GHz (asserted by call to kernel) + boost state support: + Supported: yes + Active: yes + AMD PSTATE Highest Performance: 255. Maximum Frequency: 4.31 GHz. + AMD PSTATE Nominal Performance: 219. Nominal Frequency: 3.70 GHz. + AMD PSTATE Lowest Non-linear Performance: 141. Lowest Non-linear Frequency: 2.38 GHz. + AMD PSTATE Lowest Performance: 24. Lowest Frequency: 400 MHz. 
+ +$ powerprofilesctl + performance: + CpuDriver: amd_pstate + Degraded: no + +,* balanced: + CpuDriver: amd_pstate + PlatformDriver: placeholder + + power-saver: + CpuDriver: amd_pstate + PlatformDriver: placeholder +#+end_example + +And lo, the 🍃↔🚀 slider appears in the Power Management tray widget. + +Nervous about entering the "Overclocking" UEFI zone tho, and concerned +about these "Maximum frequencies". + +/And does it even help with the game?/ + +🥁 + +No. No it does not; no discernible difference in FPS nor vibes. + +Will assume this new baseline cannot hurt - OT1H "overclocking" is +scary, OTOH Linux now has a finer handle on the CPU and hopefully will +not overwork it to death? +**** Sᴇᴠᴇʀᴀʟ Wᴇᴇᴋꜱ Lᴀᴛᴇʀ +- [[https://www.gamingonlinux.com/forum/topic/5475/page=1/][ridge reports]] "bad frame pacing on ADMGPU", + - when vsync is turned off: a non-factor in my testing, + - lots of useful information in that thread tho and + interesting-sounding pointers, + - [[https://www.gamingonlinux.com/forum/topic/5475/page=2/#r42519][Shmerl]] says: + - games can cause stutter by underloading the GPU, causing it to + drop out of "high performance mode", + - (=amdgpu_top= and =radeontop= do confirm that lag spikes + correlate with GPU usage drop) + - see [[https://gitlab.freedesktop.org/drm/amd/-/issues/1500][drm/amd#1500]]: + - /lots/ of sysfs noodling there; unfortunately, none of the + suggested settings for =power_dpm_force_performance_level= & + =pp_power_profile_mode= change the symptoms. + +- In [[https://gitlab.freedesktop.org/drm/amd/-/issues/3618#note_2689087][this drm/amd#3618 thread]], @agd5f suggests "6.11 stable kernels" + include a fix for the issue at hand there and a further rework "was + submitted to 6.13"; @mattipulkkinen reports happy results with + 6.13-rc2 (FTR, symptoms persist here with 6.12.8). 
+ +- Piggybacked onto [[https://gitlab.freedesktop.org/mesa/mesa/-/issues/11300][mesa/mesa#11300]]: + - common: Hades Ⅱ, iGPU, recent kernel & Mesa, Proton Experimental, + - differences: Fedora, GNOME, X11, + - noteworthy: good performance on Windows, + - suggestion by @Venemo: downgrade & bisect Mesa; + - tempting, though scared of bricking graphical sessions and/or + ending up with a frankensystem (installing binaries under a + prefix is probably easy, but then keeping track of config tweaks + and cache artifacts sounds fraught). + +- In [[https://gitlab.freedesktop.org/upower/power-profiles-daemon/-/issues/164][upower/power-profiles-daemon#164]], @Nyan reports problematic iGPU + capping; not convinced this is applicable though, given the reported + symptoms (video playback is fine here). + +- Seen reports of Variable Refresh Rate causing problems: + - searched high and low to understand why VRR appears nowhere in + Plasma settings, despite the start menu turning up "Display + Configuration" when searching for "VRR", + - mystery solved by ~kscreen-doctor -o~: =Vrr: incapable= 🤷 + +- [[https://www.techpowerup.com/forums/threads/what-fixed-stuttering-and-random-framerate-spikes-in-games-for-me.327264/][aska33j proclaims]] that /disabling CPPC/ "fixed stuttering and random + framerate spikes in games for [them]" so… roundtrip to UEFI, + disabling that. The =amd_pstate= warning is back; the "Power + Profile" slider is no longer accessible in the systray widget; no + discernible effect in-game anyway. + +- Looking at Steam forums, [[https://steamcommunity.com/app/1145350/discussions/1/596260472619121965/][some folks]] do report FPS drops /shortly + after the update/: + #+begin_quote + it started fine after the major update, now suddenly im stuck with 40~50 fps with micro sutters + — December 6 2024 + #+end_quote + +- After AMD drivers & Mesa, figured I could look at vkd3d's issue + tracker. 
[[https://github.com/doitsujin/dxvk/issues/4436][doitsujin/dxvk#4436]] and + [[ValveSoftware/steam-for-linux#11446]] looked somewhat promising: + reports of lag on "KDE Tumbleweed Wayland", reported not long before + my symptoms began (November 2024); alas, ~LD_PRELOAD=~ does not + help. + - + #+begin_quote + Alternatively, remove the offending line in =/usr/share/drirc.d/00-radv-defaults.conf= + #+end_quote + + /discovers [[https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/00-radv-defaults.conf][=/usr/share/drirc.d/=]]/ + + Computers were a mistake. + +- Peeked at [[https://github.com/HansKristian-Work/vkd3d-proton/blob/master/.github/ISSUE_TEMPLATE/bug_report.md][vkd3d-proton's issue template]] and idly ran with + ~PROTON_LOG=1~. Over the course of 30 seconds or so, the log file + gets flooded with 3MB's worth of =trace:unwind:dump_unwind_info= 🤨 +**** This is insane +Selected subset of moving parts; "testability" considering ease of +clean reverts: + +| Part | Testability | +|--------------+-------------------------------------------------------------------------------------| +| Linux kernel | 🫣 [[https://en.opensuse.org/SDB:InstallNewerKernel][some distro documentation]]; afraid of side-effects | +| AMD drivers | 🤷 no clue; maybe inextricable from kernel? 
| +| Mesa | 😬 easy to recompile; hard to control transient state in cache & config folders | +| Steam | 🫥 under Steam's control | +| Wine | 🫥 under Steam's control | +| Proton | 👌 as long as I stick to versions under Steam's control; have not considered GE yet | +| vkd3d-proton | 🫥 under Steam's control | +| Hades Ⅱ | 🫥 under Steam's control | + +That's looking at software packages as individual blackboxes; +config-wise, worth noting: + +| Part | Testability | +|------------+-------------------| +| AMD pstate | 😬 UEFI roundtrip | +| sysfs | OK | + +Let's throw in: + +| Part | Testability | +|---------------+-----------------------------------| +| Mobo firmware | 🔥 reports of nuked boot settings | + -- cgit v1.2.3