Watchdog setup for Supermicro server

EPYC server processors are really nice, when they work. However, lately with kernel 6.2 I started getting dreadful “CPU stuck” errors that lead to a hanging system. A normal person might revert to an older kernel. Me? I decided to turn on the watchdog.

In case you don’t know, a watchdog is a piece of functionality that, once turned on, requires your system to notify it every once in a while that it’s still alive. If the notification is not received within a given time interval, the system is assumed to be stuck and gets rebooted. Best of all, this is done at the hardware level, so no hanging application or CPU will prevent it.

To confuse things a bit, my Supermicro M11SDV-4CT-LN4F server seems to have two watchdog systems. One is part of the EPYC platform itself and is controlled via a BIOS setting. That one has a 5-minute interval and, no matter what, I couldn’t get it working properly. I mean, I could get it running but, since there was no easy way to reset it, the system would reboot every 5 minutes regardless.

The second watchdog is part of the AST2500 chipset that handles other IPMI functions. This one is well supported from the Linux command line using the ipmitool utility. To see its status, just ask ipmitool for that information:

ipmitool mc watchdog get

But there is no option to turn it on. However, one can always send raw commands, and I was fortunate to see that somebody already did. Without getting into too much detail, the last two numbers in the string of hexadecimal values are the only thing you generally want to change: the time interval, given in 0.1 s units with the low byte first. In the example below, I decided to go for 610 seconds (6100 units, or 0x17D4).

ipmitool raw 0x06 0x24 0x04 0x01 0x00 0x00 0xD4 0x17

This will start a ticking bomb that will, if not defused within the given interval, reboot your computer. So, why did I select 10 minutes and 10 seconds? As with many things, this was completely subjective.

Well, no matter what, I wanted this watchdog not to interfere with my normal server operation. Since a normal reboot takes about 5 minutes, I wanted to have 5 minutes on the counter even if I reboot the system myself just before the watchdog would be reset. So, if I select a 10-minute interval and reset it every 5 minutes, this gives me the 5 minutes of extra time I might need for a reboot. But why the extra 10 seconds? Well, in case I mess with my watchdog settings and miss the reset at the 5-minute mark, I wanted to give it an extra chance to reset at the 10-minute mark without having to deal with a reboot race condition.
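For a different timeout, the two countdown bytes at the end of the raw command can be worked out with a bit of shell arithmetic. This is only a minimal sketch: watchdog_bytes is a hypothetical helper name, and it assumes the 0.1 s units and low-byte-first order used in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a timeout in seconds into the two
# countdown bytes (low byte first) expected at the end of the raw command.
watchdog_bytes() {
    local seconds=$1
    local ticks=$(( seconds * 10 ))          # countdown is in 0.1 s units
    if (( ticks < 1 || ticks > 0xFFFF )); then
        echo "timeout out of range" >&2      # only 16 bits are available
        return 1
    fi
    printf '0x%02X 0x%02X\n' $(( ticks & 0xFF )) $(( ticks >> 8 ))
}

watchdog_bytes 610   # prints: 0xD4 0x17
```

The range check reflects the 16-bit counter, which tops out at a bit over 109 minutes.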

And how might one actually set up the watchdog and its reset within Linux? Well, crontab, of course. These two entries were all it took:

@reboot /usr/bin/ipmitool raw 0x06 0x24 0x04 0x01 0x00 0x00 0xD4 0x17

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/bin/ipmitool mc watchdog reset

This will turn on the watchdog upon every system reset (and yes, once the watchdog goes off, you do need to manually turn it back on), and every 5 minutes the system will reset its counter if nothing goes awry.

Simple and effective.

ZFS Encryption Speed (Ubuntu 23.10)

There is a newer version of this post

As has become customary, I retest ZFS native encryption performance with each new Ubuntu release. Here we have results for Ubuntu 23.10 on kernel 6.5 using ZFS 2.2.

Testing was done on a Framework laptop with an i5-1135G7 processor and 64GB of RAM. Once booted into the installation media, I execute a script that creates a 42 GiB RAM disk hosting six 6 GiB files. Those files are then used in a RAIDZ2 configuration to create a ZFS pool. The process is repeated multiple times to test all the different native ZFS encryption modes in addition to a LUKS-based test. The whole process is then repeated with AES disabled and, later, with a reduced core count.
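For illustration, the setup described above could be sketched as a dry run along these lines. It only prints commands rather than running them, and it is not the actual test script: the pool name, mount point, and key file are hypothetical, and the benchmark and LUKS steps are left out.

```shell
#!/usr/bin/env bash
# Dry-run sketch: every command is echoed, nothing is executed.
# RAMDISK, POOL, and KEYFILE are hypothetical names used for illustration.
RAMDISK=/mnt/ramdisk
POOL=testpool
KEYFILE=/tmp/zfs-test.key

# Print the pool creation and teardown commands for one encryption mode
print_pool_setup() {
    local mode=$1 files="" i
    for i in 1 2 3 4 5 6; do
        files="$files $RAMDISK/disk$i.img"
    done
    echo "zpool create -O encryption=$mode -O keyformat=passphrase -O keylocation=file://$KEYFILE $POOL raidz2$files"
    echo "zpool destroy $POOL"
}

# RAM disk large enough for six 6 GiB backing files
echo "mount -t tmpfs -o size=42G tmpfs $RAMDISK"
for i in 1 2 3 4 5 6; do
    echo "truncate -s 6G $RAMDISK/disk$i.img"
done

# One pool per native encryption mode
for mode in aes-128-ccm aes-192-ccm aes-256-ccm aes-128-gcm aes-192-gcm aes-256-gcm; do
    print_pool_setup "$mode"
done
```

Printing instead of executing keeps the sketch safe to run on any machine; dropping the echo wrappers would require root and a working ZFS installation.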

Since I am testing on the same hardware as for 23.04 and using essentially the same script, I expected similar results, but I was slightly surprised to see that both raw read and write speeds have dropped by more than 10%. I am not sure if this is due to the new kernel, the new BIOS, or some other combination of changes, but the performance hit seems quite significant.

However, what I’m most interested in is not necessarily the actual speed but how it’s impacted by encryption. As compared to last year, it seems each GCM encryption mode has taken a few percent hit. We’re still talking about 2 GiB/s for both read and write so I’m not too worried.

Interestingly, while key size had more impact before, it seems that with 23.10 you can count on the same speed regardless of whether you select a 128, 192, or 256-bit key.

If you don’t have AES support and you need CCM, the news is not that good as that code path has gotten significantly worse. Unless you’re stuck on an ancient CPU this is irrelevant I guess as you should never opt for CCM in the first place.

Using ZFS on top of LUKS has gotten slightly better when it comes to writes, which is where it lagged the most behind native ZFS encryption. The improvement is significant, but we’re still talking about 30% lower speeds. On the read side, there are no changes, and it’s the only area where LUKS wins over native ZFS encryption.

For this release, I also experimentally tried to get power usage for each test run. I did so by disconnecting the battery and measuring the power the laptop was drawing. This is not the most precise way of measuring it, so I might be off, but it looked as if ZFS encryption was as efficient as it gets when it comes to power usage.

To summarize, native ZFS encryption is still alive and kicking in Ubuntu 23.10 and might even provide some power usage advantages compared to LUKS.


PS: You can find older tests for Ubuntu 23.04, 22.10, 22.04, 20.10, and 20.04.

Counting Geigers

A long while ago I got myself a Geiger counter, soldered it together, stored it in a drawer, and forgot about it for literally more than a year. However, once I found it again, I did the only thing that could be done - I connected it to a computer and decided to track its values.

My default approach would usually be to create an application to track it. But, since I connected it to a Linux server, it seemed appropriate to merge this functionality into a bash script that already sends ZFS data to my telegraf server.

Since my Geiger counter has UART output, the natural way of collecting its output under Linux was a simple cat command:

cat /dev/ttyUSB0

That gave me a constant output of readings, about once a second:

CPS, 0, CPM, 12, uSv/hr, 0.06, SLOW

CPS, 1, CPM, 13, uSv/hr, 0.07, SLOW

CPS, 0, CPM, 13, uSv/hr, 0.07, SLOW

CPS, 1, CPM, 14, uSv/hr, 0.07, SLOW

CPS, 1, CPM, 15, uSv/hr, 0.08, SLOW

CPS, 0, CPM, 15, uSv/hr, 0.08, SLOW

To make these lines easier to parse, I wanted two things. First, to remove the extra empty lines caused by the CRLF line endings, and second, to have all values separated by a single space. A simple adjustment sorted that out:

cat /dev/ttyUSB0 | sed '/^$/d' | tr -d ','

While this actually gave me a perfectly usable stream of data, it would never exit. It would just keep showing new data. Perfectly suitable for some uses, but I wanted my script just to take the last reading once a minute and be done with it. And no, you cannot just use the tail command alone for this - it needs to be combined with something that will stop the stream - like timeout.

timeout --foreground 1 cat /dev/ttyUSB0 | sed '/^$/d' | tr -d ',' | tail -1

If we place this into a variable, we can extract specific values - I just ended up using uSv/hr, but the exact value might depend on your use case.

OUTPUT=$(timeout --foreground 1 cat /dev/ttyUSB0 | sed '/^$/d' | tr -d ',' | tail -1)
CPS=$(echo "$OUTPUT" | awk '{print $2}')
CPM=$(echo "$OUTPUT" | awk '{print $4}')
USVHR=$(echo "$OUTPUT" | awk '{print $6}')
MODE=$(echo "$OUTPUT" | awk '{print $7}')

With those variables in hand, you can feed whatever upstream data sink you want.
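As one possible sink format, a reading could be turned into an InfluxDB line protocol record, which telegraf understands. This is just a sketch under a few assumptions: to_line_protocol is a hypothetical helper, the measurement name "geiger" and the field names are made up for illustration, and how the record actually reaches the server is left out. A canned sample stands in for the serial port so it can be tried without the hardware.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn one cleaned-up reading into a line protocol record.
# Measurement name "geiger" and the field names are illustrative assumptions.
to_line_protocol() {
    local cps_label cps cpm_label cpm usv_label usv mode
    local int_re='^[0-9]+$' float_re='^[0-9]+(\.[0-9]+)?$'
    read -r cps_label cps cpm_label cpm usv_label usv mode <<< "$1"
    # Reject partial or garbled lines (e.g. a reading cut off by timeout)
    [[ "$cps_label" == "CPS" && "$cpm_label" == "CPM" && -n "$mode" ]] || return 1
    [[ "$cps" =~ $int_re && "$cpm" =~ $int_re && "$usv" =~ $float_re ]] || return 1
    echo "geiger,mode=$mode cps=${cps}i,cpm=${cpm}i,usvhr=$usv"
}

# Same cleanup pipeline as above, fed with a sample instead of /dev/ttyUSB0
SAMPLE=$(printf 'CPS, 1, CPM, 14, uSv/hr, 0.07, SLOW\n\n' | sed '/^$/d' | tr -d ',' | tail -1)
to_line_protocol "$SAMPLE"
# prints: geiger,mode=SLOW cps=1i,cpm=14i,usvhr=0.07
```

The validation step matters because a one-second capture can end in the middle of a line, in which case it is better to skip the sample than to send a broken record.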

Randomizing Serial Number During MPLAB X Build

Quite often, especially when dealing with USB devices, a serial number comes in really handy. One example is my TmpUsb project. If you don’t update its serial number, it will still work when only one is plugged in. But plug in two, and all kinds of shenanigans will ensue.

This is the exact reason why I already created a script to randomize its serial number. However, that script had one major fault - it didn’t work under Linux. So, it came time to rewrite the thing and maybe adjust it a bit.

The way the original script worked was as a post-build step in MPLAB X project that would patch the Intel Hex object file and that’s something I won’t change as it integrates flawlessly with programming steps.

The resulting serial number was 12 hexadecimal digits in length (48 bits of random data) and that was probably excessive. Even worse, that led to an overly complicated script that was essentially a C# program to patch the hex file after the modification was done, since any randomization always impacted more than one line. As I wanted a solution that could work anywhere Bash can run (even Windows), I wanted to make my life easier by limiting any change to a single line.

Well, first things first, to limit the serial number length, I had to figure out how much of a serial number can fit in one line. Looking at the Intel hex produced by MPLAB X, we can see that each line holds 16 bytes of data, which means any serial number intended for USB consumption can be up to 8 characters. However, what if other code pushes the serial number further down the line? Well, now you get only a single character.

What we need is a way to fix the serial number location. The solution to this is the __at keyword. Using it, we can align our string descriptors wherever we want them. In the TmpUsb example, that would be something like this:

const struct {
    uint8_t bLength;
    uint8_t bDscType;
    uint16_t string[7];
} sd003 __at(0x1000) = {
    sizeof(sd003),
    USB_DESCRIPTOR_STRING,
    { '2','8','4','4','3','4','2' }
};

The whole USB descriptor structure has to fit into 16 bytes so as to limit any subsequent modification to a single line. The first 2 bytes are the length and type bytes, leaving us with 14 bytes for our serial. Since USB likes 16-bit Unicode, this means we have 7 characters to play with. If we stay in the hexadecimal realm, this provides 28 bits of randomization. Not bad, but we can slightly improve on it by expanding character selection a bit.

That’s where base32 comes in. It’s a nice enough encoding that isn’t case sensitive and it omits most easily confused characters. And yes, it would take 40 bits to fully utilize base32 but trimming it at 7 will leave you with 35 bits which is plenty.

How do I get this serial number in an easy way? Well, getting 5 bytes using dd and then passing it to base32 will do.

NEW_SERIAL=`dd if=/dev/urandom bs=5 count=1 2>/dev/null | base32 | cut -c 1-7`

If we pass the non-random serial number in, with some mangling to get it expanded, it is trivial to swap it with the new one using sed:

LINE=`cat "$INPUT" | grep "$SERIAL_UNICODE"`
NEW_LINE=`echo -n "$LINE" | sed "s/$SERIAL_UNICODE/$NEW_SERIAL_UNICODE/g"`
sed -i "s/$LINE/$NEW_LINE/g" "$INPUT"

But no, this is no good. Changing the content of a line will invalidate the checksum byte that comes at its end. To make this work, we need to adjust that checksum. Thankfully, the checksum is just the sum of all preceding bytes, modulo 256, negated (two’s complement) as the last step. Something like this:

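# Rebuild the record byte by byte while summing it; the loop skips the
# leading ':' and stops before the old two-digit checksum
NEW_NEW_LINE=":"
CHECKSUM=0
for ((i = 1; i < ${#NEW_LINE} - 2; i += 2)); do
    BYTE_HEX="${NEW_LINE:$i:2}"
    NEW_NEW_LINE="$NEW_NEW_LINE$BYTE_HEX"
    BYTE_VALUE=$(printf "%d" 0x$BYTE_HEX)
    CHECKSUM=$(( (CHECKSUM + BYTE_VALUE) % 256 ))
done
# Two's complement of the sum, kept to a single byte
CHECKSUM_HEX=`printf "%02X" $(( (256 - CHECKSUM) % 256 ))`
# NEW_NEW_LINE is the record that gets written back to the hex file
NEW_NEW_LINE="$NEW_NEW_LINE$CHECKSUM_HEX"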
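To make the two transformations above easier to check in isolation, here is a small self-contained sketch. The function names to_utf16_hex and ihex_checksum are hypothetical and not necessarily what the actual script uses; the first shows the "mangling" that expands an ASCII serial number into the 16-bit form found in the hex file, and the second computes the checksum for a record given without its last byte.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand ASCII text into UTF-16LE hex digits,
# i.e. each character's code followed by a zero byte.
to_utf16_hex() {
    local text=$1 out="" i
    for ((i = 0; i < ${#text}; i++)); do
        out+=$(printf '%02X00' "'${text:i:1}")
    done
    echo "$out"
}

# Hypothetical helper: Intel HEX checksum for a record passed without
# its trailing checksum byte (the leading ':' is optional).
ihex_checksum() {
    local record=${1#:} sum=0 i
    for ((i = 0; i < ${#record}; i += 2)); do
        sum=$(( (sum + 16#${record:i:2}) & 0xFF ))
    done
    printf '%02X\n' $(( (0x100 - sum) & 0xFF ))   # two's complement, one byte
}

to_utf16_hex "2844342"      # prints: 3200380034003400330034003200
ihex_checksum ":00000001"   # prints: FF (the standard end-of-file record)
```

The expanded serial is 28 hex digits, or 14 bytes, which together with the 2 header bytes fills exactly one 16-byte line.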

Once all this is combined into a script, we can call it by giving it a few arguments, most notably, project directory (${ProjectDir}), image path (${ImagePath}), and our predefined serial number (2844342):

bash "${ProjectDir}/../package/randomize-usb-serial.sh"
  "${ProjectDir}"
  "${ImagePath}"
  "2844342"

Script is available for download here but you can also see it in action as part of TmpUsb.

MPLABX and the Symbol Lookup Error

After performing the latest Ubuntu upgrade, my MPLAB X v6.15 installation stopped working. It would briefly display the splash screen and then crash. When attempting to run it manually, I encountered a symbol lookup error in libUSBAccessLink_3_38.so.

$ /opt/microchip/mplabx/v6.15/mplab_platform/bin/mplab_ide

/opt/microchip/mplabx/v6.15/sys/java/zulu8.64.0.19-ca-fx-jre8.0.345-linux_x64/bin/java: symbol lookup error: /tmp/mplab_ide/mplabcomm4864006927221691126/mplabcomm5312997795113373971libUSBAccessLink_3_38.so: undefined symbol: libusb_handle_events_timeout_completed

Since there were a few links to the old MPLAB directory, I cleaned up /usr/lib a bit, but that didn’t help. What did help was removing the older libusb altogether.

sudo rm /usr/local/lib/libusb-1.0.so.0

With that link out of the way, everything started working once again.
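Before deleting anything, it may help to confirm which copies of libusb the dynamic linker knows about and whether each one exports the missing symbol. The following is a rough sketch rather than the exact steps I took: exports_symbol is a hypothetical helper, and it assumes GNU nm and ldconfig are available and that the stray copy is listed in the linker cache.

```shell
#!/usr/bin/env bash
SYMBOL=libusb_handle_events_timeout_completed

# Hypothetical helper: returns 0 if the library exports the symbol,
# 1 if it does not, and 2 if the file does not exist.
exports_symbol() {
    local lib=$1 symbol=$2
    [ -e "$lib" ] || return 2
    nm -D --defined-only "$lib" 2>/dev/null | grep -qw "$symbol"
}

# Check every libusb-1.0 copy present in the linker cache
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | awk '/libusb-1\.0\.so/ { print $NF }' | sort -u | while read -r lib; do
        if exports_symbol "$lib" "$SYMBOL"; then
            echo "ok:      $lib"
        else
            echo "missing: $lib"
        fi
    done
fi
```

A copy reported as missing that sits in a directory searched ahead of the system one, such as /usr/local/lib, would be a likely culprit.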