Programming in C#, Java, and god knows what not

Welford's Algorithm

Most of the time, general statistics calculations are easy. Take all data points you have, find the average, the standard deviation, and you have 90% of stuff you need to present a result in a reasonable and familiar manner. However, what if you have streaming data?

Well, then you have a Welford’s method. This magical algorithm enables you to calculate both average and standard deviation as data arrives without wasting a lot of memory accumulating it all.

So, of course, I wrote a C# code for it. To use it, just add something like this:

var stats = new WelfordVariance();
while(true) {
    stats.Add(something.getValue());
    output.SetMean(stats.Mean);
    output.SetStandardDeviation(stats.StandardDeviation);
}

Algorithm is not perfect and will sometime slightly differ from classically calculated standard deviation but it’s generally within a spitting distance and uses minimum of memory. It’s hard to find better bang-for-buck when it comes to large datasets.

Better Pseudorandom Numbers

Browsing Internet out of boredom usually brings a lot of nonsense. However, it occasionally also brings a gem. This time I accidentally stumbled upon a family of random algorithm called xoshiro/xoroshiro.

Pseudo-random generators fell out of favor lately as it proper cryptographically-secure algorithms became ubiquitous on modern computers (and often supported by processors RNG). For cases where pseudo-random generators are better fit, most programming languages already include Mersenne twister allowing generation of reasonable randomness.

But that doesn’t mean research into a better (pseudo)randomness has stopped. From that research comes whitepaper named Scrambled linear pseudorandom number generators. Paper alone goes over the algorithms in detail but authors were also kind enough to provide PRNG shootout page giving a practical advise.

After spending quite a few hours with these, I decided that only thing missing is a C# variant of the same. So I created it.

Links to source and NuGet package are here.

Uniform Distribution from Integer in C#

When dealing with random numbers, one often needs to get a random floating point number between 0 and 1 (not inclusive). Unfortunately, most random generators only deal with integers and/or bytes. Since bytes can be easily converted to integers, question becomes: “How can I convert integer to double number in [0..1) range?”

Well, assuming you start with 64-bin unsigned integer, you can use something like this:

ulong value = 1234567890; //some random value
byte[] buffer = BitConverter.GetBytes((ulong)0x3FF << 52 | value >> 12);
return BitConverter.ToDouble(buffer, 0) - 1.0;

With that in mind you you can see that 8-byte (64-bit) buffer is filled with double format combined of (almost) all 1's in exponent and the fraction portion containing random number. If we take that raw buffer and convert it into a double, we’ll get a number in [1..2) range. Simply substracting 1.0 will place it in our desired [0..1) range. It’s as good as distribution of a double can be (i.e. the maximum number of bits - 56 - are used).

This is as good as uniform distribution can get using 64-bit double.


PS: If we apply the same principle to the float, equivalent code will be something like this (assuming a 32-bit random uint as input):

uint value = 1234567890; //some random value
byte[] buffer = BitConverter.GetBytes((uint)0x7F << 23 | value >> 9);
return BitConverter.ToSingle(buffer, 0) - 1.0;

PPS: C#'s Random class uses code that’s probably a bit easier to understand:

return value * (1.0 / Int32.MaxValue);

Unfortunately, this will use only 31 bits for distribution (instead of 52 available in double). This will cause statistical anomalies if used later to scale into a large integer range.

Splitting the Baby

For one of my hardware projects, I decided to try doing things a bit differently. Instead using a single repository, I decided to split it into two - one containing Firmware and other containing Hardware.

Since repository already had those as a subdirectories, I though using --subdirectory-filter as recommended on GitHub would solve it all. Unfortunately, that left way too many commits not touching either of those files. So I decided to tweak procedure a bit.

I first removed all the files I didn’t need using --index-filter. On that cleaned-up state I applied --subdirectory-filter just to bring directory to the root. Unfortunately, while preserving tags was possible, it proved to be annoying enough to actually remove them all and manually retag all once done.

As discussed above, on the COPY of original repository we first remove all files/directories that are NOT Hardware and then we essentially move Hardware directory to the root level of newly reorganized repository.

git remote rm origin
git filter-branch --index-filter \
    'git rm -rf --cached --ignore-unmatch .gitignore LICENSE.md PROTOCOL.md README.md Firmware/' \
    --prune-empty --tag-name-filter cat -- --all
rm -Rf .git/logs .git/refs/original .git/refs/remotes .git/refs/tags
git filter-branch --prune-empty --subdirectory-filter Hardware main
rm -Rf .git/logs .git/refs/original
git gc --prune=all --aggressive
git log --pretty --graph

With Hardware repository sorted, I did exactly the same process for Firmware with the new COPY of original repository, only changing the directory names.

git remote rm origin
git filter-branch --index-filter \
    'git rm -rf --cached --ignore-unmatch .gitignore LICENSE.md PROTOCOL.md README.md Hardware/' \
    --prune-empty --tag-name-filter cat -- --all
rm -Rf .git/logs .git/refs/original .git/refs/remotes .git/refs/tags
git filter-branch --prune-empty --subdirectory-filter Firmware main
rm -Rf .git/logs .git/refs/original
git gc --prune=all --aggressive
git log --pretty --graph

Once I got two repositories, it was easy enough to combine them. I personally love using subtrees but submodules have their audience too.

Using Null in the Face of CS0121

Sometime you might want to use null as a parameter but you get the annoying CS0121.

The call is ambiguous between the following methods or properties…

One could just adjust constructor but that would be an easy way out.

Proper way would be to use default operator. For example, if you want to use null string, you can use something like default(string):

var dummy = new Dummy(default(string));

If we’re using nullable reference types, just add question mark:

var dummy = new Dummy(default(string?));

And this simple trick selects the correct constructor every time.

Parsing GZip Stream Without Looking Back

Some files can exist in two equivalent forms - compressed and uncompressed. One excellent example is .pcap. You can get it as standard .pcap we all know and love but it also comes compressed as .pcap.gz. To open a compressed file in C#, you could pass it to GZipStream - it works flawlessly. However, before doing that you might want to check if you’re dealing with compressed or uncompressed form.

Check itself is easy. Just read first 2 bytes and, if they’re 0x1F8B, you’re dealing with a compressed stream. However, you just consumed 2 bytes and simply handing over file stream to GZipStream will no longer work. If you are dealing with file on a disk, just seek backward and you’re good. But what if you are dealing with streaming data and seeking is not possible?

For .pcap and many more transparently compressed formats, you can simply decide to skip into bread-and-butter of encryption - deflate algorithm. You see, GZip is just a thin wrapper over deflate stream. And quite often it only has a fixed size header. If you move just additional 8 bytes (thus skipping a total of 10), you can use DeflateStream and forget about “rewinding.”

Wanna see example? Check constructor of PcapReader class.

SignTool and Error -2146869243/0x80096005

As I was trying out my new certificate, I got the following error:

SignTool Error: An unexpected internal error has occurred.
Error information: "Error: SignerSign() failed." (-2146869243/0x80096005)

Last time I had this error, I simply gave up and used other timeserver. This time I had a bit more time and wanted to understand from where the error was coming. After a bit of checking, I think I got it now. It’s the digest algorithm.

SignTool still uses SHA-1 as default. Some servers (e.g. timestamp.digicert.com) are ok with that. However, some servers (e.g. timestamp.comodoca.com and timestamp.sectigo.com) are not that generous. They simply refuse to use weak SHA-1 for their signature.

Solution is simple - just add /td sha256 to the list of codesign arguments.

Merging Two Git Repositories

As I went onto rewriting QText, I did so in the completely new repository. It just made more sense that way. In time, this new code became what the next QText version will be. And now there’s a question - should I still keep it in a separate repository?

After some thinking, I decided to bring the new repository (QTextEx) as a branch in the old repository (QText). That way I have a common history while still being able to load the old C# version if needed.

All operations below are to executed in the destination repository (QText).

The first step is to create a fresh new branch without any common history. This will ensure Git doesn’t try to do some “smart” stuff when we already know these repositories are unrelated.

git switch --discard-changes --orphan ^^new^^

This will leave quite a few files behind. You probably want to clean those before proceeding.

The next step is to add one repository into the other. This can be done by adding remote into destination, pointing toward the source. Remote can be anywhere but I find it easiest if I use it directly from my file system. After fetching the repository, a simple merge is all it takes to get the commits. Once that’s done, you can remove the remote.

git remote add ^^QTextEx^^ ^^../QTextEx^^
git fetch --tags ^^QTextEx^^
git merge --ff --allow-unrelated-histories ^^QTextEx^^/main
git remote remove ^^QTextEx^^

After a push, you’ll see that all commits from the source repository are now present in destination too.


PS: Do make backups and triple verify all before pushing it upstream.

PPS: This is a permanent merge of histories. Consider using subtree if you want to keep them separate.

PPPS: Things get a bit more complicated if you want to transfer multiple branches from the other repository.

Background Worker in Qt

Coming from C#, Qt and its C++ base might not look the friendliest. One example is ease of BackgroundWorker and GUI updates. “Proper” way of creating threads in Qt is simply a bit more involved.

However, with some lambda help, one might come upon solution that’s not all that different.

#include <QFutureWatcher>
#include <QtConcurrent/QtConcurrent>
``…``
QFutureWatcher watcher = new QFutureWatcher<bool>();
connect(watcher, &QFutureWatcher<bool>::finished, [&]() {
    //do something once done
    bool result = watcher->future().result();
});
QFuture<bool> future = QtConcurrent::run([]() {
    //do something in background
});
watcher->setFuture(future);

QFuture is doing the heavy lifting but you cannot update UI from its thread. For that we need a QFutureWatcher that’ll notify you within the main thread that processing is done and you get to check for result and do updating then.

Not as elegant as BackgroundWorker but not too annoying either.

Stripping Diacritics in Qt

As someone dealing with languages different than English, I quite often need to deal with diacritics. You know, characters such as č, ć, đ, š, ž, and similar. Even in English texts I can sometime see them, e.g. voilà. And yes, quite often you can omit them and still have understandable text. But often you simply cannot because it changes meaning of the word.

One place where this often bites me is search. It’s really practical for search to be accent insensitive since that allows me to use English keyboard even though I am searching for content in another language. Search that would ignore diacritics would be awesome.

And over the time I implemented something like that in all my apps. As I am building a new QText, it came time to implement it in C++ with a Qt flavor. And, unlike C#, C++ was not really built with a lot of internationalization in mind.

Solution comes from non-other than now late Michael S. Kaplan. While his blog was deleted by Microsoft (great loss!), there are archives of his work still around - courtesy of people who loved his work. His solution was in C# (that’s how I actually rembered it - I already needed that once) and it was beautifully simple. Decompose unicode string, remove non-spacing mark characters, and finally combine what’s left back to a unicode string.

In Qt’s C++, that would be something like this:

QString stripDiacritics(QString text) {
    QString formD = text.normalized(QString::NormalizationForm_D);

    QString filtered;
    for (int i = 0; i &lt; formD.length(); i++) {
        if (formD.at(i).category() != QChar::Mark_NonSpacing) {
            filtered.append(formD.at(i));
        }
    }

    return filtered.normalized(QString::NormalizationForm_C);
}