Debugging C/C++-based Python Code

When things go south in Python, it usually gives us a sensible stack trace that helps us track down the issue. However, when you are invoking Python code that depends on a C/C++ library and things go awry deep inside, you are often left with nothing more than an unhelpful “segmentation fault” or “core dumped” message.

If you are able to get your hands on the C/C++ source code, there is a quick and easy fix for this. Compile your C code with the debug flag (e.g. with CMake you add -DCMAKE_BUILD_TYPE=Debug), which will build your code in debug mode. How do you know it worked? If you run objdump --source <your_built_binary> you will find the C source code embedded in the sea of assembly code.
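
As a rough sketch, assuming a CMake-based project configured from a build directory (libmylib.so is just a placeholder for whatever shared object your Python code loads), the debug build and the sanity check could look like this:

% cmake -DCMAKE_BUILD_TYPE=Debug ..
% make
% objdump --source libmylib.so | less

If you invoke g++ directly instead of CMake, passing -g (and ideally -O0) achieves the same thing.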

Now you can run gdb --args python <arg1> <arg2> <arg3> to start gdb, then type r to kick off your program. When it stops at the crash, type bt to see the wonderful stack trace you have been craving.
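
A minimal session might look like the following (my_script.py and its flag are placeholders for your actual invocation):

% gdb --args python my_script.py --some-flag
(gdb) run
(gdb) bt

Once bt prints the backtrace, the frame <n> and print <variable> commands let you jump into a particular frame and inspect its locals.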

Caveat: there is no silver bullet; a stack trace does not guarantee that your issue will be found easily. In my case, the crash did not happen at the memory access violation itself; it manifested much later. So be cautious when you are using C!

ODR of C++

Recently in my project I bumped into a weird issue: we vend a framework that is used by another team’s daemon, but as soon as their daemon is linked to our framework, even without calling into it, it stops working in a bizarre way. The daemon would launch and run, but soon die mysteriously, leaving no trace to debug. We had multiple hypotheses, including extra memory usage causing the daemon to be killed, and security violations caused by linking to a new framework. Several days were spent on this issue fruitlessly, until I finally realized that both our framework and the consumer daemon were built from a common set of C++ source files. Per discussion with a coworker, this might trigger calling unintended implementations due to C++’s one definition rule (ODR). I modified the common source code used by our framework by moving it into a different namespace, and the issue was immediately gone. This reminds me once again how tricky C++ can be in areas that are easily neglected.

To illustrate the issue we encountered with a simple example, consider a source file a.cpp which is

#include <iostream>

void say() {
    std::cout << "hi, a is saying" << std::endl;
}

and main.cpp which contains the main function and utilizes the function defined in a.cpp

void say();

int main() {
    say();
    return 0;
}

We can build the binary using the following commands and run it without issues

% g++ -dynamiclib -fPIC a.cpp -o liba.so; g++ -L. -la main.cpp -o main
% ./main
hi, a is saying

However, if we now have a source file called b.cpp which redefines the symbol say in a different way, as

#include <iostream>

void say() {
    std::cout << "hi, b is saying" << std::endl;
}

and we build it into a shared library

% g++ -dynamiclib -fPIC b.cpp -o libb.so

Now rebuild main.cpp again, this time also linking against libb.so without actually using it. You will find that the runtime behavior changes.

% g++ -L. -lb -la main.cpp -o main
% ./main
hi, b is saying

Why is this? In hindsight, it is not hard to explain with C++’s ODR: both library a and library b expose the same (mangled) symbol __Z3sayv, compiled from the say function. When it is invoked by the main binary, one implementation will be chosen and the other shadowed. Which one? It seems that on my machine, whichever library appears first in the link order wins. A more detailed explanation can be found at https://en.wikipedia.org/wiki/One_Definition_Rule

% nm -gU liba.so libb.so

liba.so:
0000000000003028 T __Z3sayv
0000000000003bf0 T __ZNSt3__111char_traitsIcE11eq_int_typeEii
0000000000003c18 T __ZNSt3__111char_traitsIcE3eofEv
0000000000003318 T __ZNSt3__111char_traitsIcE6lengthEPKc
0000000000003124 T __ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m
00000000000030cc T __ZNSt3__14endlIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_
0000000000003058 T __ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc

libb.so:
0000000000003014 T __Z3sayv
0000000000003110 T __Z7b_startv
0000000000003bf0 T __ZNSt3__111char_traitsIcE11eq_int_typeEii
0000000000003c18 T __ZNSt3__111char_traitsIcE3eofEv
0000000000003318 T __ZNSt3__111char_traitsIcE6lengthEPKc
0000000000003124 T __ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m
00000000000030b8 T __ZNSt3__14endlIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_
0000000000003044 T __ZNSt3__1lsINS_11char_traitsIcEEEERNS_13basic_ostreamIcT_EES6_PKc
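
To see the link-order dependence for yourself, flip -la and -lb when linking main; based on the behavior described above, the program should print liba’s message again (this is only a sketch of the experiment; the exact tie-breaking may differ across toolchains and linkers):

% g++ -L. -la -lb main.cpp -o main
% ./main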

The issue might happen more often than people realize: in most cases, when a symbol is defined in multiple libraries, it is because the same source files were duplicated, so whichever implementation gets picked does not affect runtime behavior. In our case, however, we had modified the implementation to suit our needs in different projects, leading to the phenomenon that the program can run, but not in the expected way!
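
For completeness, here is a minimal sketch of the fix we applied: the framework’s copy of the shared code moves into its own namespace, so the two definitions no longer collapse onto the same mangled symbol. The namespace name framework_internal is made up for illustration.

#include <iostream>

namespace framework_internal {

// same helper as before, but the namespace changes the mangled name
// (roughly __ZN18framework_internal3sayEv), so it no longer collides
// with the daemon's __Z3sayv
void say() {
    std::cout << "hi, a is saying" << std::endl;
}

}  // namespace framework_internal

An anonymous namespace, or compiling with -fvisibility=hidden for symbols that are not part of the framework’s public interface, would have a similar effect.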

The Four

I just finished reading “The Four”, written by NYU business school professor Scott Galloway. Overall, I am quite impressed by the insights offered in this book. They are in-depth and thought-provoking, well deserving of best-seller status. The language used in this book is sometimes explicit, but witty throughout. There are a few points on which I feel the urge to offer some commentary.

  1. The book places heavy emphasis on the market value of the four companies. Market value is a popular metric, but it is also subject to high volatility and oftentimes hype-driven. I felt kind of bored after reading the astronomical numbers a few times, especially when a company’s market value is compared to a country’s annual GDP (a comparison I don’t think even makes semantic sense). After all, market value is a medal won after some remarkable achievement is made, not the reason the achievement happened in the first place.
  2. The author was once in upper management at the NYT, and his personal perspective on how Google destroyed the newspaper’s ad business is eye-opening. However, his proposal to stop letting Google crawl the newspaper’s content, plus forming a content-producer alliance among media outlets to resist Google’s invasion, would not have saved the industry in my opinion. Google’s ability to collect comprehensive web results and display them to users fast is its killer feature. The NYT’s content is valuable, but too easily dwarfed by the content produced by the masses in a gig economy. Turning down Google prevents it from eating a slice of your pie, but it also means no pie may be delivered to you at all, and you will be the first to starve.
  3. The analogy of Apple to a luxury brand like Hermes is something I totally disagree with. Nor does Apple win by appealing to people’s sexual desire, as the author claims. In my understanding, a luxury brand is something you still crave to own at an unreasonable expense even though there are functionally equivalent substitutes at a normal price, and Apple clearly does not fall under this category in the present world. Outsiders often attribute Apple’s success to its aesthetic design; very few mention that the design boosts the product’s durability, longevity and ease of use. Illustrating the company’s high margin by subtracting the iPhone’s material cost from its sales price is a cliché that should not appear in a book written by a business school professor. It would be much more interesting to discuss why Apple is so good at adding value to commodity electronic parts and squeezing vendors to get the most cutting-edge supplies.
  4. I had mixed feelings about the book’s stance in favor of breaking up the four to allow benign competition and increase overall shareholder value across society. I don’t doubt the benefits. However, there is a reason the big four emerged in the U.S. and not elsewhere. Punishment based on size seems unfair and could undermine America’s competitiveness in the long run.

Overall, I found the book pretty fun to read despite its snobbish title and cover (understandable, as the book might not have circulated and reached the masses without them). Even if you have some prior understanding of the Four, it crystallizes your ideas and stimulates more thinking.

Do internal Wiki right

One inevitable part of working in tech is writing and reading Wiki. If your product depends on some other team’s products, you will most likely need to read other teams’ Wiki. But a lot of Wiki is hard to follow and easily loses focus among unimportant clutter.

One common pitfall is providing seemingly one-click solutions without explaining the nature of what you offer. Too often we see words like “if you are an XYZ-typed service, just add ABC as your dependency and then DEF in your configuration file”. This seemingly foolproof guide decays over time as your product and the products depending on it evolve, and eventually becomes so obsolete that it is useless. Another disadvantage is that you need to enumerate all types of customers, which is extremely hard, if not impossible, to achieve. Even if you can cover all use cases today, there will be emerging scenarios that you cannot keep up with. A much better approach is to not assume that your customers are fools, and to provide the essential information along with a few sample use cases (your tech colleagues are not that dumb after all). For instance, if you are a service provider: what is your endpoint and which network does it belong to; what is your protocol; how do you authenticate; what input data format do you expect and what data format do you return. Even if you only vend a client library that is super awesome and hides such details underneath, you should still provide enough information that when your customers’ execution environment changes and things don’t work as expected, they know what the issue could be; and when your vended software has bugs, which is 100% guaranteed, your customers can pinpoint them quickly without having to waste time debugging for you.

Another disappointing pattern is offering too little information in the Wiki body while adding lengthy FAQs. For sure, addenda are hard to avoid when you are describing complex things. But a disproportionately large FAQ section suggests that you cannot write logically enough to teach your audience in a structured manner. Some FAQs, hilariously, contain questions such as “why we built this thing” and “what are the key parameters of our product”, which belong in the overview or introduction in the first place. Finding valuable information in a long FAQ is a game of chance. If there is highly valuable information in the FAQ section, it probably belongs somewhere in the Wiki body; the less critical information that lies in FAQs should be safe to omit without hurting the reader’s understanding of your product. I often see intimidating FAQ sections that contain over 100 questions, and the only way I can make use of them is to search for an exact keyword match. If I cannot find such a match, I close the browser tab with a sigh.

A third type of bummer is a poorly maintained Wiki. We often hear the doctrine that “whoever reads the Wiki and finds inaccurate or outdated information should update it”, which is well intentioned but in practice lets the Wiki deteriorate over time if no regular maintenance is applied. A crowd-sourced Wiki is error prone because it is seldom peer reviewed, and chaotic in style because many people don’t check whether their newly added one-liner is coherent with the rest of the content. A Wiki should be an authoritative source of information, so it needs to be constantly audited, improved and sometimes completely refactored by a designated owner.

With all of the above said, I certainly acknowledge that writing a high-quality Wiki in the tech industry is very hard. The things we document mutate quickly and we are all busy. Devoting time to writing internal Wiki takes a lot of effort but often receives little credit from management. But from a strategic perspective, there should be no excuse for under-investing in this type of work. Keeping your internal documentation as good as your external documentation would save a lot of wasted effort and greatly improve your teams’ productivity.

Encryption/Decryption

I was playing with the AWS KMS CLI to do some encryption and decryption. It took me a little while to get a reciprocal command pair to work. As a starter, here is what it looks like

$ aws kms encrypt --region eu-west-1 --key-id f627d4c3-926a-4188-8dc8-64114bd7d7ae --plaintext HiIAmPlaintext  --output text --query CiphertextBlob | base64 --decode > ciphertext.txt
$ aws kms decrypt --region eu-west-1 --ciphertext-blob fileb://ciphertext.txt --output text --query Plaintext | base64 --decode
HiIAmPlaintext

By following the AWS manual it is not difficult to make sense of these commands, but it is a little odd that Base64 decoding happens in both the encryption and the decryption directions.

The raw output of encrypt is Base64-encoded, which is quite understandable and reasonable, as we often want to copy-and-paste the result or transmit it over HTTP. But why do we need to provide the input in binary when doing decrypt? That causes the hassle of converting the encryption result into binary before it can be decrypted later.
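
One way to live with this is to keep the stored artifact in its copy-and-paste friendly Base64 form and only decode it into a binary file right before decryption. This is just a sketch that reorganizes the commands above; the file names ciphertext.b64 and ciphertext.bin are made up for illustration.

$ aws kms encrypt --region eu-west-1 --key-id f627d4c3-926a-4188-8dc8-64114bd7d7ae --plaintext HiIAmPlaintext --output text --query CiphertextBlob > ciphertext.b64
$ base64 --decode ciphertext.b64 > ciphertext.bin
$ aws kms decrypt --region eu-west-1 --ciphertext-blob fileb://ciphertext.bin --output text --query Plaintext | base64 --decode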