For most of the times, debugging is easy – a silly logical error or an oversight, quick to correct and move on but some problems are hard – because you don’t know where it is wrong. The bigger the involved stack is, the higher are your chances of getting stuck at them. In such cases, tools like strace can be of great aid as they can help you give a hint about where things are getting screwed up.
Recently, I faced an issue at my company where navigating beyond the dashboard page resulted in navigation getting stuck.
The Problem
First a bit about our development environment – we use Vagrant to have a complete isolated environment which is similar to what we use in production. To provision our VM we use Puppet configuration management (although, we have been poor to keep it up to date with incorporation of newer technologies).
Now, we started facing a problem in which the PHP-FPM process just get hung and won’t process any other request for a long time. It would do load after a while, having the same problem when navigating to another page. “Annoying” would be an understatement to describe how exasperating this problem was.
The only solution was to run sudo service php5-fpm restart inside the VM everytime it happened (resulting in 503 Bad Gateway) and refresh the page again. Certainly, not a worthwhile solution but short of solutions, we aliased the command to have a quick fix.
Since the bug was not reproducible in production environment, our first suspect was vagrant or the local configuration of FPM. I tried, without success, to tweak some buffers and other settings of FPM and nginx but the problem was still persistent.
strace to the rescue
strace is a little linux utility whose most common use case is to prints a list of system calls made by the program. strace is an amazing tool for system admins that can aid in debugging in many cases, like diagnosing a problem of a running program which we will do in this case.
PHP-FPM uses master-slave model where the master process uses the workers to process the incoming requests. Since the actual processing of request can take place in any of the worker processes, we are going to use pidof command in combination with strace (source).
strace -f $(pidof php5-fpm | sed 's/\([0-9]*\)/\-p \1/g')After running the command and reproducing the problem, I noticed some peculiar timeouts in poll system calls. Let’s cut out the noise from output and just trace those calls instead.
strace -f $(pidof php5-fpm | sed 's/\([0-9]*\)/\-p \1/g') -epollThis gives the following output,
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid  2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)This can possible be the issue but what does fd 10 refer to in this? Let’s run lsof to know about it.
lsof -p 2443This will give us list of all files opened by the process (including the sockets):
php5-fpm 4373 www-data    0u  IPv4              16196      0t0     TCP localhost:9000 (LISTEN)
php5-fpm 4373 www-data    1u   CHR                1,3      0t0    5788 /dev/null
php5-fpm 4373 www-data    2u   CHR                1,3      0t0    5788 /dev/null
php5-fpm 4373 www-data    3ur  REG              252,0        0 4587530 /tmp/.ZendSem.LfeVSH (deleted)
php5-fpm 4373 www-data    4u  IPv4              55738      0t0     TCP localhost:9000->localhost:50475 (ESTABLISHED)
php5-fpm 4373 www-data    5uW  REG              252,0     2534 3805440 /var/lib/php5/sessions/sess_tmh0njqp2aid0jcfacsujdls75
php5-fpm 4373 www-data    6u  0000                0,9        0    5763 anon_inode
php5-fpm 4373 www-data    7u  unix 0xffff88007a2e1a00      0t0   55715 socket
php5-fpm 4373 www-data    8u  unix 0xffff88007a2e3740      0t0   55716 socket
php5-fpm 4373 www-data    9u  IPv4              55719      0t0     TCP localhost:49941->localhost:11210 (ESTABLISHED)
php5-fpm 4373 www-data   10u  IPv4              55823      0t0     TCP 10.0.2.15:49535->10.0.0.135:9200 (SYN_SENT)The last entry with fd 10 is what is of our interest. So the problem seems to be related to socket connection to 10.0.0.135.
Diagnosis
We had started using Elastic Search a while ago in our application and chose to have centralized local server for the development environment. The ES server powered our reporting. After a while, the majority of our work was being done on the other side of our app, so we lost focus over the reporting part. After certain while, we forgot about our local server and stopped using it altogether since we weren’t really working with it.
Whenever the dashboard opened, an Ajax request used to query ES Sever using the ElasticSearch PHP library. The problem as it happens was, this resulted in PHP5-FPM continuously polling the ES Server to get a reply. Empirically, she worker process got hung in this loop, and it failed to respond to any other request unless it was restarted.
Is it a bug? Well, I am not sure but I will try to dig more into it.
Lesson
Our valuable lesson in this ordeal was that the local environment needs to be an absolute replica of one being used in production and setup of your work environment must be completely automated using up-to-date scripts. A common mistake is to just rely on vaguely documented commands which pose of chance of missing something in the process.