Preparing for the SRE technical interview

2019-02-16

TechLinux


It is always fun and interesting to review what we know every time we prepare for a technical interview. As time goes by, we explore new technologies that give us new things to learn and use, and we might forget the “basic things” that we use less often. In this post, I am going to list the fundamental knowledge that we definitely need for the SRE (Site Reliability Engineer) interview. If you are a DevOps engineer, System Engineer or System Administrator, this post is also useful for you :-).

Operating System

Surveys show that Linux is more popular than Windows in the server operating system market. Many enterprises use Linux today, so Linux is definitely an important operating system that every SRE should know. Even if you mainly work with Microsoft technologies, it is good to know about Linux and its open source tools. I don’t use Windows much, so I am only going to write about Linux and its related tooling and techniques :-).

Learning materials

There are 2 books that I highly recommend:

The Linux Programming Interface: A Linux and UNIX System Programming Handbook by Michael Kerrisk
Advanced Programming in the UNIX Environment, 3rd Edition by W. Richard Stevens and Stephen A. Rago

However, reading those big books takes time. You can also take a look at this blog for a quick tour of the Linux kernel internals.

Linux Process

What is the difference between a process and a thread?

Process: Each process provides the resources needed to execute a program. A process has a virtual address space, executable code, open handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.

Thread: A thread is an entity within a process that can be scheduled for execution. All threads of a process share its virtual address space and system resources. In addition, each thread maintains exception handlers, a scheduling priority, thread local storage, a unique thread identifier, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread’s set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread’s process. Threads can also have their own security context, which can be used for impersonating clients.

In short, the typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.

You can also read more About Processes and Threads on MSDN.

What is a system call?

A system call is the fundamental interface between a process and the operating system. System calls are usually made when a process in user mode requires access to a resource: the process asks the kernel to provide the resource via a system call.

System calls fall into 5 main categories:

Process Control: Working with processes, such as creation, termination, etc. Sample system calls: fork(), exit(), wait(), etc.
File Management: Working with files on the system with operations such as open, read, write, close, etc. Sample system calls: open(), read(), write(), close(), etc.
Device Management: Working with hardware devices with operations such as read and write. Sample system calls: ioctl(), read(), write(), etc.
Information Maintenance: Responsible for transferring information between the user application and the operating system. Sample system calls: getpid(), alarm(), nanosleep(), etc. (sleep() itself is a library function built on top of nanosleep()).
Communication: Responsible for interprocess communication (IPC). Sample system calls: pipe(), shmget(), mmap(), etc.

How is a process created with the `fork` system call?

The fork system call creates a new process, called the child process, which runs concurrently with the process that called fork(), called the parent process. After the child process is created, both processes continue with the next instruction following the fork() call. The child starts as a copy of the parent: it has the same pc (program counter) value, the same CPU register values and copies of the file descriptors that were open in the parent.

If fork() succeeds, it returns twice: 0 in the child process and the child's PID in the parent process. If it fails, -1 is returned in the parent and no child process is created.

Sample C program to fork new processes:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>   /* fork() */
int main()
{
    fork();   /* 1 process becomes 2 */
    fork();   /* 2 processes become 4 */
    printf("hello world\n");
    return 0;
}

The above code prints hello world 4 times because we end up with 4 processes in total. They are created as described below.

fork ();   // Line 1
fork ();   // Line 2

      L1       // There will be 1 child process
    /    \     // created by line 1.
  L2      L2   // There will be 2 child processes
               // created by line 2.

=> Total number of processes = 2^n, where n is the number of fork() calls

Explain a few process states

In Linux, every process has a state. From the shell, you can run the ps aux command and check the STAT column to see each process’s state. The possible values are:

D: Uninterruptible sleep. A process that cannot be interrupted by a signal, usually because it is waiting for I/O.
R: Running or runnable (on run queue). The process is either running on a CPU or waiting for a CPU to run it.
S: Interruptible sleep. The process is waiting for an event to complete, such as keyboard input.
T: Stopped. A process that has been suspended, for example by a job control signal.
W: Paging (not valid since the 2.6.xx kernel).
X: Dead. We should never see it.
Z: Defunct (zombie) process. The process has terminated but has not been reaped by its parent.
<: High-priority (not nice to other users).
N: Low-priority (nice to other users).
L: Has pages locked into memory (for real-time and custom IO).
s: Process is a session leader.
l: Process is multi-threaded.
+: Process is in the foreground process group.

Why does a process have the Defunct (Zombie) state?

A process stays in the zombie state when its parent has not yet called wait() for it, so the child process’s entry still exists in the process table. We usually find long-lived zombies in applications that are not programmed properly.

In fact, every child process first becomes a zombie before being removed from the process table. When the parent process reads the exit status of the child with wait(), the child’s entry is reaped from the process table. This usually happens so quickly that we do not notice it in the STAT column of the ps aux output.

Simple C program to create a zombie process:

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
int main()
{
    // Child PID is returned in parent process
    pid_t child_pid = fork();
    // Parent process
    if (child_pid > 0)
        sleep(100);
    // Child process
    else
        exit(0);
    return 0;
}

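What are Linux signals?

Simply speaking, signals are a way to deliver asynchronous events to an application. For instance, when we press CTRL+C while running an application, a signal (SIGINT) is sent to the application’s process to terminate it.

In Linux, every signal has a name that begins with the characters SIG and a numeric value. The following are the standard signals. Where three values are listed, the first one is usually valid for Alpha and SPARC, the middle one for x86, ARM and most other architectures, and the last one for MIPS: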

Signal     Value     Action   Comment
──────────────────────────────────────────────────────────────────────
SIGHUP        1       Term    Hangup detected on controlling terminal or death of controlling process
SIGINT        2       Term    Interrupt from keyboard
SIGQUIT       3       Core    Quit from keyboard
SIGILL        4       Core    Illegal Instruction
SIGABRT       6       Core    Abort signal from abort(3)
SIGFPE        8       Core    Floating-point exception
SIGKILL       9       Term    Kill signal
SIGSEGV      11       Core    Invalid memory reference
SIGPIPE      13       Term    Broken pipe: write to pipe with no readers; see pipe(7)
SIGALRM      14       Term    Timer signal from alarm(2)
SIGTERM      15       Term    Termination signal
SIGUSR1   30,10,16    Term    User-defined signal 1
SIGUSR2   31,12,17    Term    User-defined signal 2
SIGCHLD   20,17,18    Ign     Child stopped or terminated
SIGCONT   19,18,25    Cont    Continue if stopped
SIGSTOP   17,19,23    Stop    Stop process
SIGTSTP   18,20,24    Stop    Stop typed at terminal
SIGTTIN   21,21,26    Stop    Terminal input for background process
SIGTTOU   22,22,27    Stop    Terminal output for background process

From the Linux shell, we can use the kill command to send a signal to a running process. For instance, kill -9 14112 sends the SIGKILL signal to forcibly kill the process that has PID 14112.

Linux File systems

What is an inode?

An inode is a data structure in Unix that contains metadata about a file. An inode stores the following attributes, most of which can be retrieved with the stat system call:

mode
owner (UID, GID)
size
atime, ctime, mtime
ACLs
list of blocks where the data is stored

Note: The filename is present in the parent directory’s structure, not in the inode.

What is the difference between a soft link and a hard link?

A hard link shares the same inode number as the source file, while a soft link has its own, different inode number and only stores the path of the source.

As a result:

If the source file is moved, the soft link will be broken but the hard link will still work fine.
A hard link can only be created within the same filesystem, while a soft link can point to a file on another filesystem.

Networking

ARP

ARP stands for Address Resolution Protocol. When you try to ping an IP address on your local network, say 192.168.1.1, your system has to turn the IP address 192.168.1.1 into a MAC address. This involves using ARP to resolve the address, hence its name.

If the IP address is not found in the ARP table, the system will then send a broadcast packet to the network using the ARP protocol to ask who has 192.168.1.1. Because it is a broadcast packet, it is sent to a special MAC address that causes all machines on the network to receive it. Any machine with the requested IP address will reply with an ARP packet that says I am 192.168.1.1, and this includes the MAC address which can receive packets for that IP.

To broadcast unsolicited (gratuitous) ARP reply packets that update the ARP tables of the machines on the local network, we can use the arping command. For instance: arping -A -I eth0 192.168.1.123.

TCP/IP

This topic is more about theory. However, the better you understand it, the better you can work with network protocols, troubleshooting and network programming.

Understand the basics of the TCP/IP model. How does it compare to the OSI model?
Understand how TCP and UDP work and how they are designed.

DHCP

DHCP stands for Dynamic Host Configuration Protocol. It is a network management protocol used on UDP/IP networks, whereby a DHCP server dynamically assigns an IP address and other network configuration parameters to each device on a network so that they can communicate with other IP networks.

The DHCP client and server communicate over UDP in the following steps:

HOST (68/udp)                  SERVER (67/udp)
    -----------DHCP discover---------->
    <----------DHCP offer--------------
    -----------DHCP request----------->
    <----------DHCP ACK----------------

DNS

DNS stands for Domain Name System. It is a hierarchical and decentralized naming system for computers, services, or other resources connected to the Internet or a private network.

To understand how DNS works, you can take a look at this awesome comic website. Basically, these are the DNS query steps after you enter example.com in your browser’s address bar:

Step 1: The browser checks your machine’s DNS cache and hosts file.
Step 2: If there is no local answer, your machine asks the configured DNS servers. If the DNS server has example.com in its cache, it returns the result; otherwise it continues with the following steps.
Step 3: The DNS server queries a root name server, which returns the list of authoritative name servers for the relevant top-level domain (.com, .org, .net, etc.).
Step 4: The authoritative Top-Level Domain (TLD) name server returns the list of authoritative name servers (NS) for the example.com domain (e.g. ns1.example.com, ns2.example.com).
Step 5: The DNS server queries one of those name servers to get the IP address for example.com.

Note: In step 4, the name server is itself a sub-domain of the domain that we want to resolve, which creates a circular dependency. So in this case, the response that lists the name servers must include an additional section with the IP addresses of those name servers. These are called glue records.

Routing

Understanding how routing protocols work helps you troubleshoot network issues more easily. Although we are not going to become Network Engineers, it is good to know the basics of routing protocols and when and why we should use them.

Here are the routing methods we should care about:

Static routing is the simplest routing method and is supported in Linux by default. I highly recommend that you learn and understand how it works.

In Linux, the command route -n shows the main routing table, which is selected by a default ip rule. You should also take a look at Linux Policy Based Routing to learn about routing tables and rules.

Explain how the traceroute command works

Traceroute is a program that shows us the route taken by packets through a network. Understanding how it works helps us examine problems that might cause network communication issues between hosts in our network.

The following steps happen when we run traceroute:

Step 1: Traceroute sends a UDP packet from the source to the destination with TTL = 1. This UDP packet reaches the first router (hop), which decrements the TTL by 1, making the packet’s TTL = 0, and hence the packet gets dropped.
Step 2: Having dropped the packet, the router sends an ICMP Time Exceeded message back to the source.
Step 3: Traceroute makes a note of the router’s address and the time taken for the round-trip.
Step 4: Traceroute sends two more packets in the same way, so there are three round-trip times per hop. Usually, the first round-trip takes longer than the other two because of the ARP lookup for the physical address; after that, the address stays in the ARP cache.
Step 5: These steps are repeated until the destination is reached, with the TTL incremented by 1 each time so that the packet expires one router (hop) further along the path.
Step 6: Once the destination is reached, no ICMP Time Exceeded message is sent back. Instead, the destination host replies with an ICMP Destination Unreachable (Port Unreachable) message, because traceroute deliberately uses a destination UDP port that is unlikely to be in use, so no application is listening on it.
Step 7: Traceroute receives the Destination Unreachable message and understands that the destination has been reached. Then the result is printed to the user.

Tools & Troubleshooting

Troubleshooting is a large topic. There are several scenarios we might be asked about during the interview. Below, I will try to list a few useful tools that could help in a troubleshooting interview.

dstat

dstat is a powerful tool for generating Linux system resource statistics. With this single tool you can quickly monitor system resource utilization such as CPU, memory, disk, network, etc.

$ dstat
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   0  99   0   0   0|  95k   15k|   0     0 | 960B  936B|  84   171
  2   0  98   0   0   0|4096B    0 |3080B 6057B|   0     0 | 124   260
  1   1  88  10   0   0|  92k   19M| 770B 1448B|   0     0 | 169   335
  1   0  99   0   0   0|   0     0 |1555B 2510B|   0     0 |  92   192
  0   1  99   0   0   0|  76k    0 |1224B 1920B|   0     0 | 109   283
  0   0 100   0   0   0|   0     0 | 280B  876B|   0     0 |  88   187

strace

opensnoop

netcat / telnet

Send HTTP request to a web server to retrieve the HTML content

$ telnet 139.99.99.54 80
Trying 139.99.99.54...
Connected to ndk.name.
Escape character is '^]'.
GET / HTTP/1.1
Host: ndk.name

There are 3 important inputs we need to specify:

GET / HTTP/1.1: We want to get the index page at / using HTTP version 1.1.
Host: ndk.name: A web server might serve several websites, so the Host header tells it which website we want.
An empty line at the end (press Enter twice after the Host line), which marks the end of the request headers.

netstat

netstat is helpful for showing the server’s network connections, routing tables, interface statistics, etc. It is usually used to find the listening ports on the server. For instance:

$ sudo netstat -tulnp

The above command lists all the processes that are listening on UDP and TCP ports on the server. The p option shows the PID and name of the process that owns each socket. If you don’t need it, sudo is not needed in the command.

tcpdump / wireshark

ngrep

ngrep helps you search and filter the network packets going through your network interface. It can look for a regular expression in the payload of each packet and show the matching packets on the console. Unlike tcpdump and Wireshark, ngrep provides a query syntax which is more readable and easier to understand.

Sample queries:

$ ngrep -q 'port 25'
$ ngrep -q 'HTTP'
$ ngrep -q 'HTTP' 'udp'
$ ngrep -q 'HTTP' 'host 172.16'
$ ngrep -q 'HTTP' 'dst host 172.16'
$ ngrep -q 'HTTP' 'src host 172.16'
$ ngrep -d any "domain*.com" port 80
$ ngrep -d any -i "password|username" port 80

openssl

Back to the example of using telnet to retrieve a website: if that website is served over HTTPS, we have to speak the TLS/SSL protocol, so openssl is what we need.

$ openssl s_client -connect 139.99.99.54:443
<the certificate will be shown here>
---
GET / HTTP/1.1
Host: ndk.name

We can also use openssl to show the remote server’s certificates:

$ openssl s_client -showcerts -servername ndk.name -connect 139.99.99.54:443

perf

Perf is a performance analysis tool for Linux. Like the strace tool above, perf can show the system calls executed by a process, and it can do much more. Unlike strace, perf adds very little overhead, so it barely slows down our program during profiling.

On a heavily loaded machine with CPU utilization at 100%, we want to quickly figure out the problem. The following command gives an overview of where the CPU time is going:

$ sudo perf top

If we want to use perf for a specific program, we can use:

$ sudo perf record <command>

Then press CTRL + C to end the profiling. A perf.data file will be saved in the current directory. To see the result, run:

$ sudo perf report

We can also use perf to look at a specific event, such as a system call, with the -e option. To list all supported events (including the system call tracepoints), run:

$ sudo perf list

The following example records the sys_enter_connect system call to find the functions that open outgoing network connections.

$ sudo perf record -e syscalls:sys_enter_connect -ag

flamegraphs

Programming

Memory management: Stack vs Heap

Stack: The stack stores values in the order it gets them, and removes the values in the opposite order. This is referred to as last in, first out. The stack is fast because of the way it accesses the data. It never has to search for a place to put new data or a place to get data from because that place is always the top. All data on the stack must take up a known, fixed size.

Heap: The heap is less organized. When we put data onto the heap, we ask for some amount of space. The memory allocator finds an empty spot somewhere in the heap that is big enough, marks it as being in use, and returns to us a pointer, which is the address of that location. Accessing the data in the heap is slower than accessing the data on the stack because we have to follow a pointer to get there.

TO BE UPDATED…