It is always fun and interesting to review our knowledge every time we prepare for a technical interview. As time goes by, we explore new technologies that give us new things to learn and use, and we might forget the “basic things” that we use less often. In this post, I am going to list the fundamental knowledge that we definitely need for an SRE (Site Reliability Engineer) interview. If you are a DevOps Engineer, System Engineer or System Administrator, this post is also useful for you :-).
Surveys show that Linux is more popular than Windows in the server operating system market. Many enterprises use Linux today, so it is definitely an important operating system that every SRE should know. Even if you mainly work with Microsoft technologies, it is good to know Linux and its open source tools. I don’t use Windows much, so I am only going to write about Linux and its related tooling and techniques :-).
There are 2 books that I highly recommend reading:
- The Linux Programming Interface: A Linux and UNIX System Programming Handbook by Michael Kerrisk
- Advanced Programming in the UNIX Environment, 3rd Edition by W. Richard Stevens and Stephen A. Rago
However, reading those big books takes time. You can also take a look at this blog for a quick tour of the Linux kernel internals.
Process: Each process provides the resources needed to execute a program. A process has a virtual address space, executable code, open handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.
Thread: A thread is an entity within a process that can be scheduled for execution. All threads of a process share its virtual address space and system resources. In addition, each thread maintains exception handlers, a scheduling priority, thread local storage, a unique thread identifier, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread’s set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread’s process. Threads can also have their own security context, which can be used for impersonating clients.
In short, the typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.
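The separate memory spaces can be observed directly. The following is a minimal sketch, assuming a POSIX system; the function name `parent_value_after_child_write` is made up for this illustration. A forked child overwrites its own copy of a variable, and the parent still sees the old value afterwards. A thread created inside the same process would instead modify the one shared copy.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's copy of `value` after a forked child has
 * overwritten its own copy. Returns -1 if fork() or waitpid() fails. */
int parent_value_after_child_write(int value)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write only touches the child's private copy. */
        value = value + 1000;
        _exit(value > 0 ? 0 : 1);
    }

    /* Parent: wait for the child, then report our own, untouched copy. */
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return value;
}
```

Calling `parent_value_after_child_write(42)` in the parent gives back 42, even though the child added 1000 to its copy.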
You can also read more About Processes and Threads on MSDN.
A system call is the fundamental interface between a process and the operating system. System calls are usually made when a process in user mode needs access to a resource: it asks the kernel to provide that resource via a system call.
There are mainly 5 types of system calls:
- Process Control: Working with processes, such as creation and termination. Sample system calls: fork(), execve(), exit(), wait().
- File Management: Working with files on the system, with operations such as open, read, write and close. Sample system calls: open(), read(), write(), close().
- Device Management: Working with hardware devices, with operations such as read and write. Sample system calls: ioctl(), read(), write().
- Information Maintenance: Responsible for transferring information between the user application and the operating system. Sample system calls: getpid(), uname(), sysinfo().
- Communication: Responsible for interprocess communication (IPC). Sample system calls: pipe(), shmget(), socket().
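To make the user mode / kernel boundary a bit more concrete, here is a small sketch, assuming Linux with glibc. It asks the kernel for the PID through the generic `syscall()` entry point instead of the usual `getpid()` wrapper, and it pushes a message through a pipe with `write()` and `read()`. The function names `pid_via_raw_syscall` and `pipe_roundtrip` are made up for this example.

```c
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Information maintenance: ask the kernel for our PID through the
 * generic syscall(2) entry point instead of the getpid() wrapper. */
long pid_via_raw_syscall(void)
{
    return syscall(SYS_getpid);
}

/* Communication and file management: send `msg` through a pipe with
 * write(2) and read it back with read(2). Returns the number of bytes
 * read back (without the terminating NUL), or -1 on error. */
long pipe_roundtrip(const char *msg, char *out, size_t out_len)
{
    int fds[2];
    if (out_len == 0 || pipe(fds) < 0)
        return -1;

    size_t len = strlen(msg);
    if (len > out_len - 1)
        len = out_len - 1;          /* leave room for the NUL */

    ssize_t written = write(fds[1], msg, len);
    close(fds[1]);                  /* lets read() see end-of-file */

    ssize_t got = (written < 0) ? -1 : read(fds[0], out, out_len - 1);
    close(fds[0]);
    if (got < 0)
        return -1;

    out[got] = '\0';
    return (long)got;
}
```

The sketch only works for short messages that fit in the pipe buffer; a longer message would need a separate reader process.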
The fork() system call creates a new process, called the child process, which runs concurrently with the process that called fork(), called the parent process. After the child is created, both processes continue with the instruction that follows the fork() call. The child starts as a copy of the parent: it has the same pc (program counter), the same CPU register values and the same open files as the parent at the moment of the call.
fork() returns once in each process, so a successful call yields 2 values: 0 in the child process and x = child process PID in the parent process. If it fails, -1 is returned in the parent and no child is created.
Sample c program to fork new processes:
The above code will print hello world 4 times because we have 4 processes in total. They are created as described below.
fork (); // Line 1
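The doubling can also be checked with a small experiment. This is only a sketch, assuming a POSIX system; the function name `count_processes_after_forks` and the idea of counting bytes in a pipe are choices made for this illustration. Every process that gets past the last fork() writes one byte into a shared pipe, and the original process counts the bytes.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Calls fork() `n_forks` times in a row and returns how many processes
 * reached the line after the last fork(). Returns -1 on error. */
int count_processes_after_forks(int n_forks)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t original = getpid();
    for (int i = 0; i < n_forks; i++)
        fork();                 /* parent and child both continue the loop */

    /* Every process that gets here announces itself once. */
    char mark = 'x';
    ssize_t w = write(fds[1], &mark, 1);
    (void)w;
    close(fds[1]);

    /* Reap our own children so that none of them is left as a zombie. */
    while (wait(NULL) > 0)
        ;

    if (getpid() != original)
        _exit(0);               /* descendants stop here */

    /* Original process: count the bytes until end-of-file. */
    int count = 0;
    char c;
    while (read(fds[0], &c, 1) == 1)
        count++;
    close(fds[0]);
    return count;
}
```

With two fork() calls the function reports 4 processes. In general n calls give 2^n processes, as long as no fork() fails.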
In Linux, every process has a state. From the shell, you can run the ps aux command and check the STAT column to see a process’s state. The possible values are:
- D: Uninterruptible sleep. A process that cannot be killed or interrupted with a signal, usually because it is waiting for I/O.
- R: Running or runnable (on run queue). It is either using the CPU or just waiting for a CPU to become available.
- S: Interruptible sleep. It is waiting for an event to complete, such as keyboard input.
- T: Stopped. A process that has been suspended / stopped.
- W: Paging (not valid since the 2.6.xx kernel).
- X: Process is dead. We should never see it.
- Z: Defunct (zombie) process. The process is terminated but not reaped by its parent.
- <: High-priority (not nice to other users).
- N: Low-priority (nice to other users).
- L: Has pages locked into memory (for real-time and custom IO).
- s: Process is a session leader.
- l: Process is multi-threaded.
- +: Process is in the foreground process group.
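The same state letter can be read programmatically. The sketch below assumes Linux with /proc mounted; the function names `process_state` and `own_process_state` are hypothetical. It reads the third field of /proc/<pid>/stat, which holds the one-letter state (without the extra modifiers such as s, l or + that ps appends).

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the one-letter state of `pid` as reported by the Linux-specific
 * /proc/<pid>/stat file, or '?' if it cannot be read or parsed. */
char process_state(pid_t pid)
{
    char path[64];
    char line[1024];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return '?';

    /* Format: "pid (comm) state ...". The command name may itself contain
     * spaces or parentheses, so look for the LAST ')' in the line. */
    char *close_paren = strrchr(line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}

/* A process that is busy reading its own state is, by definition, running. */
char own_process_state(void)
{
    return process_state(getpid());
}
```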
A process stays in the Zombie state when its parent doesn’t call wait(), so the child’s entry still exists in the process table. We might find this in applications that are not programmed properly.
Actually, every child process first becomes a zombie before being removed from the process table. The parent process reads the exit status of the child, which removes (reaps) the child’s entry from the process table. However, this usually happens so quickly that we cannot notice it when looking at the STAT column in the ps aux output.
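Simple C program to create a zombie process:
#include <stdlib.h>
#include <unistd.h>
int main(void) {
    // Child PID is returned in parent process
    pid_t child_pid = fork();
    // Parent process: sleep without calling wait(), so the exited child stays a zombie
    if (child_pid > 0) sleep(60);
    // Child process: exit immediately
    else exit(0);
}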
Simply speaking, signals are a way to deliver asynchronous events to an application. For instance, when we press CTRL+C while running an application, a signal (SIGINT) is sent to the application’s process to terminate it.
In Linux, every signal has a name that begins with the characters SIG, and a value. The following are the standard signals:
Signal Value Action Comment
From the Linux shell, we can use the kill command to send a signal to a running process. For instance, kill -9 14112 sends the SIGKILL signal to forcibly kill the process that has PID 14112.
An inode is a data structure in Unix that contains metadata about a file. The following inode attributes can be retrieved with the stat system call:
- owner (UID, GID)
- atime, ctime, mtime
- blocks list of where the data is
Note: The filename is present in the parent directory’s structure, not in the inode.
A hard link shares the same inode number as the source, while a soft (symbolic) link has its own inode number and only stores the path of the source.
As a result:
- If the source file is moved, the soft link will be broken but the hard link will still work fine.
- A hard link is only valid within the same filesystem, while a soft link can point across filesystems.
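Both points about inode numbers and moved files can be verified with stat() and lstat(). This is a minimal sketch, assuming a writable /tmp on a filesystem that supports hard and symbolic links; the struct `link_report`, the function `compare_links` and the file names are made up for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct link_report {
    int hard_same_inode;     /* hard link has the source's inode number */
    int soft_same_inode;     /* the symlink itself has the source's inode number */
    int hard_ok_after_move;  /* hard link still resolves after the source is renamed */
    int soft_ok_after_move;  /* symlink still resolves after the source is renamed */
};

/* Creates a file, a hard link and a soft link in a temporary directory,
 * compares inode numbers, renames the source and checks both links again.
 * Returns 0 on success, -1 on error (temporary files may then remain). */
int compare_links(struct link_report *out)
{
    char dir[] = "/tmp/linkdemo-XXXXXX";
    if (mkdtemp(dir) == NULL)
        return -1;

    char src[128], moved[128], hard[128], soft[128];
    snprintf(src, sizeof src, "%s/source", dir);
    snprintf(moved, sizeof moved, "%s/moved", dir);
    snprintf(hard, sizeof hard, "%s/hard", dir);
    snprintf(soft, sizeof soft, "%s/soft", dir);

    FILE *f = fopen(src, "w");
    if (f == NULL)
        return -1;
    fputs("data\n", f);
    fclose(f);

    if (link(src, hard) < 0 || symlink(src, soft) < 0)
        return -1;

    /* lstat() describes the link itself instead of following it. */
    struct stat s_src, s_hard, s_soft, tmp;
    if (stat(src, &s_src) < 0 || lstat(hard, &s_hard) < 0 || lstat(soft, &s_soft) < 0)
        return -1;
    out->hard_same_inode = (s_src.st_ino == s_hard.st_ino);
    out->soft_same_inode = (s_src.st_ino == s_soft.st_ino);

    if (rename(src, moved) < 0)
        return -1;
    out->hard_ok_after_move = (stat(hard, &tmp) == 0);  /* same inode, still there */
    out->soft_ok_after_move = (stat(soft, &tmp) == 0);  /* follows a path that is gone */

    unlink(hard);
    unlink(soft);
    unlink(moved);
    rmdir(dir);
    return 0;
}
```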
ARP stands for Address Resolution Protocol. When you try to ping an IP address on your local network, say 192.168.1.1, your system has to turn the IP address 192.168.1.1 into a MAC address. This involves using ARP to resolve the address, hence its name.
If the IP address is not found in the ARP table, the system sends a broadcast packet to the network using the ARP protocol to ask who has 192.168.1.1. Because it is a broadcast packet, it is sent to a special MAC address (ff:ff:ff:ff:ff:ff) that causes all machines on the network to receive it. The machine with the requested IP address replies with an ARP packet that says I am 192.168.1.1, and this includes the MAC address which can receive packets for that IP.
To broadcast ARP reply packets that update the ARP tables of the machines in the local network, we can use the arping command. For instance: arping -A -I eth0 192.168.1.123.
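On Linux, the kernel's ARP table is exposed as text in /proc/net/arp, so the "is this IP already in the ARP table" check can be sketched in a few lines. The code below is an illustration only; the struct `arp_entry` and the functions `parse_arp_line` and `arp_lookup` are names chosen for this example, and the column layout is the one found on current Linux kernels.

```c
#include <stdio.h>
#include <string.h>

struct arp_entry {
    char ip[46];
    unsigned int hw_type;
    unsigned int flags;      /* 0x2 means the entry is complete */
    char mac[18];
    char device[16];
};

/* Parses one data line of /proc/net/arp:
 *   IP address, HW type, Flags, HW address, Mask, Device
 * Returns 0 on success, -1 otherwise (for example for the header line). */
int parse_arp_line(const char *line, struct arp_entry *e)
{
    int n = sscanf(line, "%45s %x %x %17s %*s %15s",
                   e->ip, &e->hw_type, &e->flags, e->mac, e->device);
    return n == 5 ? 0 : -1;
}

/* Looks `ip` up in the kernel ARP table. Returns 0 and fills `mac_out`
 * (at least 18 bytes) if an entry is found, -1 otherwise. */
int arp_lookup(const char *ip, char *mac_out)
{
    FILE *f = fopen("/proc/net/arp", "r");
    if (f == NULL)
        return -1;

    char line[256];
    struct arp_entry e;
    int found = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_arp_line(line, &e) == 0 && strcmp(e.ip, ip) == 0) {
            strcpy(mac_out, e.mac);
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

When `arp_lookup()` returns -1, the kernel has no entry for that IP and would have to send the broadcast request described above.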
This topic is more about theory. However, the better you understand it, the better you can work with network protocols, troubleshooting and programming.
- Understand the basics of the TCP/IP model. How does it compare to the OSI model?
- Understand how TCP and UDP work and their design.
DHCP stands for Dynamic Host Configuration Protocol. It is a network management protocol used on UDP/IP networks whereby a DHCP server dynamically assigns an IP address and other network configuration parameters to each device on a network so they can communicate with other IP networks.
The DHCP client and server communicate over UDP with the following steps:
HOST (68/udp)                    SERVER (67/udp)
DHCPDISCOVER (broadcast)  ---->
                          <----  DHCPOFFER
DHCPREQUEST (broadcast)   ---->
                          <----  DHCPACK
DNS stands for Domain Name System. It is a hierarchical and decentralized naming system for computers, services, or other resources connected to the Internet or a private network.
To understand how DNS works, you can take a look at this awesome comic website. Basically, these are the DNS query steps after you enter example.com in your browser address bar:
- Step 1: The browser checks your machine’s DNS cache and hosts file.
- Step 2: If there is no local cache entry, your machine asks the configured DNS servers. If the DNS server has example.com in its cache, it returns the result; otherwise it continues with the next steps.
- Step 3: The DNS server queries a Root Name Server, which returns the list of authoritative DNS servers for the top-level domain (.com, .org, .net, etc.).
- Step 4: The authoritative Top-Level Domain (TLD) Name Server returns the list of authoritative Name Servers (NS) for the example.com domain.
- Step 5: The DNS server queries those Name Servers to get the IP address for example.com.
Note: In step 4, the Name Server may itself be a sub-domain of the domain that we want to resolve (for instance ns1.example.com). This causes a circular resolution problem. So in this case, the Name Server DNS response must contain an additional section that includes the IP addresses of those name servers. This is called a glue record.
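From a program's point of view, all of these steps hide behind a single resolver call. Here is a minimal sketch using the standard `getaddrinfo()` function; the wrapper name `resolve_ipv4` is hypothetical. It returns the first IPv4 address for a host name as text. Which of the steps above actually run depends on the local cache, the hosts file and the configured DNS servers.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolves `host` to its first IPv4 address and writes it as text into
 * `out`. Returns 0 on success and a non-zero value on failure. */
int resolve_ipv4(const char *host, char *out, size_t out_len)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0)
        return rc;                    /* gai_strerror(rc) describes the error */

    struct sockaddr_in *addr = (struct sockaddr_in *)res->ai_addr;
    const char *text = inet_ntop(AF_INET, &addr->sin_addr, out, (socklen_t)out_len);
    freeaddrinfo(res);
    return text != NULL ? 0 : -1;
}
```

For instance, `resolve_ipv4("example.com", buf, sizeof buf)` would trigger the lookup described above, while a literal address such as "127.0.0.1" is returned without any DNS query.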
Understanding how routing protocols work helps you troubleshoot network issues more easily. Although we are not going to become Network Engineers, it is good to know the basics of routing protocols and when / why we should use them.
These are the routing methods and protocols we should care about:
- Static routing
- Routing Information Protocol (RIP)
- Open Shortest Path First (OSPF)
- Enhanced Interior Gateway Routing Protocol (EIGRP)
- Border Gateway Protocol (BGP)
Static routing is the simplest routing method and is supported in Linux by default. I highly recommend that you learn and understand how it works.
In Linux, the command route -n shows the main routing table, which is selected by a default ip rule. You should also take a look at Linux Policy Based Routing to learn about routing tables and rules.
Traceroute is a program that shows us the route taken by packets in a network. Understanding how it works helps us examine problems that might cause network communication issues between hosts in our network.
The following steps happen when we run traceroute:
- Step 1: Traceroute sends a UDP packet from the source to the destination with TTL = 1. This UDP packet reaches the first router (hop), where the router decrements the TTL by 1, making the packet’s TTL = 0, and hence the packet gets dropped.
- Step 2: Because it dropped the packet, the router sends an ICMP Time exceeded message back to the source.
- Step 3: Traceroute makes a note of the router’s address and the time taken for the round-trip.
- Step 4: Traceroute sends two more packets in the same way to get an average value of the round-trip time. Usually, the first round-trip takes longer than the other two because of the delay in ARP finding the physical address; after that, the address stays in the ARP cache.
- Step 5: These steps are repeated until a packet reaches the destination. Each time, the initial TTL is incremented by 1 so that the UDP packet gets one router (hop) further.
- Step 6: Once the destination is reached, an ICMP Time exceeded message will NOT be sent back, because the packet has arrived. An ICMP Destination Unreachable (port unreachable) message is sent back instead. This is because Traceroute specifies a destination port that is not normally used by any UDP service, so the destination host rejects the packet after inspecting its UDP header.
- Step 7: The Traceroute program receives the Destination Unreachable message and understands that the destination has been reached. Then the result is printed to the user.
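The part of these steps that a program controls is the TTL of each outgoing probe. The sketch below shows only that piece, assuming a system with the standard `IP_TTL` socket option; the function name `udp_socket_ttl_roundtrip` is made up. It creates a UDP socket, sets the TTL and reads it back. Sending the probes and receiving the ICMP replies (which normally needs extra privileges) is left out on purpose.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates a UDP socket, sets the TTL used for its outgoing packets and
 * reads the value back from the kernel. Returns the TTL the kernel
 * reports, or -1 on error. */
int udp_socket_ttl_roundtrip(int ttl)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int result = -1;
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof ttl) == 0) {
        int readback = 0;
        socklen_t len = sizeof readback;
        if (getsockopt(fd, IPPROTO_IP, IP_TTL, &readback, &len) == 0)
            result = readback;
    }
    close(fd);
    return result;
}
```

A traceroute-like loop would call `setsockopt()` with TTL 1, 2, 3 and so on before each round of probes.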
Troubleshooting is a large topic. There are several scenarios we might be asked about during the interview. Below, I will try to list a few useful tools that could help with troubleshooting questions.
dstat is a powerful tool for generating Linux system resource statistics. With this single tool you can quickly monitor the system resource utilization such as cpu, memory, disk, network, etc.
Send an HTTP request to a web server to retrieve the HTML content:
telnet 220.127.116.11 80
There are 3 important inputs we need to specify:
- GET / HTTP/1.1: We want to get the index page at / using HTTP version 1.1.
- Host: ndk.name: A web server might serve several websites; the Host header tells it which website we want.
- An empty line at the end (press Enter twice), which marks the end of the request headers.
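On the wire, each of those lines ends with CRLF, and the request ends with an empty line. A small sketch of the exact bytes follows; the helper name `build_get_request` is hypothetical, and the check against embedded line breaks is an extra precaution added for this example.

```c
#include <stdio.h>
#include <string.h>

/* Writes a minimal HTTP/1.1 GET request for `path` on `host` into `out`.
 * Returns the request length, or -1 if the input contains line breaks
 * or the buffer is too small. */
int build_get_request(const char *host, const char *path, char *out, size_t out_len)
{
    /* Refuse characters that would break the request line or headers. */
    if (strpbrk(host, "\r\n") != NULL || strpbrk(path, " \r\n") != NULL)
        return -1;

    int n = snprintf(out, out_len,
                     "GET %s HTTP/1.1\r\n"   /* request line */
                     "Host: %s\r\n"          /* selects the website */
                     "\r\n",                 /* empty line: end of headers */
                     path, host);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    return n;
}
```

The resulting buffer is what a client would write to the TCP socket on port 80 after connecting.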
netstat shows the server’s network connections, routing tables, interface statistics, etc. It is usually used to find the listening ports on the server. For instance:
sudo netstat -tulnp
The above command lists all the processes that are listening on UDP and TCP ports on the server. The p option shows the PID and name of the program that owns each socket. If you don’t need it, sudo is not needed in the command.
ngrep helps you search and filter the network packets going through your network interface. It can look for a regular expression in the payload of a packet and show the matching packets on the screen or console. Compared with Tcpdump and Wireshark, ngrep provides a query syntax that is more readable and easier to understand.
ngrep -q 'port 25'
Back to the example of using telnet to retrieve a website. If that website is running on HTTPS, we have to be able to work with TLS/SSL protocol so openssl is what we need.
openssl s_client -connect 18.104.22.168:443
We can also use openssl to show the remote server certificate
openssl s_client -showcerts -servername ndk.name -connect 22.214.171.124:443
Perf is a performance analysis tool for Linux. The strace tool can find the system calls executed by a process; perf can do the same, and more. Unlike strace, perf adds very little overhead, so it does not noticeably slow down our program during profiling.
Suppose a heavily loaded machine has CPU utilization at 100% and we want to quickly figure out the problem. The following command gives an overview of where the CPU time goes:
sudo perf top
If we want to use perf for a specific program, we can use:
sudo perf record <command>
Press CTRL + C to end the profiling. A perf.data file will be saved in the current directory. To see the result, run:
sudo perf report
We can also use perf to look at a specific system call with the -e option. To list all supported events, including the system call tracepoints, run:
sudo perf list
The following example records the sys_enter_connect system call to find the functions that send network traffic out:
sudo perf record -e syscalls:sys_enter_connect -ag
Stack: The stack stores values in the order it gets them, and removes the values in the opposite order. This is referred to as last in, first out. The stack is fast because of the way it accesses the data. It never has to search for a place to put new data or a place to get data from because that place is always the top. All data on the stack must take up a known, fixed size.
Heap: The heap is less organized. When we put data onto the heap, we ask for some amount of space. The operating system finds an empty spot somewhere in the heap that is big enough, marks it as being in use, and returns to us a pointer, which is the address of that location. Accessing the data in the heap is slower than accessing the data on the stack because we have to follow a pointer to get there.
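In C the difference is visible in the code. The following is a small sketch with a made-up function name, `sum_with_heap_buffer`: the accumulator has a fixed size and lives on the stack, while the buffer, whose size is only known at run time, is requested from the heap with `malloc()` and has to be released explicitly.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sums 1..n using a heap buffer whose size is only known at run time.
 * Returns the sum, or -1 if the allocation fails. */
long sum_with_heap_buffer(size_t n)
{
    long total = 0;                              /* fixed size: on the stack */
    if (n == 0)
        return 0;

    long *values = malloc(n * sizeof *values);   /* run-time size: on the heap */
    if (values == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        values[i] = (long)i + 1;                 /* reached through the pointer */
    for (size_t i = 0; i < n; i++)
        total += values[i];

    free(values);                                /* heap memory is released by hand */
    return total;                                /* `total` disappears with the stack frame */
}
```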
TO BE UPDATED…