
Debugging with a Service Mesh

September 23, 2021 | 14 min read

In this article, we look at how a service mesh can help you debug and mitigate some types of app failures. We'll examine several of the capabilities that service meshes may provide; each service mesh technology supports its own set of such capabilities.

We use examples from Linkerd to illustrate the capabilities that service meshes may provide, but the fundamental concepts discussed here will apply to any service mesh.

Service Mesh Status Checks

In many situations, it’s helpful to first check the status of the service mesh components. If the mesh itself is having a failure, such as its control plane not working, then the app failures you are seeing may actually be caused by that larger problem rather than by an issue with the app itself.

Below is an example of possible output from running the linkerd check command. Take some time to review each group of checks in this output. There’s a link to the Linkerd documentation for each group of checks so that you can learn more about them and see examples of failures.

1. Can the service mesh communicate with Kubernetes? (kubernetes-api checks)

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

2. Is the Kubernetes version compatible with the service mesh version? (kubernetes-version checks)

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

3. Is the service mesh installed and running? (linkerd-existence checks)

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

4. Is the service mesh’s control plane properly configured? (linkerd-config checks)

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

5. Are the service mesh’s credentials valid and up to date? (linkerd-identity checks)

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

6. Is the API for the control plane running and ready? (linkerd-api checks)

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

7. Is the service mesh installation up to date? (linkerd-version checks)

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

8. Is the service mesh control plane up to date? (control-plane-version checks)

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

If any of the status checks failed, you would see output similar to the example below. It would indicate which check failed, and often also provide you with additional information on the nature of the failure to help you with troubleshooting it.

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane:
server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints

If all of the status checks passed and no issues were detected, the last line of the output will look like this:

Status check results are √
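
Because linkerd check also signals failures through its exit code, you can script this first step. Here is a minimal sketch, assuming your version of linkerd check exits with a non-zero status when any check fails:

# Abort a debugging or deployment script early if the mesh itself is unhealthy.
# Assumes linkerd check exits non-zero when any check fails.
if ! linkerd check; then
  echo "Service mesh checks failed; investigate the mesh before debugging the app." >&2
  exit 1
fi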

Service Proxy Status Checks

Sometimes you may want to do a status check for other aspects of the service mesh, in addition to or instead of the ones you’ve just reviewed. For instance, you may just want to check the status of the service proxies that your app is supposed to be using. In Linkerd, you can do that by adding the --proxy flag to the linkerd check command.

Note that Linkerd refers to “service proxies” as “data plane proxies.”

Below is an excerpt of possible output from running linkerd check --proxy. This command runs all the same checks as the linkerd check command, plus a few additional ones specific to data plane proxies. The duplicated linkerd check output has been omitted here for brevity; the sample output shows only the additional checks performed because of the --proxy flag.

9. Are the credentials for each of the data plane proxies valid and up to date? (linkerd-identity-data-plane checks)

linkerd-identity-data-plane
---------------------------
√ data plane proxies certificate match CA

10. Are the data plane proxies running and working properly? (linkerd-data-plane checks)

linkerd-data-plane
------------------
√ data plane namespace exists
√ data plane proxies are ready
√ data plane proxy metrics are present in Prometheus
√ data plane is up-to-date
√ data plane and cli versions match
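
If your app runs in a single namespace, you may be able to narrow these data plane checks to just that namespace. A hedged example, where the booksapp namespace name is made up for illustration:

# Check only the data plane proxies in the namespace where the app runs.
# Replace the hypothetical "booksapp" namespace with your own.
linkerd check --proxy --namespace booksapp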

Service Route Metrics

If your service mesh status checks don’t report any problems, a common next step in troubleshooting is to look at the metrics for the app’s service routes in the mesh. In other words, you want to see how each of the routes the app uses within the mesh is performing. These measurements are often useful for purposes other than troubleshooting, such as determining how the performance of an app could be improved.

Let’s suppose that you’re troubleshooting an app that is experiencing intermittent slowdowns and failures. You could look at the per-route metrics for the app to see if there’s a particular route that is the cause or is involved somehow.

The linkerd routes command returns a snapshot of the performance metrics for each route within a particular scope. That scope is defined by what’s called a service profile. (You’ll learn more about service profiles in the next section.) Below is an example of the command’s output header and one line of sample metrics. They have been reformatted from the original to improve their readability.

ROUTE SERVICE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
GET / webapp 100.00% 0.6rps 15ms 20ms 20ms

By default, the metrics are for inbound requests only. This example shows the performance of requests made over the “GET /” route to the webapp service. In this case, 100% of the recent requests succeeded, with an average of 0.6 requests processed per second.

The three latency metrics indicate how much time it took to handle the requests, based on percentiles. P50 refers to the 50th percentile, which is the median time; in this case, 15 milliseconds. P95 refers to the 95th percentile, which indicates that the app is handling 95 percent of the requests as fast as or faster than 20 milliseconds. P99 provides the same type of measurement for the 99th percentile.

Viewing the metrics for each route can indicate where slowdowns or failures are occurring within the mesh, and also where things are functioning normally. It can narrow down what the problem might be, or in some cases point you right to the culprit.
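
For reference, per-route metrics like the sample above might be requested with commands along these lines; the resource names here are hypothetical, and flag support can vary by Linkerd version:

# Inbound route metrics for a hypothetical "webapp" deployment.
linkerd routes deploy/webapp

# Outbound route metrics from "webapp" toward a hypothetical "books" service.
linkerd routes deploy/webapp --to svc/books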

Service Route Configuration for Issue Mitigation

You may want to change the configuration of particular service routes to mitigate problems that occur. For example, you could change the timeouts and automatic retries for a particular route so that requests to a problematic pod are shifted to another pod more quickly. That could reduce delays for users while you continue to troubleshoot the problem or while a developer changes code to address the underlying issue.

In Linkerd, the mechanism for configuring a route is called a service profile. As mentioned earlier, service profiles can also be used to specify which routes to provide metrics on for the linkerd routes command.

For more information on creating and using service profiles, see https://linkerd.io/2/features/service-profiles/.
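
To make the mitigation concrete, here is a rough sketch of what a per-route retry and timeout configuration might look like in a service profile. The service name, route, and values are purely illustrative; check the documentation above for the exact fields your Linkerd version supports.

# Sketch of a ServiceProfile that marks one route as retryable and caps its latency.
# Names and values are hypothetical; verify the fields against your Linkerd version.
cat <<EOF | kubectl apply -f -
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: webapp.default.svc.cluster.local
  namespace: default
spec:
  routes:
  - name: GET /
    condition:
      method: GET
      pathRegex: /
    isRetryable: true   # retry failed requests to this route automatically
    timeout: 300ms      # give up on a slow pod sooner
EOF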

Request Logging

If you need more detail about requests and responses than you can get from the service route metrics, you may want to do logging of the individual requests and responses.

Caution: logging requests can generate a rapidly growing amount of log data. In many cases you will only need to see a few logged requests, and not massive volumes of them.

Here is an example of a few logged requests. This log was generated by running the linkerd tap command. Blank lines have been added between the log entries to improve readability. These three entries all involve the same request.

The first one shows what the request was, and the second shows the status code that was returned (in this case, a 503, Service Unavailable). The second and third entries both contain metrics for how this request was handled. This additional information, beyond what could be seen in route-level metrics, may help you to narrow your search for the problem.

req id=9:49 proxy=out src=10.244.0.53:37820 dst=10.244.0.50:7001 tls=true :method=HEAD :authority=authors:7001 :path=/authors/3252.json

rsp id=9:49 proxy=out src=10.244.0.53:37820 dst=10.244.0.50:7001 tls=true :status=503 latency=2197µs

end id=9:49 proxy=out src=10.244.0.53:37820 dst=10.244.0.50:7001 tls=true duration=16µs response-length=0B
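
For context, entries like the ones above might be produced by a command along these lines; the deployment name, namespace, and filter are assumptions for illustration:

# Tap live requests from a hypothetical "webapp" deployment, limited to one path.
linkerd tap deploy/webapp --namespace booksapp --path /authors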

For more information on the linkerd tap command, see https://linkerd.io/2/reference/cli/tap/.

Service Proxy Logging

Sometimes you want to better understand what is happening within a particular service proxy. You may be able to do that by increasing the extent of the logging that the service proxy is performing, such as recording more events or recording more details about each event.

Be very careful before altering service proxy logging because it can negatively impact the proxy’s performance, and the volume of the logs themselves can also be overwhelming.

Linkerd allows its service proxy log level to be changed in various ways. For more information, see https://linkerd.io/2/tasks/modifying-proxy-log-level/ and https://linkerd.io/2/reference/proxy-log-level/.
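
As one hedged example of a common approach, the proxy log level can be set through an annotation on the pod template at injection time; verify the annotation name and value syntax against the documentation linked above.

# Sketch: raise the proxy log level for one deployment by annotating its pod template.
# Changing the template triggers a rollout, so new pods get proxies with the new level.
kubectl -n booksapp patch deploy/webapp -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=debug"}}}}}'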

Here is an example of what Linkerd’s service proxy logs may look like.


[ 326.996211471s] WARN inbound:accept{peer.addr=10.244.0.111:55288}:source{target.addr=10.244.0.131:7002}:http1{name=books.booksapp.svc.cluster.local:7002 port=7002 keep_alive=true wants_h1_upgrade=false was_absolute_form=false}:profile{addr=books.booksapp.svc.cluster.local:7002}:daemon:poll_profile: linkerd2_service_profiles::client: Could not fetch profile error=grpc-status: Unavailable, grpc-message: "proxy max-concurrency exhausted"


Injecting a Debug Container

If you need to take an even closer look at what’s happening inside a pod, you may be able to have your service mesh inject a debug container into that pod. A debug container is designed to monitor the activity within the pod and to collect information on that activity, such as capturing network packets.

In Linkerd, you can inject a debug container by adding the --enable-debug-sidecar flag to the linkerd inject command. You can then open a shell to the debug container and issue commands within it to gather more information and continue troubleshooting the problem.
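
As a rough illustration (the resource names are hypothetical, and the debug container’s name and bundled tools may differ by release), the workflow might look like this:

# Re-inject a hypothetical "webapp" deployment with the debug sidecar enabled.
kubectl -n booksapp get deploy/webapp -o yaml \
  | linkerd inject --enable-debug-sidecar - \
  | kubectl apply -f -

# Open a shell in the debug container of one of the resulting pods.
# "linkerd-debug" is the container name Linkerd uses for its debug sidecar.
kubectl -n booksapp exec -it deploy/webapp -c linkerd-debug -- /bin/bash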

For more information on using Linkerd’s debug container (called the “debug sidecar”), see https://linkerd.io/2/tasks/using-the-debug-container/.

Using the Telepresence Tool

Telepresence is a debugging tool that is hosted by the Cloud Native Computing Foundation (CNCF). It can be used instead of or in addition to injecting a debug container so you can examine what’s happening inside a pod.

Using Telepresence, you can run a single process (a service or debug tool) locally, and a two-way network proxy enables that local service to effectively operate as part of the remote Kubernetes cluster. This architecture means that Telepresence is usually not run in production clusters; it’s intended for use in testing or staging.

Because you are running a service locally, you can use whatever debugging or testing tools you’d like to monitor and probe the executing service. You can also edit your service using whatever tool you choose.
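
As a hedged sketch using the Telepresence 2 CLI (the workload name and port are hypothetical; check the Telepresence documentation for the current command set), intercepting a service so that a locally running copy receives its traffic looks roughly like this:

# Connect your workstation to the cluster's network.
telepresence connect

# See which workloads can be intercepted.
telepresence list

# Route traffic destined for the hypothetical "webapp" workload to a process
# listening on local port 8080, so you can run and debug the service locally.
telepresence intercept webapp --port 8080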