circuit breaker Archives - Piotr's TechBlog

Circuit breaker and retries on Kubernetes with Istio and Spring Boot

piotr.minkowski — Wed, 03 Jun 2020 07:29:41 +0000

The ability to handle communication failures between services is a necessity for every service mesh framework. It includes handling both timeouts and HTTP error codes. In this article I’m going to show how to configure retry and circuit breaker mechanisms using Istio. As in the previous article, Service mesh on Kubernetes with Istio and Spring Boot, we will analyze communication between two simple Spring Boot applications deployed on Kubernetes. This time, instead of a very basic example, we are going to discuss more advanced topics.

Example

To demonstrate Istio with Spring Boot I created a repository on GitHub with two sample applications: callme-service and caller-service. The address of this repository is https://github.com/piomin/sample-istio-services.git. It is the same repository I used for the first article about service mesh with Istio, mentioned in the preface.

Architecture

The architecture of our sample system is pretty similar to the one in the previous article. However, there are some differences. We are not injecting a fault or delay using Istio components, but directly inside the application source code. Why? This way the failures come from callme-service itself, so the rules created for callme-service are what handle them, rather than the client side as before. Also, we are running two instances of version v2 of the callme-service application to test how the circuit breaker works for more than one instance of the same service (or rather the same Deployment). The following picture illustrates the architecture.

Spring Boot applications

We start with the implementation of the sample applications. The application callme-service exposes two endpoints that return information about its version and instance id. The endpoint GET /ping-with-random-error responds with HTTP 504 for ~50% of requests. The endpoint GET /ping-with-random-delay returns a response after a random delay between 0s and 3s. Here’s the implementation of the @RestController on the callme-service side.

@RestController
@RequestMapping("/callme")
public class CallmeController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallmeController.class);
    private static final String INSTANCE_ID = UUID.randomUUID().toString();
    private Random random = new Random();

    @Autowired
    BuildProperties buildProperties;
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        int r = random.nextInt(100);
        if (r % 2 == 0) {
            LOGGER.info("Ping with random error: name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.GATEWAY_TIMEOUT);
            return new ResponseEntity<>("Surprise " + INSTANCE_ID + " " + version, HttpStatus.GATEWAY_TIMEOUT);
        } else {
            LOGGER.info("Ping with random error: name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.OK);
            return new ResponseEntity<>("I'm callme-service " + INSTANCE_ID + " " + version, HttpStatus.OK);
        }
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() throws InterruptedException {
        int r = new Random().nextInt(3000);
        LOGGER.info("Ping with random delay: name={}, version={}, delay={}", buildProperties.getName(), version, r);
        Thread.sleep(r);
        return "I'm callme-service " + version;
    }

}

The application caller-service also exposes two GET endpoints. It uses RestTemplate to call the corresponding GET endpoints exposed by callme-service. It also returns the version of caller-service, but there is only a single Deployment of that application, labeled with version=v1.

@RestController
@RequestMapping("/caller")
public class CallerController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallerController.class);

    @Autowired
    BuildProperties buildProperties;
    @Autowired
    RestTemplate restTemplate;
    @Value("${VERSION}")
    private String version;


    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        LOGGER.info("Ping with random error: name={}, version={}", buildProperties.getName(), version);
        ResponseEntity<String> responseEntity =
                restTemplate.getForEntity("http://callme-service:8080/callme/ping-with-random-error", String.class);
        LOGGER.info("Calling: responseCode={}, response={}", responseEntity.getStatusCode(), responseEntity.getBody());
        return new ResponseEntity<>("I'm caller-service " + version + ". Calling... " + responseEntity.getBody(), responseEntity.getStatusCode());
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() {
        LOGGER.info("Ping with random delay: name={}, version={}", buildProperties.getName(), version);
        String response = restTemplate.getForObject("http://callme-service:8080/callme/ping-with-random-delay", String.class);
        LOGGER.info("Calling: response={}", response);
        return "I'm caller-service " + version + ". Calling... " + response;
    }

}

Handling retries in Istio

The definition of the Istio DestinationRule is the same as in my article Service mesh on Kubernetes with Istio and Spring Boot. There are two subsets, created for instances labeled with version=v1 and version=v2. Retries and timeouts may be configured on the VirtualService. We may set the number of retries and the conditions under which a retry takes place (a list of enum strings). The following configuration also sets a 3s timeout for the whole request. Both settings are available inside the HTTPRoute object. We also need to set a timeout per single attempt; here I set it to 1s. How does it work in practice? We will analyze it in a simple example.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx
      timeout: 3s

Before deploying the sample applications we should increase the level of logging. We may easily enable Istio access logging. Thanks to that, Envoy proxies print access logs for all incoming requests and outgoing responses to their standard output. Analyzing these log entries is especially useful for detecting retry attempts.

$ istioctl manifest apply --set profile=default --set meshConfig.accessLogFile="/dev/stdout"

Now, let’s send a test request to the HTTP endpoint GET /caller/ping-with-random-delay. It calls the randomly delayed callme-service endpoint GET /callme/ping-with-random-delay. Here’s the request and response for that operation.

At first glance the situation is very clear. But let’s check out what is happening under the hood. I have highlighted the sequence of retries. As you can see, Istio has performed two retries, since each of the first two attempts took longer than perTryTimeout, which has been set to 1s. Both attempts were timed out by Istio, which can be verified in its access logs. The third attempt was successful, since it took around 400ms.
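The sequence above can be replayed with a small model that is independent of Istio. This is only a sketch under simplifying assumptions: every attempt has a fixed duration, Envoy's back-off between retries is ignored, and the class name, method name and the 1500 ms and 1200 ms durations are made up for illustration (the logs only tell us that the first two attempts exceeded 1s and the third took around 400ms). The maxRetries parameter plays the role of attempts in the VirtualService.

```java
import java.util.List;

// Simplified, hypothetical model of the retry policy above; it is not Envoy's implementation.
class RetrySimulation {

    /**
     * Returns the 1-based number of the try that succeeded, or -1 if the request failed.
     * durationsMs.get(i) is how long the called service would need to answer the i-th try.
     */
    static int firstSuccessfulTry(List<Long> durationsMs, int maxRetries,
                                  long perTryTimeoutMs, long totalTimeoutMs) {
        long elapsed = 0;
        int maxTries = 1 + maxRetries; // the initial try plus the allowed retries
        for (int i = 0; i < maxTries && i < durationsMs.size(); i++) {
            long needed = durationsMs.get(i);
            long spent = Math.min(needed, perTryTimeoutMs); // a try is cut off at perTryTimeout
            if (elapsed + spent > totalTimeoutMs) {
                return -1; // the timeout for the whole request was hit first
            }
            elapsed += spent;
            if (needed <= perTryTimeoutMs) {
                return i + 1; // this try finished in time
            }
        }
        return -1; // every permitted try timed out
    }

    public static void main(String[] args) {
        // attempts: 3, perTryTimeout: 1s, timeout: 3s
        System.out.println(firstSuccessfulTry(List.of(1500L, 1200L, 400L), 3, 1000, 3000)); // prints 3
    }
}
```

With the values from the VirtualService, two tries are cut off after 1s each and the third one completes at roughly 2.4s, which still fits into the 3s request timeout.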

Retrying on timeout is not the only retry option available in Istio. In fact, we may retry on all 5XX codes or even on some 4XX codes. The VirtualService for testing just error codes is much simpler, since we don’t have to configure any timeouts.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
      retries:
        attempts: 3
        retryOn: gateway-error,connect-failure,refused-stream

We are going to call the endpoint GET /caller/ping-with-random-error, which calls the endpoint GET /callme/ping-with-random-error exposed by callme-service. It returns HTTP 504 for around 50% of incoming requests. Here’s the request and a successful response with HTTP 200 OK.

Here are the logs, which illustrate what happened on the callme-service side. The request has been retried 2 times, since the first two attempts resulted in an HTTP error code.
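To see why HTTP 504 is retried under this policy, here is a minimal sketch of how a retryOn list could be matched against a response status. It models only the status-based conditions 5xx and gateway-error; connection-level conditions such as connect-failure and refused-stream do not depend on a status code and are treated as no match here. The class and method names are hypothetical and this is not Envoy's code.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical helper that models only the status-code side of retryOn.
class RetryOnMatcher {

    private static final Set<Integer> GATEWAY_ERRORS = Set.of(502, 503, 504);

    /** Checks a single retryOn condition against an HTTP status code. */
    static boolean matches(String condition, int status) {
        switch (condition.trim()) {
            case "5xx":
                return status >= 500 && status <= 599;
            case "gateway-error":
                return GATEWAY_ERRORS.contains(status);
            default:
                return false; // connection-level conditions are not decided by the status code
        }
    }

    /** Checks a comma-separated retryOn value, as written in the VirtualService. */
    static boolean shouldRetry(String retryOn, int status) {
        return Arrays.stream(retryOn.split(",")).anyMatch(c -> matches(c, status));
    }

    public static void main(String[] args) {
        String retryOn = "gateway-error,connect-failure,refused-stream";
        System.out.println(shouldRetry(retryOn, 504)); // true
        System.out.println(shouldRetry(retryOn, 500)); // false
    }
}
```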

Istio circuit breaker

A circuit breaker is configured on the DestinationRule object, using TrafficPolicy. First, we will not use the retries from the previous sample, so we need to remove them from the VirtualService definition. We should also disable retries on the connectionPool inside TrafficPolicy (maxRetries: 0). And now the most important part. To configure a circuit breaker in Istio we use the OutlierDetection object. The Istio circuit breaker implementation is based on consecutive errors returned by the called service. The number of consecutive errors may be configured using the properties consecutive5xxErrors or consecutiveGatewayErrors. The only difference between them is the set of HTTP errors they handle: consecutiveGatewayErrors covers only 502, 503 and 504, while consecutive5xxErrors covers all 5XX codes.

In the following configuration of callme-service-destination I set consecutive5xxErrors to 3. It means that after 3 errors in a row an instance (pod) of the application is removed from load balancing for 1 minute (baseEjectionTime=1m). Because we are running two pods of callme-service in version v2, we also need to override the default value of maxEjectionPercent and set it to 100%. The default value of that property is 10%, and it indicates the maximum percentage of hosts in the load balancing pool that can be ejected.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 1m
      maxEjectionPercent: 100
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20

The fastest way to deploy both applications is with Jib and Skaffold. First go to the callme-service directory and execute the skaffold dev command with the optional --port-forward parameter.

$ cd callme-service
$ skaffold dev --port-forward

Then do the same for caller-service.

$ cd caller-service
$ skaffold dev --port-forward

Before sending some test requests, let’s run a second instance of the v2 version of callme-service, since the Deployment sets the replicas parameter to 1. To do that we need to run the following command.

$ kubectl scale --replicas=2 deployment/callme-service-v2

Now, let’s verify the status of the deployments on Kubernetes. There are 3 deployments. The deployment callme-service-v2 has two running pods.

After that we are ready to send some test requests. We are calling the endpoint GET /caller/ping-with-random-error exposed by caller-service, which calls the endpoint GET /callme/ping-with-random-error exposed by callme-service. The endpoint exposed by callme-service returns HTTP 504 for 50% of requests. I have already set up port forwarding for caller-service on 8080, so the command used to call the application is: curl http://localhost:8080/caller/ping-with-random-error.

Now, let’s analyze the responses from caller-service. I have highlighted the responses with the HTTP 504 error code from the instance of callme-service with version v2 and generated id 98c068bb-8d02-4d2a-9999-23951bbed6ad. After 3 error responses in a row from that instance, it is immediately removed from the load balancing pool, which results in sending all other requests to the second instance of callme-service v2, with id 00653617-58e1-4d59-9e36-3f98f9d403b8. Of course there is still a single instance of callme-service v1 available, which receives 20% of the total requests sent by caller-service.

OK, let’s check what happens if the single instance of callme-service v1 returns 3 errors in a row. I have also highlighted those error responses in the picture with logs visible below. Because there is only one instance of callme-service v1 in the pool, there is no way to redirect incoming traffic to other instances. That’s why Istio returns HTTP 503 for the next request sent to callme-service v1. The same response is returned for the next minute, since the circuit is open.
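The ejection behaviour observed in both tests can be summarized with a simplified model. The sketch below tracks a streak of consecutive 5XX responses per instance and ejects the instance for a fixed time once the streak reaches the configured limit. It deliberately ignores the interval and maxEjectionPercent settings and any growth of the ejection time, and all names in it are hypothetical; it is not how Envoy implements outlier detection internally.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of consecutive5xxErrors and baseEjectionTime.
class OutlierDetectionModel {

    private final int consecutiveErrors;   // e.g. consecutive5xxErrors: 3
    private final long baseEjectionMs;     // e.g. baseEjectionTime: 1m
    private final Map<String, Integer> errorStreak = new HashMap<>();
    private final Map<String, Long> ejectedUntil = new HashMap<>();

    OutlierDetectionModel(int consecutiveErrors, long baseEjectionMs) {
        this.consecutiveErrors = consecutiveErrors;
        this.baseEjectionMs = baseEjectionMs;
    }

    /** Records a response from the given instance at the given time. */
    void record(String instance, int status, long nowMs) {
        if (status >= 500 && status <= 599) {
            int streak = errorStreak.merge(instance, 1, Integer::sum);
            if (streak >= consecutiveErrors) {
                ejectedUntil.put(instance, nowMs + baseEjectionMs);
                errorStreak.put(instance, 0);
            }
        } else {
            errorStreak.put(instance, 0); // any non-5XX response breaks the streak
        }
    }

    /** True while the instance is removed from the load balancing pool. */
    boolean isEjected(String instance, long nowMs) {
        Long until = ejectedUntil.get(instance);
        return until != null && nowMs < until;
    }

    public static void main(String[] args) {
        OutlierDetectionModel model = new OutlierDetectionModel(3, 60_000);
        model.record("v2-a", 504, 0);
        model.record("v2-a", 200, 1); // resets the streak
        model.record("v2-a", 504, 2);
        model.record("v2-a", 504, 3);
        model.record("v2-a", 504, 4); // third error in a row
        System.out.println(model.isEjected("v2-a", 5));      // true
        System.out.println(model.isEjected("v2-a", 60_004)); // false
    }
}
```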


Circuit Breaking In Spring Cloud Gateway With Resilience4J

piotr.minkowski — Wed, 11 Dec 2019 11:06:24 +0000

In the newest version of Spring Cloud Gateway (2.2.1) we may take advantage of a new circuit breaker implementation built on top of the Resilience4J project (https://github.com/resilience4j/resilience4j). Resilience4J has been selected as a replacement for Netflix’s Hystrix, which has been moved to maintenance mode. Of course, you can still use Hystrix as the circuit breaker implementation, however it is deprecated and probably won’t be available in future versions of Spring Cloud. The new implementation is called simply Spring Cloud Circuit Breaker.

You can find another interesting example of using Spring Cloud Gateway components in one of my previous articles. I have already described how to implement rate limiting based on Redis here: Rate Limiting In Spring Cloud Gateway With Redis. In the current article I’m using the same GitHub repository as earlier: sample-spring-cloud-gateway. I’m going to show some sample scenarios of using Spring Cloud Circuit Breaker with Spring Cloud Gateway, including a fallback pattern.

1. Dependencies

To successfully test some scenarios of using the circuit breaker pattern with Spring Cloud Gateway we need to include the reactive version of Spring Cloud Circuit Breaker, since the gateway is started on a reactive Netty server. We will simulate the downstream service using MockServer provided within the Testcontainers framework. It is provisioned inside the test by a mock client written in Java.


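<dependency>
   <groupId>org.springframework.cloud</groupId>
   <artifactId>spring-cloud-starter-gateway</artifactId>
</dependency>
<dependency>
   <groupId>org.springframework.cloud</groupId>
   <artifactId>spring-cloud-starter-circuitbreaker-reactor-resilience4j</artifactId>
</dependency>
<dependency>
   <groupId>org.projectlombok</groupId>
   <artifactId>lombok</artifactId>
</dependency>
<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-test</artifactId>
   <scope>test</scope>
</dependency>
<dependency>
   <groupId>org.testcontainers</groupId>
   <artifactId>mockserver</artifactId>
   <version>1.12.3</version>
   <scope>test</scope>
</dependency>
<dependency>
   <groupId>org.mock-server</groupId>
   <artifactId>mockserver-client-java</artifactId>
   <version>3.10.8</version>
   <scope>test</scope>
</dependency>
<dependency>
   <groupId>com.carrotsearch</groupId>
   <artifactId>junit-benchmarks</artifactId>
   <version>0.7.2</version>
   <scope>test</scope>
</dependency>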

2. Enabling Spring Cloud Gateway Circuit Breaker with Resilience4J

To enable a circuit breaker built on top of Resilience4J we need to declare a Customizer bean that is passed a ReactiveResilience4JCircuitBreakerFactory. This very simple configuration contains the default circuit breaker settings and defines a timeout duration using TimeLimiterConfig. For the first test I decided to set a 200-millisecond timeout.

@Bean
public Customizer<ReactiveResilience4JCircuitBreakerFactory> defaultCustomizer() {
    return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
        .circuitBreakerConfig(CircuitBreakerConfig.ofDefaults())
        .timeLimiterConfig(TimeLimiterConfig.custom().timeoutDuration(Duration.ofMillis(200)).build())
        .build());
}

3. Building Test Class

In the next step we create a test class. Before running the test, it starts and provisions an instance of the mock server. We define two endpoints. The second of them, /2, adds a delay of 200 milliseconds, which is enough to hit the timeout defined in the circuit breaker configuration.

We also set the configuration of a Spring Cloud Gateway route that points to the started instance of the mock server. To enable the circuit breaker for our route we have to define a CircuitBreaker filter with a given name. The test is repeated 200 times. It calls the delayed and the non-delayed endpoint in a 50/50 proportion. Here’s the Spring Cloud Gateway test class.

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.DEFINED_PORT)
@RunWith(SpringRunner.class)
public class GatewayCircuitBreakerTest {

    private static final Logger LOGGER = LoggerFactory.getLogger(GatewayCircuitBreakerTest.class);

    @Rule
    public TestRule benchmarkRun = new BenchmarkRule();

    @ClassRule
    public static MockServerContainer mockServer = new MockServerContainer();

    @Autowired
    TestRestTemplate template;
    int i = 0;

    @BeforeClass
    public static void init() {
        System.setProperty("spring.cloud.gateway.routes[0].id", "account-service");
        System.setProperty("spring.cloud.gateway.routes[0].uri", "http://192.168.99.100:" + mockServer.getServerPort());
        System.setProperty("spring.cloud.gateway.routes[0].predicates[0]", "Path=/account/**");
        System.setProperty("spring.cloud.gateway.routes[0].filters[0]", "RewritePath=/account/(?<path>.*), /$\\{path}");
        System.setProperty("spring.cloud.gateway.routes[0].filters[1].name", "CircuitBreaker");
        System.setProperty("spring.cloud.gateway.routes[0].filters[1].args.name", "exampleSlowCircuitBreaker");
        MockServerClient client = new MockServerClient(mockServer.getContainerIpAddress(), mockServer.getServerPort());
        client.when(HttpRequest.request()
            .withPath("/1"))
            .respond(response()
                .withBody("{\"id\":1,\"number\":\"1234567890\"}")
                .withHeader("Content-Type", "application/json"));
        client.when(HttpRequest.request()
            .withPath("/2"))
            .respond(response()
                .withBody("{\"id\":2,\"number\":\"1234567891\"}")
                .withDelay(TimeUnit.MILLISECONDS, 200)
                .withHeader("Content-Type", "application/json"));
    }

    @Test
    @BenchmarkOptions(warmupRounds = 0, concurrency = 1, benchmarkRounds = 200)
    public void testAccountService() {
        int gen = 1 + (i++ % 2);
        ResponseEntity<Account> r = template.exchange("/account/{id}", HttpMethod.GET, null, Account.class, gen);
        LOGGER.info("{}. Received: status->{}, payload->{}, call->{}", i, r.getStatusCodeValue(), r.getBody(), gen);
    }

}

Here’s the result of this test. With the default settings the circuit opens after processing 100 requests with a 50% error rate. The logs visible below include the sequence number of each request, the HTTP response status code, the response body and the URL of the called endpoint.

We may change the default settings a little. To do that we should define a custom CircuitBreakerConfig. One of the properties we can customize is slidingWindowSize. It defines how many call outcomes have to be recorded while the circuit breaker is closed. Assuming we have the same test endpoints, what will happen if we change this value to 10 as shown below?

@Bean
public Customizer<ReactiveResilience4JCircuitBreakerFactory> defaultCustomizer() {
    return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
        .circuitBreakerConfig(CircuitBreakerConfig.custom()
            .slidingWindowSize(10)
            .build())
        .timeLimiterConfig(TimeLimiterConfig.custom().timeoutDuration(Duration.ofMillis(200)).build()).build());
}

Here’s the result. The circuit opens just after processing 10 requests, since at least 50% of them timed out.

Moreover, we may change failureRateThreshold. This property configures the failure rate threshold as a percentage. If the failure rate is equal to or greater than the threshold, the circuit breaker switches to open and starts short-circuiting calls. It is not difficult to predict what will happen if we change it to 66.6F for our current scenario. Since only half of the calls fail, the circuit will never be opened.

@Bean
public Customizer<ReactiveResilience4JCircuitBreakerFactory> defaultCustomizer() {
    return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
        .circuitBreakerConfig(CircuitBreakerConfig.custom()
            .slidingWindowSize(10)
            .failureRateThreshold(66.6F)
            .build())
        .timeLimiterConfig(TimeLimiterConfig.custom().timeoutDuration(Duration.ofMillis(200)).build()).build());
}
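The effect of slidingWindowSize and failureRateThreshold on this alternating test can be checked with a few lines of plain Java. The following is a simplified count-based window written for illustration only: it evaluates the failure rate once the window is full, and its class and method names are hypothetical rather than part of Resilience4J.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical count-based sliding window; not the Resilience4J implementation.
class SlidingWindowFailureRate {

    private final int windowSize;
    private final float thresholdPercent;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int failures = 0;

    SlidingWindowFailureRate(int windowSize, float thresholdPercent) {
        this.windowSize = windowSize;
        this.thresholdPercent = thresholdPercent;
    }

    /** Records one call outcome and returns true if the circuit would open now. */
    boolean record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        if (outcomes.size() > windowSize && outcomes.removeFirst()) {
            failures--; // the oldest outcome left the window and it was a failure
        }
        if (outcomes.size() < windowSize) {
            return false; // not enough calls recorded yet
        }
        return 100.0f * failures / windowSize >= thresholdPercent;
    }

    /** Feeds alternating calls (fast, delayed, fast, ...) and returns the call number that opens the circuit, or -1. */
    static int callsUntilOpen(int windowSize, float thresholdPercent, int totalCalls) {
        SlidingWindowFailureRate window = new SlidingWindowFailureRate(windowSize, thresholdPercent);
        for (int i = 1; i <= totalCalls; i++) {
            boolean failed = i % 2 == 0; // every second call hits the delayed endpoint
            if (window.record(failed)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(callsUntilOpen(100, 50.0f, 200)); // 100
        System.out.println(callsUntilOpen(10, 50.0f, 200));  // 10
        System.out.println(callsUntilOpen(10, 66.6f, 200));  // -1
    }
}
```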

4. Spring Cloud Gateway circuit breaker Customization

We started with really basic samples. Let’s do something more interesting! First, we will set a really small sliding window size: 5. Thanks to that we will be able to observe the full result of the current test scenario after processing only a few requests. The next step is to modify the rules defined on the mock server. Now, we will delay only the first 5 requests sent to the /2 endpoint. After receiving 5 requests it starts to work fine, without adding any delay. Thanks to that our circuit breaker will be able to get back from the OPEN state to CLOSED after some time. But first things first, here is the code defining the mock endpoints for the current test.

MockServerClient client = new MockServerClient(mockServer.getContainerIpAddress(), mockServer.getServerPort());
client.when(HttpRequest.request()
   .withPath("/1"))
   .respond(response()
      .withBody("{\"id\":1,\"number\":\"1234567890\"}")
      .withHeader("Content-Type", "application/json"));
client.when(HttpRequest.request()
   .withPath("/2"), Times.exactly(5))
   .respond(response()
      .withBody("{\"id\":2,\"number\":\"1234567891\"}")
      .withDelay(TimeUnit.MILLISECONDS, 200)
      .withHeader("Content-Type", "application/json"));
client.when(HttpRequest.request()
   .withPath("/2"))
   .respond(response()
      .withBody("{\"id\":2,\"number\":\"1234567891\"}")
      .withHeader("Content-Type", "application/json"));

As I mentioned before, slidingWindowSize is now equal to 5. If there are 3 timeouts during the last 5 calls, the circuit is switched to the OPEN state. We can configure how long the circuit should stay in the OPEN state without trying to process any request. The parameter waitDurationInOpenState, which is responsible for that, has been set to 30 milliseconds. Therefore, after 30 milliseconds the circuit is switched to the HALF_OPEN state, which means that incoming requests are processed again. We can also configure the number of permitted calls in the HALF_OPEN state. The property permittedNumberOfCallsInHalfOpenState is set to 5 instead of the default value 10. In these five attempts we get only 2 timeouts, since we set 5 repeats for the delayed service on the mock server and the first 3 timeouts happened at the beginning, before the circuit opened. Here’s our current configuration of Spring Cloud Circuit Breaker.

@Bean
public Customizer<ReactiveResilience4JCircuitBreakerFactory> defaultCustomizer() {
   return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
      .circuitBreakerConfig(CircuitBreakerConfig.custom()
         .slidingWindowSize(5)
         .permittedNumberOfCallsInHalfOpenState(5)
         .failureRateThreshold(50.0F)
         .waitDurationInOpenState(Duration.ofMillis(30))
         .build())
      .timeLimiterConfig(TimeLimiterConfig.custom().timeoutDuration(Duration.ofMillis(200)).build()).build());
}

The following diagram illustrates our scenario.

And here’s the result of our current test. The circuit has been opened after processing 6 requests. There were 3 incoming requests that were not processed during the 30 milliseconds of the open state. After that time the circuit has been switched to the half-open state, and finally it moved back to the closed state.
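The transitions in this scenario can be followed step by step with a small state machine. The sketch below is a simplified illustration with hypothetical names, not the Resilience4J implementation: time is passed in explicitly as milliseconds, the closed state uses a count-based sliding window, and the half-open state decides after the permitted number of calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical circuit breaker state machine for illustration only.
class CircuitStateMachine {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // slidingWindowSize
    private final int halfOpenCalls;       // permittedNumberOfCallsInHalfOpenState
    private final float thresholdPercent;  // failureRateThreshold
    private final long waitInOpenMs;       // waitDurationInOpenState

    private final Deque<Boolean> window = new ArrayDeque<>();
    private State state = State.CLOSED;
    private long openedAtMs;

    CircuitStateMachine(int windowSize, int halfOpenCalls, float thresholdPercent, long waitInOpenMs) {
        this.windowSize = windowSize;
        this.halfOpenCalls = halfOpenCalls;
        this.thresholdPercent = thresholdPercent;
        this.waitInOpenMs = waitInOpenMs;
    }

    State state() {
        return state;
    }

    /** Returns true if a call may go through at the given time. */
    boolean tryAcquire(long nowMs) {
        if (state == State.OPEN && nowMs - openedAtMs >= waitInOpenMs) {
            transitionTo(State.HALF_OPEN, nowMs);
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a permitted call and evaluates the failure rate once enough calls were seen. */
    void onResult(boolean failed, long nowMs) {
        int needed = state == State.HALF_OPEN ? halfOpenCalls : windowSize;
        window.addLast(failed);
        if (window.size() > needed) {
            window.removeFirst();
        }
        if (window.size() < needed) {
            return;
        }
        long failures = window.stream().filter(f -> f).count();
        if (100.0f * failures / needed >= thresholdPercent) {
            transitionTo(State.OPEN, nowMs);
        } else if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED, nowMs);
        }
    }

    private void transitionTo(State next, long nowMs) {
        state = next;
        window.clear();
        if (next == State.OPEN) {
            openedAtMs = nowMs;
        }
    }

    public static void main(String[] args) {
        CircuitStateMachine cb = new CircuitStateMachine(5, 5, 50.0f, 30);
        boolean[] firstSix = {false, true, false, true, false, true}; // fast, delayed, fast, ...
        for (boolean failed : firstSix) {
            cb.tryAcquire(0);
            cb.onResult(failed, 0);
        }
        System.out.println(cb.state());        // OPEN
        System.out.println(cb.tryAcquire(10)); // false
        System.out.println(cb.tryAcquire(30)); // true
        boolean[] halfOpen = {false, true, false, true, false}; // only 2 timeouts left
        for (boolean failed : halfOpen) {
            cb.onResult(failed, 31);
        }
        System.out.println(cb.state());        // CLOSED
    }
}
```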

What will happen if we increase the number of delayed requests in this scenario to 20?

client.when(HttpRequest.request()
   .withPath("/2"), Times.exactly(20))
   .respond(response()
      .withBody("{\"id\":2,\"number\":\"1234567891\"}")
      .withDelay(TimeUnit.MILLISECONDS, 200)
      .withHeader("Content-Type", "application/json"));

The circuit will keep switching between the OPEN and HALF_OPEN states for as long as the downstream service is delaying the responses.

5. Adding Fallback

As you have probably noticed, if a request to the downstream service ends with a timeout, the gateway returns HTTP 504 - Gateway Timeout. Moreover, if the circuit is open the gateway returns HTTP 503 - Service Unavailable. To avoid returning an error status code from the gateway we may enable a fallback endpoint for our route. To do that we have to set the property fallbackUri using the forward: scheme. Here’s the current configuration of the test route. I included the endpoint /fallback/account as the fallback URI.


System.setProperty("spring.cloud.gateway.routes[0].id", "account-service");
System.setProperty("spring.cloud.gateway.routes[0].uri", "http://192.168.99.100:" + mockServer.getServerPort());
System.setProperty("spring.cloud.gateway.routes[0].predicates[0]", "Path=/account/**");
System.setProperty("spring.cloud.gateway.routes[0].filters[0]", "RewritePath=/account/(?<path>.*), /$\\{path}");
System.setProperty("spring.cloud.gateway.routes[0].filters[1].name", "CircuitBreaker");
System.setProperty("spring.cloud.gateway.routes[0].filters[1].args.name", "exampleSlowCircuitBreaker");
System.setProperty("spring.cloud.gateway.routes[0].filters[1].args.fallbackUri", "forward:/fallback/account");

The fallback endpoint is exposed on the gateway. I defined a simple controller class that implements a single fallback method.

@RestController
@RequestMapping("/fallback")
public class GatewayFallback {

    @GetMapping("/account")
    public Account getAccount() {
        Account a = new Account();
        a.setId(2);
        a.setNumber("123456");
        return a;
    }

}

Assuming we have exactly the same scenario as in the previous section, the current test returns only HTTP 200 responses instead of HTTP 5xx, as shown below.
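Outside of the gateway, the same idea (answer with a prepared value when the call fails or takes too long) can be sketched with the JDK alone. This is only an analogy to the forward: fallback, assuming Java 9 or newer for orTimeout; the class and method names and the sample values are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical plain-Java analogy of the fallback pattern; the gateway itself forwards to fallbackUri.
class FallbackSketch {

    /** Runs the call asynchronously and substitutes the fallback if it fails or exceeds the timeout. */
    static <T> T callWithFallback(Supplier<T> call, T fallback, long timeoutMs) {
        return CompletableFuture.supplyAsync(call)
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback)
                .join();
    }

    /** Simulates a downstream call that is much slower than the timeout. */
    static String slowCall() {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "account from downstream";
    }

    public static void main(String[] args) {
        System.out.println(callWithFallback(FallbackSketch::slowCall, "fallback account", 200));
        System.out.println(callWithFallback(() -> "account from downstream", "fallback account", 200));
    }
}
```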

6. Handling Slow Responses

In all the previous examples we set a short response timeout, which results in HTTP 504 - Gateway Timeout or a fallback. However, we don’t have to time out the requests; we can instead set a duration threshold and a rate threshold for slow responses. There are two parameters responsible for that: slowCallDurationThreshold and slowCallRateThreshold.

@Bean
public Customizer<ReactiveResilience4JCircuitBreakerFactory> defaultCustomizer() {
    return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
        .circuitBreakerConfig(CircuitBreakerConfig.custom()
           .slidingWindowSize(5)
           .permittedNumberOfCallsInHalfOpenState(5)
           .failureRateThreshold(50.0F)
           .waitDurationInOpenState(Duration.ofMillis(50))
           .slowCallDurationThreshold(Duration.ofMillis(200))
           .slowCallRateThreshold(50.0F)
           .build())
        .build());
}

Now the delayed responses no longer end with a timeout, but the circuit breaker still records them as slow calls. When the threshold is exceeded the circuit breaker opens, as shown below.
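How the two slow-call parameters interact can be shown with a short calculation. In the sketch below a call counts as slow when its duration is above the duration threshold (assumed here to be a strict comparison), and the circuit would open when the share of slow calls in the window reaches the rate threshold. The class name and the sample durations are made up for illustration.

```java
import java.util.List;

// Hypothetical helper illustrating slowCallDurationThreshold and slowCallRateThreshold.
class SlowCallRate {

    /** Percentage of calls in the window whose duration is above the threshold. */
    static float slowCallRatePercent(List<Long> durationsMs, long slowCallThresholdMs) {
        if (durationsMs.isEmpty()) {
            return 0.0f;
        }
        long slow = durationsMs.stream().filter(d -> d > slowCallThresholdMs).count();
        return 100.0f * slow / durationsMs.size();
    }

    /** True if the slow call rate reaches the configured rate threshold. */
    static boolean wouldOpen(List<Long> durationsMs, long slowCallThresholdMs, float slowCallRateThreshold) {
        return slowCallRatePercent(durationsMs, slowCallThresholdMs) >= slowCallRateThreshold;
    }

    public static void main(String[] args) {
        // A window of 5 calls: 2 slow calls out of 5 stay below 50%, 3 out of 5 reach it.
        System.out.println(wouldOpen(List.of(20L, 230L, 25L, 240L, 30L), 200, 50.0f));   // false
        System.out.println(wouldOpen(List.of(230L, 25L, 240L, 30L, 250L), 200, 50.0f));  // true
    }
}
```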


Spring Boot Best Practices for Microservices

piotr.minkowski — Fri, 06 Dec 2019 11:14:24 +0000

In this article I’m going to propose my list of “golden rules” for building Spring Boot applications that are part of a microservices-based system. It is based on my experience in migrating monolithic SOAP applications running on JEE servers into small REST-based applications built on top of Spring Boot. This list of Spring Boot best practices assumes you are running many microservices in production under heavy incoming traffic. Let’s begin.

1. Collect metrics

It is just amazing how metrics visualization can change the approach to systems monitoring in an organization. After setting up monitoring in Grafana we are able to recognize more than 90% of the bigger problems in our systems before they are reported by customers to our support team. Thanks to those two monitors with plenty of diagrams and alerts we can react much faster than before. If you have a microservices-based architecture, metrics become even more important than for monoliths.

The good news for us is that Spring Boot comes with a built-in mechanism for collecting the most important metrics. In fact, we just need to set some configuration properties to expose a predefined set of metrics provided by the Spring Boot Actuator. To use it we need to include the Actuator starter as a dependency:


<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

To enable the metrics endpoint we have to set the property management.endpoint.metrics.enabled to true. Now you may check out the full list of generated metrics by calling the endpoint GET /actuator/metrics. One of the most important metrics for us is http.server.requests, which provides statistics with the number of incoming requests and the response time. It is automatically tagged with the method type (POST, GET, etc.), HTTP status, and URI.

Metrics have to be stored somewhere. The most popular tools for that are InfluxDB and Prometheus. They represent two different models of collecting data. Prometheus periodically retrieves data from an endpoint exposed by the application, while InfluxDB provides a REST API that has to be called by the application. The integration with those two tools and several others is provided by the Micrometer library. To enable support for InfluxDB we have to include the following dependency.


<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-influx</artifactId>
</dependency>

We also have to provide at least the URL and the Influx database name inside the application.yml file.

management:
  metrics:
    export:
      influx:
        db: springboot
        uri: http://192.168.99.100:8086

To enable the Prometheus HTTP endpoint we first need to include the appropriate Micrometer module and also set the property management.endpoint.prometheus.enabled to true.


<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

By default, Prometheus tries to collect data from the defined target endpoint once a minute. The rest of the configuration has to be provided inside Prometheus. The scrape_configs section is responsible for specifying a set of targets and the parameters describing how to connect to them.

scrape_configs:
  - job_name: 'springboot'
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['person-service:2222']

Sometimes it is useful to add extra tags to metrics, especially if we have many instances of a single microservice reporting to a single Influx database. Here’s a sample of tagging for applications running on Kubernetes.

@Configuration
class ConfigurationMetrics {

    @Value("\${spring.application.name}")
    lateinit var appName: String
    @Value("\${NAMESPACE:default}")
    lateinit var namespace: String
    @Value("\${HOSTNAME:default}")
    lateinit var hostname: String

    @Bean
    fun tags(): MeterRegistryCustomizer<MeterRegistry> {
        return MeterRegistryCustomizer { registry ->
            registry.config().commonTags("appName", appName).commonTags("namespace", namespace).commonTags("pod", hostname)
        }
    }

}
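To make the effect of common tags concrete, the following plain-Java sketch merges the common tags with the tags of a single metric and renders one sample in the general shape of InfluxDB line protocol (measurement, comma-separated tags, then fields). It is an illustration only: the class name, the tag values and the count field are assumptions, and the exact output produced by Micrometer may differ.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration of common tags being attached to every sample.
class TaggedSample {

    /** Renders one sample as "measurement,tag=value,... count=<n>i". */
    static String render(String measurement, Map<String, String> commonTags,
                         Map<String, String> metricTags, long count) {
        Map<String, String> all = new TreeMap<>(commonTags); // sorted keys give a stable output
        all.putAll(metricTags);
        String tagPart = all.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return measurement + "," + tagPart + " count=" + count + "i";
    }

    public static void main(String[] args) {
        Map<String, String> common = Map.of(
                "appName", "person-service", "namespace", "default", "pod", "person-service-abc");
        System.out.println(render("http_server_requests", common, Map.of("status", "200"), 42));
    }
}
```

Because the common tags are present on every sample, a query can filter or group by pod or namespace even though all instances write to the same database.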

Here’s a diagram from Grafana created for http.server.requests metric of a single application.

2. Don’t forget about logging

Logging may not seem very important during development, but it is a key concern during maintenance. It is worth remembering that within your organization the application will be judged by the quality of its logs. Usually an application is maintained by a support team, so your logs should be meaningful. Don't try to log everything; only the most important events should be logged.
It is also important to use the same logging standard for all microservices. For example, if you log in JSON format, do the same for every single application. If you use the tag appName to indicate the application name, or instanceId to distinguish between instances of the same application, do it everywhere. Why? You usually want to store the logs collected from all microservices in a single, central place. The most popular tool for that (or rather a collection of tools) is the Elastic Stack (ELK). To take advantage of central log storage, you should ensure that the query criteria and the response structure are the same for all applications, especially since you will correlate logs between different microservices. How can you achieve that? By using an external library. I can recommend my library for Spring Boot logging. To use it, include it in your dependencies.


<dependency>
  <groupId>com.github.piomin</groupId>
  <artifactId>logstash-logging-spring-boot-starter</artifactId>
  <version>1.2.2.RELEASE</version>
</dependency>

This library enforces some good logging practices and automatically integrates with Logstash (the one of the three ELK tools that is responsible for collecting logs). Its main features are:

logging all incoming HTTP requests and outgoing HTTP responses with the full body, and sending those logs to Logstash with tags indicating the calling method name or the response HTTP status
calculating and storing the execution time of each request
generating a correlationId and propagating it to downstream services called with Spring RestTemplate
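The idea behind the correlationId feature can be sketched in a few lines of plain Java. This is not the code of the library: the class CorrelationIds, its methods and the header name X-Correlation-ID are assumptions made for illustration. A service reuses the ID it receives, or generates one if it is the first in the chain, and passes it on with every downstream call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of correlation ID handling; the header name is an assumption. */
final class CorrelationIds {

    static final String HEADER = "X-Correlation-ID"; // assumed header name

    /** Reuses the caller's correlation ID if present, otherwise generates a new one. */
    static String resolve(Map<String, String> incomingHeaders) {
        String id = incomingHeaders.get(HEADER);
        return (id == null || id.trim().isEmpty()) ? UUID.randomUUID().toString() : id;
    }

    /** Builds the headers for a downstream call so that the same ID travels with it. */
    static Map<String, String> outgoingHeaders(String correlationId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(HEADER, correlationId);
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new HashMap<>(); // first service in the chain: no ID yet
        String id = resolve(incoming);
        Map<String, String> downstream = outgoingHeaders(id);
        // The downstream service resolves the very same ID from the propagated header
        System.out.println(id.equals(resolve(downstream)));
    }
}
```

With the same ID present in the log entries of every service that took part in a request, a single query in the central log store returns the whole call chain.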

To enable sending logs to Logstash we should at least provide its address and set the property logging.logstash.enabled to true.

logging.logstash:
  enabled: true
  url: 192.168.99.100:5000

After including the library logstash-logging-spring-boot-starter you may take advantage of log tagging in Logstash. Here's a screenshot from Kibana showing a single response log entry.

We may also add the Spring Cloud Sleuth library to our dependencies.

 
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>

Spring Cloud Sleuth propagates headers compatible with Zipkin – a popular tool for distributed tracing. Its main features are:

adding trace (correlating requests) and span IDs to the Slf4J MDC
recording timing information to aid in latency analysis
modifying the log entry pattern to add information such as additional MDC fields
providing integration with other Spring components like OpenFeign, RestTemplate or Spring Cloud Netflix Zuul
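The relation between trace and span IDs is easier to see in code. The following plain Java sketch is not the Sleuth or Brave API; TraceContext and its methods are hypothetical, and the choice to reuse the trace ID as the root span ID is an assumption of this example. The trace ID stays the same across all hops of a request, while every hop gets its own span ID and remembers its parent.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical model of a trace context; not the Sleuth or Brave API. */
final class TraceContext {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the root span

    private TraceContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    /** 64-bit identifier rendered as 16 lower-case hex characters. */
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /** Starts a new trace; in this sketch the root span reuses the trace ID as its span ID. */
    static TraceContext root() {
        String id = newId();
        return new TraceContext(id, id, null);
    }

    /** Creates the context for a downstream call: same trace, new span, parent set. */
    TraceContext child() {
        return new TraceContext(traceId, newId(), spanId);
    }

    /** Prefix that could be added to every log entry of the given application. */
    String logPrefix(String appName) {
        return "[" + appName + "," + traceId + "," + spanId + "]";
    }

    public static void main(String[] args) {
        TraceContext caller = TraceContext.root();
        TraceContext callee = caller.child();
        System.out.println(caller.logPrefix("caller-service"));
        System.out.println(callee.logPrefix("callme-service"));
    }
}
```

Both printed prefixes share the trace ID, which is what allows log entries and timing data from different services to be grouped into one request.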

3. Make your API usable

In most cases, your application will be called by other applications through a REST-based API. Therefore, it is worth taking care of proper and clear documentation. The documentation should be generated along with the code, and there are tools for that. One of the most popular is Swagger. You can easily integrate Swagger 2 with your Spring Boot application using the SpringFox project. To expose a Swagger HTML site with API documentation we need to include the following dependencies. The first library generates the Swagger descriptor from the Spring MVC controller code, while the second embeds Swagger UI, which displays a representation of the Swagger descriptor in your web browser.


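<dependency>
   <groupId>io.springfox</groupId>
   <artifactId>springfox-swagger2</artifactId>
   <version>2.9.2</version>
</dependency>
<dependency>
   <groupId>io.springfox</groupId>
   <artifactId>springfox-swagger-ui</artifactId>
   <version>2.9.2</version>
</dependency>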

That's not all. We also have to provide some beans to customize the default Swagger generation behaviour. It should document only the methods implemented inside our controllers, not the ones provided automatically by Spring Boot, such as the /actuator/* endpoints. We may also customize the UI appearance by defining a UiConfiguration bean.

@Configuration
@EnableSwagger2
public class ConfigurationSwagger {

    @Autowired
    Optional<BuildProperties> build;

    @Bean
    public Docket api() {
        String version = "1.0.0";
        if (build.isPresent())
            version = build.get().getVersion();
        return new Docket(DocumentationType.SWAGGER_2)
                .apiInfo(apiInfo(version))
                .select()
                .apis(RequestHandlerSelectors.any())
                .paths(PathSelectors.regex("(/components.*)"))
                .build()
                .useDefaultResponseMessages(false)
                .forCodeGeneration(true);
    }

    @Bean
    public UiConfiguration uiConfig() {
        return UiConfigurationBuilder.builder().docExpansion(DocExpansion.LIST).build();
    }

    private ApiInfo apiInfo(String version) {
        return new ApiInfoBuilder()
                .title("API - Components Service")
                .description("Managing Components.")
                .version(version)
                .build();
    }
}
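The path selector in the Docket above is what keeps the Actuator endpoints out of the documentation. The plain Java sketch below applies the same regular expression to a few sample paths, assuming the selector requires the whole path to match the expression. The class DocumentedPaths and the sample paths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: which request paths match the regex used in the Docket above. */
final class DocumentedPaths {

    // Same expression as in PathSelectors.regex(...) in the configuration above
    static final String PATH_REGEX = "(/components.*)";

    /** Keeps only the paths that match the expression as a whole. */
    static List<String> documented(List<String> paths) {
        return paths.stream()
                .filter(path -> path.matches(PATH_REGEX))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/components", "/components/1", "/actuator/health", "/actuator/info");
        // Only the controller paths under /components remain
        System.out.println(documented(paths));
    }
}
```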

Here’s an example of Swagger 2 UI for a single microservice.

The next step is to define a common REST API guideline for all microservices. If you build the APIs of your microservices consistently, it is much simpler for both external and internal clients to integrate with them. The guideline should contain instructions on how to build your API, which headers need to be set on the request and response, how to generate error codes, and so on. Such a guideline should be shared with all developers and vendors in your organization. For a more detailed explanation of generating Swagger documentation for Spring Boot microservices, including exposing it for all applications on an API gateway, you may refer to my article Microservices API Documentation with Swagger2.

4. Don’t be afraid of using a circuit breaker

If you are using Spring Cloud for communication between microservices, you may leverage Spring Cloud Netflix Hystrix or Spring Cloud Circuit Breaker to implement circuit breaking. However, the first solution has already been moved to maintenance mode by the Pivotal team, since Netflix no longer develops Hystrix. The recommended solution is the new Spring Cloud Circuit Breaker, used here with its Resilience4j implementation.


<dependency>
   <groupId>org.springframework.cloud</groupId>
   <artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
</dependency>

Then we need to configure the required circuit breaker settings by defining a Customizer bean that is passed a Resilience4JCircuitBreakerFactory. As shown below, we use the default circuit breaker configuration together with a 5-second time limit.

@Bean
public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
    return factory -> factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
            .timeLimiterConfig(TimeLimiterConfig.custom().timeoutDuration(Duration.ofSeconds(5)).build())
            .circuitBreakerConfig(CircuitBreakerConfig.ofDefaults())
            .build());
}

For more details about integrating the Hystrix circuit breaker with a Spring Boot application you may refer to my article Part 3: Creating Microservices: Circuit Breaker, Fallback and Load Balancing with Spring Cloud.
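If the mechanism itself is new to you, the following plain Java sketch shows the basic state machine of a circuit breaker. It is deliberately simplified and is not how Resilience4j or Hystrix are implemented: it counts consecutive failures instead of measuring a failure rate, and the class SimpleCircuitBreaker, its parameters and the explicit clock argument are assumptions made for illustration.

```java
import java.util.function.Supplier;

/** Minimal circuit breaker sketch based on consecutive failures; not the Resilience4j implementation. */
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // how long the circuit stays open
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() {
        return state;
    }

    /** Runs the action, or returns the fallback while the circuit is open. */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback.get();   // short-circuit: do not call the failing service
            }
            state = State.HALF_OPEN;     // wait time elapsed: allow one trial call
        }
        try {
            T result = action.get();
            failures = 0;
            state = State.CLOSED;        // a successful call closes the circuit
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;      // trial failed or threshold reached
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        Supplier<String> fallback = () -> "fallback";
        breaker.call(failing, fallback, 0);
        breaker.call(failing, fallback, 10); // second failure reaches the threshold
        System.out.println(breaker.state());
    }
}
```

The current time is passed in as an argument only to keep the sketch easy to test; a real implementation would read a clock itself.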

5. Make your application transparent

Another important rule among Spring Boot best practices is transparency. We should not forget that one of the most important reasons for migrating to a microservices architecture is the requirement of continuous delivery. Today, the ability to deliver changes quickly gives you an advantage on the market. You should even be able to deliver changes several times a day. Therefore, it is important to know which version is currently running, where it has been released and what changes it includes.
When working with Spring Boot and Maven we can easily publish information such as the date of the last change, the Git commit id or the version number of the application. To achieve that we just need to include the following Maven plugins in our pom.xml.


   
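<plugins>
   <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <executions>
         <execution>
            <goals>
               <goal>build-info</goal>
            </goals>
         </execution>
      </executions>
   </plugin>
   <plugin>
      <groupId>pl.project13.maven</groupId>
      <artifactId>git-commit-id-plugin</artifactId>
      <configuration>
         <failOnNoGitDirectory>false</failOnNoGitDirectory>
      </configuration>
   </plugin>
</plugins>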

Assuming you have already included Spring Boot Actuator (see Section 1), you have to enable the /info endpoint to display this data.


management.endpoint.info.enabled: true
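The build-info goal of the Spring Boot Maven plugin writes a META-INF/build-info.properties file with keys such as build.artifact, build.version and build.time, and this is the data the /info endpoint picks up. The plain Java sketch below reads such content and strips the build. prefix. The class BuildInfoReader and the sample values are hypothetical, and Spring Boot itself does this through its own classes rather than through code like this.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical reader for the content of a build-info.properties file. */
final class BuildInfoReader {

    private static final String PREFIX = "build.";

    /** Returns all entries starting with "build.", keyed without that prefix. */
    static Map<String, String> read(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> build = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                build.put(key.substring(PREFIX.length()), props.getProperty(key));
            }
        }
        return build;
    }

    public static void main(String[] args) {
        // Assumed sample content; a real file is generated during the Maven build
        String sample = "build.artifact=person-service\n"
                + "build.version=1.0.0\n"
                + "build.time=2019-11-20T10:15:30Z\n";
        System.out.println(read(sample));
    }
}
```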

Of course, our system consists of many microservices, and there are several running instances of every single one. It is desirable to monitor all instances in a single, central place, just as with collecting metrics and logs. Fortunately, there is a tool dedicated to Spring Boot applications that is able to collect data from all Actuator endpoints and display it in a UI. It is Spring Boot Admin, developed by Codecentric. The most convenient way to run it is to create a dedicated Spring Boot application that includes the Spring Boot Admin dependencies and integrates with a discovery server, for example Spring Cloud Netflix Eureka.


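<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>spring-boot-admin-starter-server</artifactId>
    <version>2.1.6</version>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-eureka-client</artifactId>
</dependency>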

Then we should enable it for the Spring Boot application by annotating the main class with @EnableAdminServer.

@SpringBootApplication
@EnableDiscoveryClient
@EnableAdminServer
@EnableAutoConfiguration
public class Application {
 
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
 
}

With Spring Boot Admin we may easily browse a list of applications registered in the discovery server and check out the version or commit info for each of them.

We can expand details to see all elements retrieved from /info endpoint and much more data collected from other Actuator endpoints.

6. Write contract tests

Consumer Driven Contract (CDC) testing is one of the methods that allow you to verify integration between applications within your system. The number of such interactions may be really large, especially if you maintain a microservices-based architecture. It is relatively easy to start with contract testing in Spring Boot thanks to the Spring Cloud Contract project. There are other frameworks designed especially for CDC, like Pact, but Spring Cloud Contract would probably be the first choice, since we are using Spring Boot.
To use it on the producer side we need to include Spring Cloud Contract Verifier.


<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-contract-verifier</artifactId>
    <scope>test</scope>
</dependency>

On the consumer side we should include Spring Cloud Contract Stub Runner.



<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-contract-stub-runner</artifactId>
    <scope>test</scope>
</dependency>

The first step is to define a contract. One of the options is to write it in Groovy. The contract should be verified on both the producer and the consumer side. Here's a sample contract.


import org.springframework.cloud.contract.spec.Contract
Contract.make {
    request {
        method 'GET'
        urlPath('/persons/1')
    }
    response {
        status OK()
        body([
            id: 1,
            firstName: 'John',
            lastName: 'Smith',
            address: ([
                city: $(regex(alphaNumeric())),
                country: $(regex(alphaNumeric())),
                postalCode: $(regex('[0-9]{2}-[0-9]{3}')),
                houseNo: $(regex(positiveInt())),
                street: $(regex(nonEmpty()))
            ])
        ])
        headers {
            contentType(applicationJson())
        }
    }
}
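The regex matchers in the contract describe the shape of a value rather than a fixed value. To get a feel for what the postalCode matcher accepts, the plain Java sketch below applies the same expression to a few sample strings. The class PostalCodeCheck and the sample values are hypothetical, and this is not how Spring Cloud Contract evaluates contracts internally.

```java
import java.util.regex.Pattern;

/** Hypothetical check that applies the postalCode expression from the contract above. */
final class PostalCodeCheck {

    // Same pattern as the postalCode matcher in the contract
    static final Pattern POSTAL_CODE = Pattern.compile("[0-9]{2}-[0-9]{3}");

    /** Returns true only if the whole value has the form NN-NNN. */
    static boolean isValid(String value) {
        return value != null && POSTAL_CODE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("00-950")); // two digits, a dash, three digits
        System.out.println(isValid("00950"));  // the dash is missing
    }
}
```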

The contract is packaged in a JAR together with the stubs. It may be published to a repository manager like Artifactory or Nexus, and consumers may then download it from there during the JUnit test. The name of the generated JAR file is suffixed with stubs.

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = WebEnvironment.NONE)
@AutoConfigureStubRunner(ids = {"pl.piomin.services:person-service:+:stubs:8090"}, consumerName = "letter-consumer",  stubsPerConsumer = true, stubsMode = StubsMode.REMOTE, repositoryRoot = "http://192.168.99.100:8081/artifactory/libs-snapshot-local")
@DirtiesContext
public class PersonConsumerContractTest {
 
    @Autowired
    private PersonClient personClient;
     
    @Test
    public void verifyPerson() {
        Person p = personClient.findPersonById(1);
        Assert.assertNotNull(p);
        Assert.assertEquals(1, p.getId().intValue());
        Assert.assertNotNull(p.getFirstName());
        Assert.assertNotNull(p.getLastName());
        Assert.assertNotNull(p.getAddress());
        Assert.assertNotNull(p.getAddress().getCity());
        Assert.assertNotNull(p.getAddress().getCountry());
        Assert.assertNotNull(p.getAddress().getPostalCode());
        Assert.assertNotNull(p.getAddress().getStreet());
        Assert.assertNotEquals(0, p.getAddress().getHouseNo());
    }
     
}

Contract testing will not verify sophisticated use cases in your microservices-based system. However, it is the first phase of testing the interaction between microservices. Once you have ensured that the API contracts between applications are valid, you can proceed to more advanced integration or end-to-end tests. For a more detailed explanation of continuous integration with Spring Cloud Contract you may refer to my article Continuous Integration with Jenkins, Artifactory and Spring Cloud Contract.

7. Be up-to-date

Spring Boot and Spring Cloud release new versions relatively often. Assuming that your microservices have a small codebase, it is easy to upgrade the versions of the libraries they use. Spring Cloud releases new versions of its projects using the release train pattern, which simplifies dependency management and avoids conflicts between incompatible versions of libraries.
Moreover, Spring Boot systematically improves the startup time and memory footprint of applications, so it is worth updating just for that. Here are the current stable releases of Spring Boot and Spring Cloud.


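<parent>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-parent</artifactId>
   <version>2.2.1.RELEASE</version>
</parent>
<dependencyManagement>
   <dependencies>
      <dependency>
         <groupId>org.springframework.cloud</groupId>
         <artifactId>spring-cloud-dependencies</artifactId>
         <version>Hoxton.RELEASE</version>
         <type>pom</type>
         <scope>import</scope>
      </dependency>
   </dependencies>
</dependencyManagement>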

Conclusion

I showed you that it is not hard to follow best practices with Spring Boot features and some additional libraries that are part of Spring Cloud. These Spring Boot best practices will make it easier for you to migrate to a microservices-based architecture and also to run your applications in containers.

The post Spring Boot Best Practices for Microservices appeared first on Piotr's TechBlog.