# LanguageWire - DevOps Technical Challenge

This document describes the steps I took to deploy the merry_christmas application.

# Context

At LanguageWire we create our own neural models for machine translation. We combine these models with a runtime environment into Docker images. These images are deployed on a K8s cluster. We want to auto-scale the neural models when more translation traffic hits our APIs.

In this task, we want to assess your skills as a DevOps Engineer in building, deploying and auto-scaling containers.

# Tasks

We have coded a small Python REST endpoint that serves translations in some language when a GET request is made on <host>/lw/xmas/<target language>. The service supports translations for “Merry Christmas” in 10 languages. The code logs all API requests to stdout.

You can find the code here: https://github.com/Languagewire/merry_christmas

Make a simple solution to the following in GCP (or AWS/Azure):

  • Create a container image for this system and deploy it to a container registry.

  • Deploy an instance of this container to a managed K8s service.

  • Write a test script that calls the REST endpoint, 1 call/second

  • Configure the managed K8S service to autoscale the service instance to a second pod when the number of requests > 5 requests/second.

  • Change your test script so it now fires 10 requests/sec

  • Show proof that the K8S service has autoscaled to a 2nd pod

# Infrastructure Deployment

I assume the platform is already deployed. For these first steps, I am using my own Kubernetes test infrastructure running on Void Linux. The Kubernetes cluster could be provisioned with one of these tools:

  • terraform (open source): offers a common configuration language to deploy on AWS, GCP or Azure

  • salt-cloud (open source): offers another method to deploy the platform, based on Salt pillars and states

  • ansible (open source): offers another method to deploy the platform, based on playbooks

  • AWS stack (proprietary): offers a solution to deploy a full stack on AWS; only compatible with Amazon services

  • AWS OpsWorks: uses Chef or Puppet to manage and deploy services and infrastructure; compatible with other providers and services.

# Docker Image Creation

A Docker image is defined in a Dockerfile, a template containing all the requirements.

touch Dockerfile

This image requires the following software and libraries:

  • Python 3.5+
  • Flask 1.1.1
  • Gunicorn 20.0.4
  • Jsonify

A good practice is to define an interface to set variables. In our case, environment variables are enough and can be set with the ENV instruction. Another good practice is to keep an image up to date by upgrading it to the latest available packages; Alpine Linux uses the apk package manager. The project is stored in a git repository (on GitHub) and requires the git command line, which we can install with apk.

Now that we have an image with everything we need, we can clone the repository, install the Python module requirements directly from the requirements.txt file, run the service and expose the TCP port defined in the environment variables.

NOTE: pip and requirements.txt are used here to pin the right versions of the packages instead of taking them from the apk repositories. If the developers release a new version, we can easily update it by building a new container, without involving the operating system package manager.

# use official python image
# see https://hub.docker.com/_/python
FROM python:3.6-alpine

# one worker by default
ENV GUNICORN_WORKER="1"

# listen on all interface by default
ENV LISTEN_HOST="0.0.0.0"

# listen on port 8080 by default
ENV LISTEN_PORT="8080"

# default extra gunicorn arguments
ENV GUNICORN_OPTS="--access-logfile - wsgi"

# update and upgrade packages
RUN apk update && apk upgrade

# install git
RUN apk add git

# clone official repository
RUN git clone https://github.com/Languagewire/merry_christmas

# install requirements
RUN pip install -r merry_christmas/requirements.txt

# run service and expose default port 
CMD cd merry_christmas && gunicorn -w ${GUNICORN_WORKER} -b ${LISTEN_HOST}:${LISTEN_PORT} ${GUNICORN_OPTS}
EXPOSE ${LISTEN_PORT}/tcp

Now we can build our image with the docker build command, which executes each instruction in a new intermediate container. The image id appears at the end of the output.

docker build .

This image can be tested locally with docker run, or with docker create and docker start. The first command runs the container and gives you the output (stderr/stdout) of the running application.

docker run ${image_build_id}

The service seems to be up. This is not the best way to test it, but it at least assures you that the last command executed correctly and did not raise an error.

[2020-03-03 11:34:55 +0000] [1] [INFO] Starting gunicorn 20.0.4
[2020-03-03 11:34:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-03-03 11:34:55 +0000] [1] [INFO] Using worker: sync
[2020-03-03 11:34:55 +0000] [7] [INFO] Booting worker with pid: 7
[2020-03-03 11:34:56 +0000] [1] [INFO] Handling signal: winch
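
A more automatic check could be baked into the image itself. A hypothetical HEALTHCHECK instruction for the Dockerfile above (assuming busybox wget is available in the Alpine base image):

```dockerfile
# hypothetical addition to the Dockerfile: let Docker probe the endpoint periodically
HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -q -O /dev/null "http://127.0.0.1:${LISTEN_PORT}/" || exit 1
```

With this in place, docker ps reports the container as healthy or unhealthy instead of only running.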

The next method creates a container from the previously built image using the docker create and docker start commands. docker create prints the id of the new container, which is then passed to docker start.

docker create ${image_build_id}
docker start ${container_id}

With this container id, we can now inspect the container, extract its IP address and use curl to test whether the service is up and running.

docker inspect ${container_id} \
    | jq -r '.[0].NetworkSettings.Networks.bridge.IPAddress' \
    | xargs -I%i curl %i:8080/
# returns Healthy!

docker inspect ${container_id} \
    | jq -r '.[0].NetworkSettings.Networks.bridge.IPAddress' \
    | xargs -I%i curl %i:8080/lw/xmas/de
# returns "Fröhliche weihnachten"

Our container is working and ready to be stored in a registry.

NOTE: A good practice is to create a repository containing the Dockerfile and its documentation (or to add the Dockerfile directly to the main repository of the project). This way, a CI/CD pipeline only has to clone the repository, build the image on the fly and store it in our Docker registry. The pipeline can be triggered on repository changes or by webhook.
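
On GCP, for example, such a pipeline could be described with a Cloud Build configuration. A minimal hypothetical cloudbuild.yaml (the image name xmas is an assumption):

```yaml
# hypothetical Cloud Build pipeline: build the image from the repository's
# Dockerfile and push it to the project's Container Registry
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/xmas:latest', '.']
images:
  - 'gcr.io/$PROJECT_ID/xmas:latest'
```

A trigger on the repository then rebuilds and pushes the image on every change.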

# Docker Registry

AWS, GCP and Azure all offer services to manage public and private Docker registries.

Another solution is to run our own private registry using the official Docker Registry image, but that is outside the scope of this exercise.

# Google Cloud Bootstrapping

Google Container Registry requires configuring docker with credentials. Google offers the docker-credential-gcr tool to help configure this access; once installed, it can be registered as a credential helper with docker-credential-gcr configure-docker.

mkdir ${HOME}/bin
export PATH=${PATH}:${HOME}/bin
cd ${HOME}/bin
curl -L https://github.com/GoogleCloudPlatform/docker-credential-gcr/releases/download/v2.0.0/docker-credential-gcr_linux_amd64-2.0.0.tar.gz \
    | tar zxf -
chmod u+x docker-credential-gcr

Another requirement is the gcloud command line. Unfortunately, this tool is not packaged for some "weird" versions of Linux, so I will install it from the versioned archive.

mkdir ${HOME}/src
cd ${HOME}/src
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-282.0.0-linux-x86_64.tar.gz
tar zxvf google-cloud-sdk-282.0.0-linux-x86_64.tar.gz
cd google-cloud-sdk
./install.sh

# Google Cloud Credentials

To access the different services with the gcloud CLI, we first need to authenticate to Google using our credentials (by default, a mail account). A token is generated and will be used for the rest of the challenge.

gcloud auth login

We can ensure our user is authenticated by using gcloud auth again; this command lists all active accounts.

gcloud auth list

Next, the project needs to be set with the gcloud config set project command.

gcloud config set project technical-challenges

# Pushing Docker Images

The first step is to list all available repositories and select the one used for the challenge (or one with push and pull rights). This repository is called gcr.io/technical-challenges/languagewire.

gcloud container images list

A tag is needed before pushing an image; the tag references the remote repository by name. Here, the previously built image is used.

docker tag ${image_build_id} gcr.io/technical-challenges/languagewire:latest

This image can now be pushed to the repository.

docker push gcr.io/technical-challenges/languagewire:latest

The image should now be available in the repository; we can verify this by using gcloud again to list the available tags.

gcloud container images list-tags gcr.io/technical-challenges/languagewire

NOTE: if you used terraform to deploy the cluster, make sure the cluster is allowed to push/pull images by configuring the right OAuth scopes, otherwise Kubernetes will raise an error every time it tries to pull an image from your private registry.

# Deploying Kubernetes Cluster

Before doing anything, it is a good idea to retrieve the list of available regions.

gcloud compute regions list

Furthermore, describing one or more regions shows whether the credentials have enough privileges and whether the region has enough resources.

gcloud compute regions describe europe-north1

# Terraform Deployment

Terraform needs a dedicated directory where all the configuration will be stored.

mkdir gcp
touch gcp/provider.tf
touch gcp/kubernetes.tf

provider.tf contains all the information used to connect to the Google Cloud service, such as credentials. account.json is generated from the Google credentials API. This file contains sensitive information and should be protected. You should also grant the role enough privileges to create the different resources needed.

provider "google" {
    credentials = "${file("account.json")}"
    project     = "tc-mathieu-kerjouan"
    region      = "us-central1"
    zone        = "us-central1-a"
}

Next, the resources to be deployed are stored in the kubernetes.tf file, which defines a simple cluster with 1 node. This code comes from the official documentation.

resource "google_container_cluster" "xmas-cluster" {
  name     = "merry-christmas"
  location = "us-central1"

  remove_default_node_pool = true
  initial_node_count       = 1

  master_auth {
    username = ""
    password = ""

    client_certificate_config {
      issue_client_certificate = false
    }
  }
}

resource "google_container_node_pool" "xmas-cluster-pool" {
  name       = "xmas-cluster-pool"
  location   = "us-central1"
  cluster    = google_container_cluster.xmas-cluster.name
  node_count = 1

  node_config {
    preemptible  = false
    machine_type = "n1-standard-1"

    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only"
    ]
  }
}

terraform is now configured; the deployment requires two commands, terraform init (to fetch the providers) and terraform apply (to apply the changes to the remote endpoint). terraform plan can be used beforehand to preview them.

cd gcp
terraform init
terraform apply

NOTE: during deployment I had some issues with the project name. Terraform is easy to debug by setting the TF_LOG environment variable to TRACE or DEBUG, e.g. TF_LOG=DEBUG terraform apply.

# Spawning Containers on Kubernetes

To make sure terraform created everything correctly, list the available clusters with gcloud or via the GCP Console.

gcloud container clusters list

Finally, to check that the cluster was deployed with the right settings, gcloud can describe the resource given the cluster name and the region or zone.

gcloud container clusters describe xmas-cluster --region=us-central1

When everything looks good, kubectl can be configured locally with the get-credentials sub-command, which creates a kubeconfig stored by default in ${HOME}/.kube/config.

gcloud container clusters get-credentials xmas-cluster --region=us-central1

kubectl should now be correctly configured and point to xmas-cluster on GCP. The kubectl describe sub-command gives more information about the Kubernetes state.

kubectl describe nodes

Kubernetes uses manifest files to deploy the different configuration layers (e.g. services, endpoints or pods). These files are written in YAML or JSON. The manifest here is split into 3 parts.

The first part is the deployment; in this manifest we pull the image and configure the metadata of the xmas container.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: xmas
spec:
  selector:
    matchLabels:
      run: xmas
  replicas: 3
  template:
    metadata:
      labels:
        run: xmas
    spec:
      containers:
      - name: xmas
        image: gcr.io/tc-mathieu-kerjouan/xmas
        ports:
        - containerPort: 8080

This manifest can be applied by executing kubectl apply.

kubectl apply -f deploy.yaml

The second part is the service, which exposes the application running in the container (on a specific port). In this example, a simple LoadBalancer is deployed.

apiVersion: v1
kind: Service
metadata:
  name: xmas
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: xmas

This manifest, like the previous one, can be applied by executing kubectl apply.

kubectl apply -f service.yml

The service should now be available; kubectl describe returns plenty of information about it.

kubectl describe service xmas

The external IPs used by our cluster (and by the xmas service) can be listed with a simple jq filter.

kubectl get service xmas -o json \
    | jq ".status.loadBalancer.ingress[].ip"

Or by using a filter directly from kubectl command.

kubectl get service xmas -o custom-columns="SERVICE:.status.loadBalancer.ingress[].ip"

In this example, the service got the external IP address 35.184.198.254 and xmas service can be checked with curl.

curl http://35.184.198.254:8080/
# return 
# Healthy!

curl http://35.184.198.254:8080/lw/xmas/en
# return
# "Merry christmas"

NOTE: some inbound firewall rules may need to be applied on the cluster to allow traffic on different ports. Such a rule can be created with the gcloud command.

gcloud compute firewall-rules create xmas --allow tcp:8080

# Ingress Load Balancing

Due to issues with some metrics, I decided to use an Ingress configuration instead of the classic load balancer. I also renamed some services to make things more accessible to others. The first step is to clean up the existing configuration, so that it does not disturb our experiment.

kubectl delete service xmas

The Ingress configuration needs a NodePort service, which lets pods share a port (in our case TCP/8080).

apiVersion: v1
kind: Service
metadata:
  name: xmas-web 
  namespace: default
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: xmas
  type: NodePort

To deploy it, use the kubectl apply command.

kubectl apply -f xmas-service.yaml

Ingress configuration can be set now.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: xmas-ingress
spec:
  backend:
    serviceName: xmas-web
    servicePort: 8080

As usual, kubectl apply can be executed to set this configuration.

kubectl apply -f xmas-ingress.yaml

The Ingress does not work right away, because GKE needs to attach an external IP address to it. Wait a few seconds (~30s in my case) to obtain the external IP.

kubectl get ingress xmas-ingress

The service is now reachable on the external IP.

curl ${ingress_ip}

# Test script

Generating test traffic against a web service can be done in many different ways.
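
For instance, a minimal sketch in shell covering the "1 call/second" requirement (the rate_loop helper and the ingress_ip variable are assumptions):

```shell
#!/bin/sh
# rate_loop <count> <command...>: run <command> <count> times, one second apart
rate_loop() {
    count="${1}"
    shift
    i=0
    while [ "${i}" -lt "${count}" ]; do
        "${@}"
        i=$((i + 1))
        sleep 1
    done
}

# example usage (target assumed):
# rate_loop 60 curl -s "http://${ingress_ip}/lw/xmas/en"
```

Passing any other command instead of curl makes the helper easy to test locally.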

# Autoscaling

Horizontal Pod Autoscaling fetches metrics from these APIs:

  • metrics.k8s.io
  • custom.metrics.k8s.io
  • external.metrics.k8s.io

kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1/' \
    | jq . \
    | less

Kubernetes offers different versions of autoscaling. In this technical challenge I will deploy a standard Horizontal Pod Autoscaler (HPA) based on CPU and RAM.

NOTE: due to some issues with custom metrics, I tried every sub-layer to make sure everything was working correctly.

Autoscaling requires configuring GKE accounts and installing pods in Kubernetes. You should also ensure that your cluster is configured to allow monitoring and logging; this can be done directly from the console or with the gcloud CLI.

# Standard Metrics AutoScaling

By default, with the v1 API, Kubernetes can easily retrieve CPU and RAM usage. To activate autoscaling, an HPA resource needs to be created.

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: xmas 
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: xmas 
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageValue: 200Mi

To create it, as usual, kubectl apply can be executed.

kubectl apply -f xmas-simple-hpa.yaml

A new HPA should be available now.

kubectl get hpa

will output

NAME   REFERENCE         TARGETS                  MINPODS   MAXPODS   REPLICAS   AGE
xmas   Deployment/xmas   29601792/200Mi, 1%/80%   2         10        2          44m

And more information available with a describe.

kubectl describe hpa xmas

will output

Name:                     xmas
Namespace:                default
Labels:                   <none>
Annotations:              autoscaling.alpha.kubernetes.io/conditions:
                            [{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-03-05T07:34:41Z","reason":"ReadyForNewScale","message":"recommended size...
                          autoscaling.alpha.kubernetes.io/current-metrics:
                            [{"type":"Resource","resource":{"name":"memory","currentAverageValue":"29601792"}},{"type":"Resource","resource":{"name":"cpu","currentAve...
                          autoscaling.alpha.kubernetes.io/metrics: [{"type":"Resource","resource":{"name":"memory","targetAverageValue":"200Mi"}}]
                          kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"xmas","namespace":"default"},"sp...
CreationTimestamp:        Thu, 05 Mar 2020 08:34:39 +0100
Reference:                Deployment/xmas
Target CPU utilization:   80%
Current CPU utilization:  1%
Min replicas:             2
Max replicas:             10
Deployment pods:          2 current / 2 desired
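
One caveat with utilization-based targets: the HPA computes CPU utilization as a percentage of the container's resource requests, so the deployment should declare them. A hypothetical addition to the xmas container spec in deploy.yaml (the values are assumptions):

```yaml
# requests give the HPA a baseline to compute utilization percentages against
resources:
  requests:
    cpu: 100m
    memory: 64Mi
```

Without requests, the CPU percentage target cannot be evaluated for the pod.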

# Custom Metrics Autoscaling

Autoscaling with custom metrics is a recent feature and changes easily between releases and cloud providers. As a result, the available documentation is sometimes incoherent and full of typos.

Custom metrics were designed in 2017/2018 and presented at the CNCF conference in 2018 by Solly Ross and Maciej Pytel. In this talk, they discuss the different autoscaling features and describe some usage of pod metrics, custom metrics and external metrics. The original PoC is available on Solly Ross' GitHub, and the official version on the kubernetes repository on GitHub.

Deploying custom metrics on GKE requires account and cluster configuration. Stackdriver must be available and enabled, and your user must have enough rights.

# from official documentation
kubectl create clusterrolebinding cluster-admin-binding \
    --clusterrole cluster-admin --user "$(gcloud config get-value account)"

An adapter must be deployed to access and push metrics on Stackdriver.

# from official documentation
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

Normally, at this step, "custom.metrics.k8s.io" should be available and you should see a list of metric types.

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" \
    | jq .

At the same time, you can also check the Google Cloud Monitoring console.

Here is the full list of resources used to debug and to try to understand why this was not working:

  • https://blog.jetstack.io/blog/resource-and-custom-metrics-hpa-v2/
  • https://blog.kloia.com/kubernetes-hpa-externalmetrics-prometheus-acb1d8a4ed50
  • https://dzone.com/articles/ensure-high-availability-and-uptime-with-kubernete
  • https://dzone.com/articles/kubernetes-autoscaling-explained
  • https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/hpa-v2.md
  • https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md
  • https://github.com/kubernetes/kube-state-metrics
  • https://github.com/kubernetes/metrics/blob/master/IMPLEMENTATIONS.md#custom-metrics-api
  • https://github.com/stefanprodan/k8s-prom-hpa
  • https://istio.io/blog/2017/0.1-canary/
  • https://itnext.io/kubernetes-workers-autoscaling-based-on-rabbitmq-queue-size-cb0803193cdf
  • https://jakubbujny.com/2018/10/07/container-and-cluster-scaling-on-kubernetes-using-horizontal-pod-autoscaler-and-cluster-autoscaler-on-aws-eks/
  • https://koudingspawn.de/kubernetes-autoscaling/
  • https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
  • https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
  • https://medium.com/titansoft-engineering/hpa-failed-to-work-on-gke-and-how-we-fixed-it-babdef32e4fc
  • https://stefanprodan.com/2018/kubernetes-horizontal-pod-autoscaler-prometheus-metrics/
  • https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0.3/manage_cluster/hpa.html

Unfortunately, it seems I missed something, and the solution does not work with GKE.

# Local Script

A simple script to run several connections in parallel:

#!/bin/sh

_check_connection() {
    local max target
    target="${1}"
    max="${2}"
    test ! "${target}" \
        && echo "target required" \
        && return 1
    test ! "${max}" \
        && max=1
    seq 1 ${max} \
        | xargs -I%i -P${max} curl "${target}"
}

_check_connection ${*}

This script uses seq, xargs and curl to generate a flow of parallel connections given a target and a number of connections. Assume the script is called check.sh.

./check.sh ${target} ${connection_count}

To fire 10 connections each second, a simple while loop does the job.

while ./check.sh ${target} 10;
do
  sleep 1
done

If the remote site becomes unreachable, the loop stops.

# Script in Container

The script above gives us a local solution to the problem. We can also reuse it inside a container: either start from the curlimages image and modify it, or use dedicated load-testing tools.
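
A hypothetical sketch of such a container, based on the curlimages/curl image (the entrypoint and file layout are assumptions):

```dockerfile
# package check.sh into a small curl-based image; the arguments passed to
# `docker run` become the target and the connection count
FROM curlimages/curl:latest
COPY check.sh /check.sh
ENTRYPOINT ["/bin/sh", "/check.sh"]
```

Running it as a pod inside the cluster keeps the test traffic close to the service.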

# Kubernetes Autoscaling

# Monitoring Application

To manually check whether a pod is scaling, you can use the gcloud command or the Google Cloud Console monitoring. In both cases, the number of running containers should grow according to the conditions defined previously; kubectl get hpa, used above, reports the current replica count from the command line.

# Troubleshooting

# Permissions

A permission issue appeared during cluster creation via the gcloud CLI. I changed the name many times and tried different kinds of configuration, without success.

gcloud container clusters create merry-christmas --zone us-central1-c

output:

ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Google Compute Engine: Required 'compute.networks.get' permission for 'projects/technical-challenges/global/networks/default'

The same issue occurs in the console when deploying in a different zone or with a different cluster size.

Google Compute Engine: Required 'compute.zones.get' permission for 'projects/technical-challenges/zones/us-central1-a'.

A similar issue occurs with network permissions:

Google Compute Engine: Required 'compute.networks.get' permission for 'projects/technical-challenges/global/networks/default'. 

# Check IAM Permissions

gcloud projects get-iam-policy technical-challenges
gcloud iam roles describe roles/owner | grep compute.zones
# - compute.zones.get
# - compute.zones.list

# Check Regions Available

gcloud compute regions list

output:

NAME                     CPUS  DISKS_GB  ADDRESSES  RESERVED_ADDRESSES  STATUS  TURNDOWN_DATE
asia-east1               0/72  0/40960   0/69       0/21                UP
asia-east2               0/72  0/40960   0/69       0/21                UP
asia-northeast1          0/72  0/40960   0/69       0/21                UP
asia-northeast2          0/24  0/4096    0/8        0/8                 UP
asia-northeast3          0/24  0/4096    0/8        0/8                 UP
asia-south1              0/72  0/40960   0/69       0/21                UP
asia-southeast1          0/72  0/40960   0/69       0/21                UP
australia-southeast1     0/72  0/40960   0/69       0/21                UP
europe-north1            0/24  0/4096    0/8        0/8                 UP
europe-west1             0/72  0/40960   0/69       0/21                UP
europe-west2             0/72  0/40960   0/69       0/21                UP
europe-west3             0/24  0/4096    0/8        0/8                 UP
europe-west4             0/72  0/40960   0/69       0/21                UP
europe-west6             0/24  0/4096    0/8        0/8                 UP
northamerica-northeast1  0/72  0/40960   0/69       0/21                UP
southamerica-east1       0/72  0/40960   0/69       0/21                UP
us-central1              0/72  0/40960   0/69       0/21                UP
us-east1                 0/72  0/40960   0/69       0/21                UP
us-east4                 0/72  0/40960   0/69       0/21                UP
us-west1                 0/72  0/40960   0/69       0/21                UP
us-west2                 0/72  0/40960   0/69       0/21                UP
us-west3                 0/24  0/4096    0/8        0/8                 UP

So it seems all regions have enough resources to offer.

# Create a manual instance

Works as expected.

# Check on a freshly created account

Works as expected.

# Terraform deployment issue

The Kubernetes service returns error 500 when the project name is wrong and/or the zone/region is not configured.

# FAQ

# How to debug prometheus?

A Prometheus pod was created when configuring custom and external metrics following the official Google Cloud documentation. This pod runs in the monitoring namespace.

kubectl -n monitoring get pods

Take the pod's ID and read its logs.

kubectl -n monitoring logs prometheus-6b76966578-wggzx

It will output something like this:

level=info ts=2020-03-04T20:15:37.806525673Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2020-03-04T20:15:37.806868433Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2020-03-04T20:15:37.807039003Z caller=main.go:227 host_details="(Linux 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 prometheus-6b76966578-wggzx (none))"
level=info ts=2020-03-04T20:15:37.807125509Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-03-04T20:15:37.812151155Z caller=main.go:499 msg="Starting TSDB ..."
level=info ts=2020-03-04T20:15:37.824408896Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-04T20:15:37.834619663Z caller=main.go:509 msg="TSDB started"
level=info ts=2020-03-04T20:15:37.834873416Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2020-03-04T20:15:37.836629348Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.837682764Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.838433309Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.839227596Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.840029983Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.840654082Z caller=main.go:486 msg="Server is ready to receive web requests."
level=info ts=2020-03-04T20:15:37.842866397Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."
level=info ts=2020-03-04T23:00:04.538488734Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583352000000 maxt=1583359200000
level=info ts=2020-03-04T23:00:06.073084563Z caller=head.go:348 component=tsdb msg="head GC completed" duration=50.311649ms
level=info ts=2020-03-04T23:00:06.073409207Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=3.169µs
level=info ts=2020-03-05T01:00:04.44766283Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583359200000 maxt=1583366400000
level=info ts=2020-03-05T01:00:06.024622789Z caller=head.go:348 component=tsdb msg="head GC completed" duration=52.554305ms
level=info ts=2020-03-05T01:00:06.506285305Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=481.307648ms
level=info ts=2020-03-05T03:00:04.55516356Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583366400000 maxt=1583373600000
level=info ts=2020-03-05T03:00:06.157425284Z caller=head.go:348 component=tsdb msg="head GC completed" duration=50.882162ms
level=info ts=2020-03-05T03:00:06.653491516Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=495.78152ms
level=info ts=2020-03-05T05:00:04.554785703Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583373600000 maxt=1583380800000
level=info ts=2020-03-05T05:00:06.079075114Z caller=head.go:348 component=tsdb msg="head GC completed" duration=62.22199ms
level=info ts=2020-03-05T05:00:06.387626217Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=308.240291ms
level=info ts=2020-03-05T07:00:04.813504837Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583380800000 maxt=1583388000000
level=info ts=2020-03-05T07:00:06.407791167Z caller=head.go:348 component=tsdb msg="head GC completed" duration=55.376979ms
level=info ts=2020-03-05T07:00:06.900724634Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=492.641312ms
level=info ts=2020-03-05T09:00:00.434882319Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583388000000 maxt=1583395200000
level=info ts=2020-03-05T09:00:01.888288174Z caller=head.go:348 component=tsdb msg="head GC completed" duration=50.724587ms
level=info ts=2020-03-05T09:00:02.390778444Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=502.175209ms

You can also access the configuration by using kubectl exec.

kubectl -n monitoring exec prometheus-6b76966578-wggzx cat /etc/prometheus/prometheus.yml

# How to list all available metrics?

Google Cloud offers a long list of monitoring resources, available here:

  • https://console.cloud.google.com/monitoring/dashboards/resourceList/kubernetes

Standard metrics are available directly from "metrics.k8s.io" API endpoint.

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/" \
    | jq .

On Kubernetes itself, it is a bit harder to see all the resources with monitoring support. On GKE, after configuring your cluster with Prometheus, you have access to different REST endpoints giving you the full list of metrics.

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" \
    | jq .

If you have defined external resources to monitor, you can list them with the "external.metrics.k8s.io" endpoint.

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/" \
    | jq .

NOTE: several API versions can be listed by querying only "/apis/external.metrics.k8s.io/", for example.

# Resources

# GCP

  • https://github.com/GoogleCloudPlatform/docker-credential-gcr
  • https://github.com/GoogleCloudPlatform/docker-credential-gcr/releases
  • https://cloud.google.com/sdk/docs/downloads-versioned-archives
  • https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
  • https://cloud.google.com/kubernetes-engine/docs/how-to/horizontal-pod-autoscaling
  • https://cloud.google.com/blog/products/containers-kubernetes/using-advanced-kubernetes-autoscaling-with-vertical-pod-autoscaler-and-node-auto-provisioning
  • https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-more-specific-metrics
  • https://medium.com/google-cloud/kubernetes-autoscaling-with-istio-metrics-76442253a45a
  • https://docs.bitnami.com/kubernetes/how-to/configure-autoscaling-custom-metrics/
  • https://blog.doit-intl.com/autoscaling-k8s-hpa-with-google-http-s-load-balancer-rps-stackdriver-metric-92db0a28e1ea
  • https://cloud.google.com/monitoring/api/metrics_gcp
  • https://cloud.google.com/monitoring/kubernetes-engine#metrics_explorer

# Authentication

  • https://cloud.google.com/docs/authentication/production

# Registry

  • https://cloud.google.com/container-registry/docs/pushing-and-pulling
  • https://cloud.google.com/container-registry/pricing?hl=fr
  • https://cloud.google.com/container-registry/docs/advanced-authentication?hl=fr
  • https://cloud.google.com/container-registry/docs/using-with-google-cloud-platform

# Kubernetes

  • https://kubernetes.io/docs/reference/kubectl/cheatsheet/
  • https://cloud.google.com/kubernetes-engine/docs/
  • https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-cluster
  • https://cloud.google.com/kubernetes-engine/docs/concepts/deployment
  • https://kubernetes.io/docs/tutorials/stateless-application/expose-external-ip-address/
  • https://cloud.google.com/kubernetes-engine/docs/how-to/exposing-apps
  • https://v1-13.docs.kubernetes.io/docs/concepts/services-networking/service/
  • https://kubernetes.io/docs/concepts/services-networking/service/

# Terraform

  • https://cloud.google.com/community/tutorials/managing-gcp-projects-with-terraform
  • https://console.cloud.google.com/apis/credentials
  • https://www.terraform.io/docs/providers/google/guides/provider_reference.html
  • https://www.hashicorp.com/blog/managing-kubernetes-applications-with-hashicorp-terraform/
  • https://www.terraform.io/docs/providers/kubernetes/r/pod.html