# LanguageWire - DevOps Technical Challenge
This documentation describes the steps and actions I took to
deploy the merry_christmas application.
# Context
At LanguageWire we create our own neural models for machine translation. We combine these models with a runtime environment into Docker images. These images are deployed on a K8s cluster. We want to auto-scale the neural models when more translation traffic hits our APIs.
In this task, we want to assess your skills as a DevOps Engineer in building, deploying and auto-scaling containers.
# Tasks
We have coded a small Python REST endpoint that serves translations in
some language when a GET request is made on <host>/lw/xmas/<target language>. The service supports translations for “Merry Christmas” in
10 languages. The code logs all API requests to stdout.
You can find the code here: https://github.com/Languagewire/merry_christmas
Make a simple solution to the following in GCP (or AWS/Azure):
- Create a container image for this system and deploy it to a container registry.
- Deploy an instance of this container to a managed K8s service.
- Write a test script that calls the REST endpoint, 1 call/second.
- Configure the managed K8s service to autoscale the service instance to a second pod when the number of requests > 5 requests/second.
- Change your test script so it now fires 10 requests/sec.
- Show proof that the K8s service has autoscaled to a 2nd pod.
# Infrastructure Deployment
I assume the platform is already deployed. In these first steps, I am using my own kubernetes test infrastructure running on Void Linux. The kubernetes cluster could be configured with one of these tools:
- terraform (open-source): offers a common configuration format to deploy on AWS, GCP or Azure
- salt-cloud (open-source): offers another method to deploy the platform based on salt pillars and states
- ansible (open-source): offers another method to deploy the platform based on playbooks
- AWS CloudFormation stacks (private): offer a solution to deploy a full stack on AWS, only compatible with Amazon services
- AWS OpsWorks: uses Chef or Puppet to manage and deploy services and infrastructure, compatible with other providers and services
# Docker Image Creation
A docker image is defined in a Dockerfile, a template containing all the requirements.
touch Dockerfile
This image will require the following software and libraries:
- Python 3.5+
- Flask 1.1.1
- Gunicorn 20.0.4
- Jsonify
A good practice is to define an interface to set variables. In our case,
environment variables are enough and can be set with the ENV
instruction. Another good practice is to keep an image up to date by
updating and upgrading it with the latest available packages; in our
case, Alpine Linux uses the apk package manager. The project is stored in
a git repository (on github) and requires the git command line, which we can
install with apk.
Now that we have an image with all we need, we can clone the
repository, install the
python module requirements directly from the requirements.txt file, run
the service and expose the TCP port defined in the environment variables.
NOTE: pip and requirements.txt are used here to install the right
versions of the packages instead of taking them from the apk
repositories. If the developers release a new version, we can easily
update it by creating a new container without using the operating system
package manager.
# use official python image
# see https://hub.docker.com/_/python
FROM python:3.6-alpine
# one worker by default
ENV GUNICORN_WORKER="1"
# listen on all interface by default
ENV LISTEN_HOST="0.0.0.0"
# listen on port 8080 by default
ENV LISTEN_PORT="8080"
# default extra gunicorn arguments (access log to stdout and the wsgi entrypoint module)
ENV GUNICORN_OPTS="--access-logfile - wsgi"
# update and upgrade packages
RUN apk update && apk upgrade
# install git
RUN apk add git
# clone official repository
RUN git clone https://github.com/Languagewire/merry_christmas
# install requirements
RUN pip install -r merry_christmas/requirements.txt
# run service and expose default port
CMD cd merry_christmas && gunicorn -w ${GUNICORN_WORKER} -b ${LISTEN_HOST}:${LISTEN_PORT} ${GUNICORN_OPTS}
EXPOSE ${LISTEN_PORT}/tcp
Now, we can build our image by executing the docker build
command. It builds the image by executing each
instruction in a new intermediate container. Finally, we obtain our image id in
the output.
docker build .
This image can be tested locally by using docker run or docker create/start. The first command runs the container and gives you the output
(stderr/stdout) of the running application.
docker run ${image_build_id}
The service seems to be up. This is not the best way to test it, but it at least assures you that the last command executed correctly and did not generate any error.
[2020-03-03 11:34:55 +0000] [1] [INFO] Starting gunicorn 20.0.4
[2020-03-03 11:34:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-03-03 11:34:55 +0000] [1] [INFO] Using worker: sync
[2020-03-03 11:34:55 +0000] [7] [INFO] Booting worker with pid: 7
[2020-03-03 11:34:56 +0000] [1] [INFO] Handling signal: winch
The next method creates a container directly from the image
previously built, using the docker create and docker start commands.
docker create ${image_build_id}
docker start ${container_id}
docker create prints the id of the new container, and docker start runs
it and echoes the same id back. With this id, we can now inspect the
container, take its ip address and use curl to test that the service is
up and running.
docker inspect ${container_id} \
| jq -r '.[0].NetworkSettings.Networks.bridge.IPAddress' \
| xargs -I%i curl %i:8080/
# return Healthy!
docker inspect ${container_id} \
| jq -r '.[0].NetworkSettings.Networks.bridge.IPAddress' \
| xargs -I%i curl %i:8080/lw/xmas/de
# return "Fröhliche weihnachten"
Our container is working and ready to be stored in a registry.
NOTE: A good practice is to create a repository containing the
Dockerfile and its documentation (or to add the Dockerfile directly to the
main repository of the project). With this method, we can easily
create a CI/CD pipeline that simply clones the repository, builds the
image on the fly, and stores it in our Docker Registry. The CI/CD
pipeline can be triggered by repository modifications or by a
webhook.
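A minimal sketch of such a pipeline step, assuming a hypothetical repository holding the Dockerfile and the registry path used later in this document:
# clone the repository containing the Dockerfile (hypothetical URL)
git clone https://github.com/example/merry_christmas-docker
cd merry_christmas-docker
# build the image on the fly and tag it for the remote registry
docker build -t gcr.io/technical-challenges/languagewire:latest .
# store it in the Docker Registry
docker push gcr.io/technical-challenges/languagewire:latest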
# Docker Registry
AWS, GCP or Azure offer services to manage public and private Docker Registries.
- AWS ECR
- GCP Container Registry
- Azure Container Registry
Another solution is to manage our own private registry by using the official Docker Registry image, but this part is outside of the scope of this exercise.
# Google Cloud Bootstrapping
Google Container Registry requires configuring docker with (new or
existing) credentials. Google offers the
docker-credential-gcr
tool to help configure this access.
mkdir ${HOME}/bin
export PATH=${PATH}:${HOME}/bin
cd ${HOME}/bin
curl -L https://github.com/GoogleCloudPlatform/docker-credential-gcr/releases/download/v2.0.0/docker-credential-gcr_linux_amd64-2.0.0.tar.gz \
| tar zxvf -
chmod u+x docker-credential-gcr
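Once the binary is in the PATH, the helper still has to be registered in the local docker configuration; a minimal sketch, assuming the configure-docker sub-command of the release downloaded above:
# register the credential helper in ~/.docker/config.json
docker-credential-gcr configure-docker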
Another requirement is the gcloud command line. Unfortunately, this tool
is not packaged for some "weird" linux distributions, so I will
install it from the versioned archive.
mkdir ${HOME}/src
cd ${HOME}/src
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-282.0.0-linux-x86_64.tar.gz
tar zxvf google-cloud-sdk-282.0.0-linux-x86_64.tar.gz
cd google-cloud-sdk
./install.sh
# Google Cloud Credentials
To access the different services with the gcloud CLI, we first need to
authenticate to google with our credentials (by default, the mail
account). A token is generated and will be used for the rest of the
challenge.
gcloud auth login
We can ensure our user is authenticated by using the gcloud auth command
again. This command lists all active accounts.
gcloud auth list
Next, the project needs to be set with the gcloud config set project
command.
gcloud config set project technical-challenges
# Pushing Docker Images
The first thing is to list all available repositories and select the one
used for the challenge (or one with the rights to push and
pull). This repository is called
gcr.io/technical-challenges/languagewire.
gcloud container images list
A tag is needed before pushing an image. This tag is a reference to the remote repository plus a name. In this case, the previously built image will be used.
docker tag ${image_build_id} gcr.io/technical-challenges/languagewire:latest
This image can now be pushed to the repository.
docker push gcr.io/technical-challenges/languagewire:latest
The image should now be available in the repository; we can verify it
by using the gcloud command again to list the available tags.
gcloud container images list-tags gcr.io/technical-challenges/languagewire
NOTE: if you used terraform to deploy the cluster, ensure that this cluster is allowed to push/pull images by configuring the right OAuth scopes, otherwise you will get an error every time kubernetes tries to pull an image from your private registry.
# Deploying Kubernetes Cluster
Before doing anything, retrieving the available regions is a good idea.
gcloud compute regions list
Furthermore, retrieve information about one or more regions to see if the credentials have enough privileges, or if the region has enough resources.
gcloud compute regions describe europe-north1
# Terraform Deployment
Terraform needs a dedicated directory where all the configuration will be stored.
mkdir gcp
touch gcp/provider.tf
touch gcp/kubernetes.tf
provider.tf contains all the information used to connect to the google cloud
service, like credentials. account.json is generated from the google
credentials API. This file
contains sensitive information and should be protected. You should
also give enough privileges to the role to permit the creation of the
different resources needed.
provider "google" {
credentials = "${file("account.json")}"
project = "tc-mathieu-kerjouan"
region = "us-central1"
zone = "us-central1-a"
}
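As a side note, account.json can also be generated from the CLI instead of the web console; a minimal sketch, assuming a dedicated service account (the account name and role below are only examples):
# create a dedicated service account (hypothetical name)
gcloud iam service-accounts create terraform-deployer
# give it enough privileges to create the resources (example role)
gcloud projects add-iam-policy-binding tc-mathieu-kerjouan \
  --member serviceAccount:terraform-deployer@tc-mathieu-kerjouan.iam.gserviceaccount.com \
  --role roles/editor
# generate the account.json key used by the provider above
gcloud iam service-accounts keys create account.json \
  --iam-account terraform-deployer@tc-mathieu-kerjouan.iam.gserviceaccount.com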
Next, the resources to be deployed will be stored in the kubernetes.tf
file and will define a simple cluster with 1 node. This code comes
from the official documentation.
resource "google_container_cluster" "xmas-cluster" {
name = "merry-christmas"
location = "us-central1"
remove_default_node_pool = true
initial_node_count = 1
master_auth {
username = ""
password = ""
client_certificate_config {
issue_client_certificate = false
}
}
}
resource "google_container_node_pool" "xmas-cluster-pool" {
name = "xmas-cluster-pool"
location = "us-central1"
cluster = google_container_cluster.primary.name
node_count = 1
node_config {
preemptible = false
machine_type = "n1-standard-1"
metadata = {
disable-legacy-endpoints = "true"
}
oauth_scopes = [
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only"
]
}
}
terraform is now configured and the deployment requires the execution
of two commands: terraform init (fetches the providers) and terraform apply (applies the changes to the remote endpoint).
cd gcp
terraform init
terraform apply
NOTE: during deployment I had some issues with the project name. You
can easily debug terraform by setting the TF_LOG environment
variable to TRACE or DEBUG.
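For example, to get a verbose trace of the failing API calls:
TF_LOG=DEBUG terraform apply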
# Spawning Containers on Kubernetes
To ensure terraform created everything correctly, a good way is to
list the available clusters with gcloud or via the GCP Console.
gcloud container clusters list
Finally, to check that the cluster was correctly deployed with the right
information, the gcloud command can describe this resource based on the
cluster name and the region or zone.
gcloud container clusters describe xmas-cluster --region=us-central1
When everything seems good, kubectl can be configured locally by
executing the get-credentials sub-command. This command will create a
kubeconfig stored by default in ${HOME}/.kube/config.
gcloud container clusters get-credentials xmas-cluster --region=us-central1
Now, kubectl should be correctly configured and should point to the
xmas-cluster on GCP. The kubectl describe sub-command can give more
information about the kubernetes state.
kubectl describe nodes
Kubernetes uses manifest files to deploy the different configuration layers (e.g. services, endpoints or pods). Those files are written in YAML or JSON. This manifest will be split into 3 parts.
The first part is the deployment; in this manifest we pull the image and configure the metadata of the xmas container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xmas
spec:
  selector:
    matchLabels:
      run: xmas
  replicas: 3
  template:
    metadata:
      labels:
        run: xmas
    spec:
      containers:
      - name: xmas
        image: gcr.io/tc-mathieu-kerjouan/xmas
        ports:
        - containerPort: 8080
This code can be applied by executing kubectl apply.
kubectl apply -f deploy.yaml
The second part is the service, which gives access to the application running in the container (on a specific port). In this example, a simple LoadBalancer will be deployed.
apiVersion: v1
kind: Service
metadata:
  name: xmas
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: xmas
This code, like the last one, can be applied by executing kubectl apply.
kubectl apply -f service.yml
The service should now be available; to check that this is the case,
kubectl describe can return a lot of information about it.
kubectl describe service xmas
The list of all external IPs used by our cluster (and by our xmas
service) can be found by using a simple filter with jq.
kubectl get service xmas -o json \
| jq ".status.loadBalancer.ingress[].ip"
Or by using a filter directly from kubectl command.
kubectl get service xmas -o custom-columns="SERVICE:.status.loadBalancer.ingress[].ip"
In this example, the service got the external IP address
35.184.198.254 and the xmas service can be checked with curl.
curl http://35.184.198.254:8080/
# return
# Healthy!
curl http://35.184.198.254:8080/lw/xmas/en
# return
# "Merry christmas"
NOTE: It seems some inbound firewall rules need to be applied to the
cluster to allow traffic on specific ports. In this case, a
firewall rule can be created with the gcloud command.
gcloud compute firewall-rules create xmas --allow tcp:8080
- https://cloud.google.com/container-registry/docs/using-with-google-cloud-platform
# Ingress Load Balancing
Due to issues with some metrics, I decided to use an Ingress configuration instead of the classical load balancer. I also decided to rename some services to make things more accessible to others. The first step is to clean the existing configuration, to ensure it will not disturb our experiment.
kubectl delete service xmas
The Ingress configuration needs a NodePort service, which will allow the pods to share a port (in our case TCP/8080).
apiVersion: v1
kind: Service
metadata:
  name: xmas-web
  namespace: default
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: xmas
  type: NodePort
To deploy it, use the kubectl apply command.
kubectl apply -f xmas-service.yaml
Ingress configuration can be set now.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: xmas-ingress
spec:
  backend:
    serviceName: xmas-web
    servicePort: 8080
As usual, kubectl apply can be executed to set this configuration.
kubectl apply -f xmas-ingress.yaml
The Ingress does not work right away, because GKE needs to link an external IP address to it. Wait a few seconds (~30s on my side) to obtain the external IP.
kubectl get ingress xmas-ingress
The service is now reachable on the external IP.
curl ${ingress_ip}
# Test script
Generating test traffic against a web service can be done in many different ways.
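A minimal sketch for the first requirement (1 call/second), assuming the external IP obtained from the LoadBalancer above:
# call the endpoint once per second until interrupted
while true
do
    curl -s http://35.184.198.254:8080/lw/xmas/en
    sleep 1
done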
# Autoscaling
The Horizontal Pod Autoscaler fetches metrics from these APIs:
- metrics.k8s.io
- custom.metrics.k8s.io
- external.metrics.k8s.io
kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1/' \
| jq . \
| less
Kubernetes offers different versions of autoscaling. In this technical challenge I will deploy a standard Horizontal Pod Autoscaler (HPA) based on CPU and RAM metrics.
NOTE: due to some issues with custom metrics, I decided to test every sub-layer to ensure everything was working correctly.
Autoscaling requires configuring GKE accounts and installing pods in
kubernetes. You should also ensure that your kubernetes cluster is
correctly configured to allow monitoring and logging. This last part
can be configured directly from the console or with the gcloud CLI.
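As an assumption (the exact flags depend on the gcloud and GKE versions), Stackdriver monitoring and logging could be enabled on the existing cluster along these lines:
# enable Stackdriver Kubernetes monitoring/logging on the cluster
gcloud container clusters update xmas-cluster \
  --region us-central1 \
  --enable-stackdriver-kubernetes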
# Standard Metrics AutoScaling
By default, and with the v1 API, kubernetes can easily retrieve CPU
and RAM usage. To activate autoscaling, an HPA resource needs to be
created.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: xmas
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: xmas
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageValue: 200Mi
To create it, as usual, kubectl apply can be executed.
kubectl apply -f xmas-simple-hpa.yaml
A new HPA should be available now.
kubectl get hpa
will output
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
xmas Deployment/xmas 29601792/200Mi, 1%/80% 2 10 2 44m
More information is available with a describe.
kubectl describe hpa xmas
will output
Name: xmas
Namespace: default
Labels: <none>
Annotations: autoscaling.alpha.kubernetes.io/conditions:
[{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-03-05T07:34:41Z","reason":"ReadyForNewScale","message":"recommended size...
autoscaling.alpha.kubernetes.io/current-metrics:
[{"type":"Resource","resource":{"name":"memory","currentAverageValue":"29601792"}},{"type":"Resource","resource":{"name":"cpu","currentAve...
autoscaling.alpha.kubernetes.io/metrics: [{"type":"Resource","resource":{"name":"memory","targetAverageValue":"200Mi"}}]
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"xmas","namespace":"default"},"sp...
CreationTimestamp: Thu, 05 Mar 2020 08:34:39 +0100
Reference: Deployment/xmas
Target CPU utilization: 80%
Current CPU utilization: 1%
Min replicas: 2
Max replicas: 10
Deployment pods: 2 current / 2 desired
# Custom Metrics Autoscaling
It seems that autoscaling with custom metrics is a recent technology and can change significantly between releases and cloud providers. Due to that, the available documentation is sometimes incoherent and contains many typos.
Custom Metrics were designed in 2017/2018 and presented at the CNCF Conference in 2018 by Solly Ross and Maciej Pytel. In this talk, they cover the different autoscaling features and describe some usage of pod metrics, custom metrics and external metrics. The original PoC is available on Solly Ross' Github and the official version is available in the kubernetes repository on github.
Deploying custom metrics on GKE requires account and cluster configuration. Stackdriver must be available and enabled, and your user must have enough rights.
# from official documentation
kubectl create clusterrolebinding cluster-admin-binding \
--clusterrole cluster-admin --user "$(gcloud config get-value account)"
An adapter must be deployed to access and push metrics on Stackdriver.
# from official documentation
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
Normally, at this step, "custom.metrics.k8s.io" should be available and you should see a list of metric types.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" \
| jq .
At the same time, you can also check on the Google Cloud Monitoring console.
Here is the full resource list used to debug and try to understand why this was not working:
- https://blog.jetstack.io/blog/resource-and-custom-metrics-hpa-v2/
- https://blog.kloia.com/kubernetes-hpa-externalmetrics-prometheus-acb1d8a4ed50
- https://dzone.com/articles/ensure-high-availability-and-uptime-with-kubernete
- https://dzone.com/articles/kubernetes-autoscaling-explained
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/hpa-v2.md
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md
- https://github.com/kubernetes/kube-state-metrics
- https://github.com/kubernetes/metrics/blob/master/IMPLEMENTATIONS.md#custom-metrics-api
- https://github.com/stefanprodan/k8s-prom-hpa
- https://istio.io/blog/2017/0.1-canary/
- https://itnext.io/kubernetes-workers-autoscaling-based-on-rabbitmq-queue-size-cb0803193cdf
- https://jakubbujny.com/2018/10/07/container-and-cluster-scaling-on-kubernetes-using-horizontal-pod-autoscaler-and-cluster-autoscaler-on-aws-eks/
- https://koudingspawn.de/kubernetes-autoscaling/
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- https://medium.com/titansoft-engineering/hpa-failed-to-work-on-gke-and-how-we-fixed-it-babdef32e4fc
- https://stefanprodan.com/2018/kubernetes-horizontal-pod-autoscaler-prometheus-metrics/
- https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0.3/manage_cluster/hpa.html
Unfortunately, it seems I missed something, and the solution does not work with GKE.
# Local Script
A simple script to fire multiple connections in parallel:
#!/bin/sh
_check_connection() {
    local max target
    target="${1}"
    max="${2}"
    # a target URL is mandatory
    test ! "${target}" \
        && echo "target required" \
        && return 1
    # default to a single connection when no count is given
    test ! "${max}" \
        && max=1
    # fire ${max} curl calls in parallel against the target
    seq 1 ${max} \
        | xargs -I%i -P${max} curl "${target}"
}
_check_connection ${*}
This script uses seq, xargs and curl to generate a flow of
parallel connections based on a target and a number of
connections. Assuming this script is called "check.sh":
./check.sh ${target} ${connection_count}
To fire 10 connections each second, a simple while loop does the job.
while ./check.sh ${target} 10;
do
sleep 1
done
If the remote site becomes unreachable, the loop stops.
# Script in Container
The previous script gives a local solution to this problem. To reuse it inside the cluster, we can put it in a container: for example, start from the curlimages container and modify it, or use other dedicated load-testing tools.
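A minimal sketch of that approach, assuming the check.sh script above, the curlimages/curl image and the external IP from the LoadBalancer section (the mount path and arguments are only examples):
# run check.sh from inside a curl container: 10 parallel requests
docker run --rm \
  -v "$(pwd)/check.sh:/check.sh" \
  --entrypoint /bin/sh \
  curlimages/curl /check.sh http://35.184.198.254:8080/lw/xmas/en 10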
# Kubernetes Autoscaling
# Monitoring Application
To manually check if a pod is scaling, you can use the gcloud command or
the google cloud console monitoring. In both cases, the number of
running containers should grow based on the conditions
defined previously.
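Alternatively, the replica count and the HPA status can be watched directly with kubectl (the run=xmas label comes from the deployment manifest above):
# watch the HPA targets and the current replica count
kubectl get hpa xmas --watch
# list the pods created for the xmas deployment
kubectl get pods -l run=xmas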
# Troubleshooting
# Permissions
A permission issue appears during cluster creation via the gcloud
cli. The name was changed many times and different kinds of
configuration were used, without success.
gcloud container clusters create merry-christmas --zone us-central1-c
output:
ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Google Compute Engine: Required 'compute.networks.get' permission for 'projects/technical-challenges/global/networks/default'
The same issue occurs with the console when deploying in different zones and with different cluster sizes.
Google Compute Engine: Required 'compute.zones.get' permission for 'projects/technical-challenges/zones/us-central1-a'.
Similar issue with network permissions:
Google Compute Engine: Required 'compute.networks.get' permission for 'projects/technical-challenges/global/networks/default'.
# Check IAM Permissions
gcloud projects get-iam-policy technical-challenges
gcloud iam roles describe roles/owner | grep compute.zones
# - compute.zones.get
# - compute.zones.list
# Check Regions Available
gcloud compute regions list
output:
NAME CPUS DISKS_GB ADDRESSES RESERVED_ADDRESSES STATUS TURNDOWN_DATE
asia-east1 0/72 0/40960 0/69 0/21 UP
asia-east2 0/72 0/40960 0/69 0/21 UP
asia-northeast1 0/72 0/40960 0/69 0/21 UP
asia-northeast2 0/24 0/4096 0/8 0/8 UP
asia-northeast3 0/24 0/4096 0/8 0/8 UP
asia-south1 0/72 0/40960 0/69 0/21 UP
asia-southeast1 0/72 0/40960 0/69 0/21 UP
australia-southeast1 0/72 0/40960 0/69 0/21 UP
europe-north1 0/24 0/4096 0/8 0/8 UP
europe-west1 0/72 0/40960 0/69 0/21 UP
europe-west2 0/72 0/40960 0/69 0/21 UP
europe-west3 0/24 0/4096 0/8 0/8 UP
europe-west4 0/72 0/40960 0/69 0/21 UP
europe-west6 0/24 0/4096 0/8 0/8 UP
northamerica-northeast1 0/72 0/40960 0/69 0/21 UP
southamerica-east1 0/72 0/40960 0/69 0/21 UP
us-central1 0/72 0/40960 0/69 0/21 UP
us-east1 0/72 0/40960 0/69 0/21 UP
us-east4 0/72 0/40960 0/69 0/21 UP
us-west1 0/72 0/40960 0/69 0/21 UP
us-west2 0/72 0/40960 0/69 0/21 UP
us-west3 0/24 0/4096 0/8 0/8 UP
So, it seems all regions have enough resources to offer.
# Create a manual instance
Works as expected.
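The check was along these lines (the instance name and zone are only examples):
gcloud compute instances create manual-test --zone us-central1-c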
# Check on a freshly created account
Works as expected.
# Terraform deployment issue
An error 500 is returned by the kubernetes service when the project name is wrong and/or when the zone/region is not configured.
# FAQ
# How to debug prometheus?
A prometheus pod was created after the configuration of custom metrics and external metrics from the official Google Cloud documentation. This pod runs in the monitoring namespace.
kubectl -n monitoring get pods
Take the ID of the pod, and read the logs
kubectl -n monitoring logs prometheus-6b76966578-wggzx
It will output something like this:
level=info ts=2020-03-04T20:15:37.806525673Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2020-03-04T20:15:37.806868433Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2020-03-04T20:15:37.807039003Z caller=main.go:227 host_details="(Linux 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 prometheus-6b76966578-wggzx (none))"
level=info ts=2020-03-04T20:15:37.807125509Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-03-04T20:15:37.812151155Z caller=main.go:499 msg="Starting TSDB ..."
level=info ts=2020-03-04T20:15:37.824408896Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-04T20:15:37.834619663Z caller=main.go:509 msg="TSDB started"
level=info ts=2020-03-04T20:15:37.834873416Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2020-03-04T20:15:37.836629348Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.837682764Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.838433309Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.839227596Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.840029983Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-03-04T20:15:37.840654082Z caller=main.go:486 msg="Server is ready to receive web requests."
level=info ts=2020-03-04T20:15:37.842866397Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."
level=info ts=2020-03-04T23:00:04.538488734Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583352000000 maxt=1583359200000
level=info ts=2020-03-04T23:00:06.073084563Z caller=head.go:348 component=tsdb msg="head GC completed" duration=50.311649ms
level=info ts=2020-03-04T23:00:06.073409207Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=3.169µs
level=info ts=2020-03-05T01:00:04.44766283Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583359200000 maxt=1583366400000
level=info ts=2020-03-05T01:00:06.024622789Z caller=head.go:348 component=tsdb msg="head GC completed" duration=52.554305ms
level=info ts=2020-03-05T01:00:06.506285305Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=481.307648ms
level=info ts=2020-03-05T03:00:04.55516356Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583366400000 maxt=1583373600000
level=info ts=2020-03-05T03:00:06.157425284Z caller=head.go:348 component=tsdb msg="head GC completed" duration=50.882162ms
level=info ts=2020-03-05T03:00:06.653491516Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=495.78152ms
level=info ts=2020-03-05T05:00:04.554785703Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583373600000 maxt=1583380800000
level=info ts=2020-03-05T05:00:06.079075114Z caller=head.go:348 component=tsdb msg="head GC completed" duration=62.22199ms
level=info ts=2020-03-05T05:00:06.387626217Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=308.240291ms
level=info ts=2020-03-05T07:00:04.813504837Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583380800000 maxt=1583388000000
level=info ts=2020-03-05T07:00:06.407791167Z caller=head.go:348 component=tsdb msg="head GC completed" duration=55.376979ms
level=info ts=2020-03-05T07:00:06.900724634Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=492.641312ms
level=info ts=2020-03-05T09:00:00.434882319Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1583388000000 maxt=1583395200000
level=info ts=2020-03-05T09:00:01.888288174Z caller=head.go:348 component=tsdb msg="head GC completed" duration=50.724587ms
level=info ts=2020-03-05T09:00:02.390778444Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=502.175209ms
You can also access the configuration by using kubectl exec.
kubectl -n monitoring exec prometheus-6b76966578-wggzx cat /etc/prometheus/prometheus.yml
# How to list all available metrics?
Google Cloud offers you a long list of monitoring resources, available here:
- https://console.cloud.google.com/monitoring/dashboards/resourceList/kubernetes
Standard metrics are available directly from "metrics.k8s.io" API endpoint.
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/" \
| jq .
On kubernetes, it's a bit harder to see all available resources with monitoring support. With GKE, and after configuring your cluster with prometheus, you should have access to different REST endpoints giving you a full list of metrics.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" \
| jq .
If you have externally defined resources to monitor, you can list them with the "external.metrics.k8s.io" endpoint.
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/" \
| jq .
NOTE: the available API versions can be listed by querying only "/apis/external.metrics.k8s.io/", for example.
# Resources
# GCP
- https://github.com/GoogleCloudPlatform/docker-credential-gcr
- https://github.com/GoogleCloudPlatform/docker-credential-gcr/releases
- https://cloud.google.com/sdk/docs/downloads-versioned-archives
- https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
- https://cloud.google.com/kubernetes-engine/docs/how-to/horizontal-pod-autoscaling
- https://cloud.google.com/blog/products/containers-kubernetes/using-advanced-kubernetes-autoscaling-with-vertical-pod-autoscaler-and-node-auto-provisioning
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-more-specific-metrics
- https://medium.com/google-cloud/kubernetes-autoscaling-with-istio-metrics-76442253a45a
- https://docs.bitnami.com/kubernetes/how-to/configure-autoscaling-custom-metrics/
- https://blog.doit-intl.com/autoscaling-k8s-hpa-with-google-http-s-load-balancer-rps-stackdriver-metric-92db0a28e1ea
- https://cloud.google.com/monitoring/api/metrics_gcp
- https://cloud.google.com/monitoring/kubernetes-engine#metrics_explorer
# Authentication
- https://cloud.google.com/docs/authentication/production
# Registry
- https://cloud.google.com/container-registry/docs/pushing-and-pulling
- https://cloud.google.com/container-registry/pricing?hl=fr
- https://cloud.google.com/container-registry/docs/advanced-authentication?hl=fr
- https://cloud.google.com/container-registry/docs/using-with-google-cloud-platform
# Kubernetes
- https://kubernetes.io/docs/reference/kubectl/cheatsheet/
- https://cloud.google.com/kubernetes-engine/docs/
- https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-cluster
- https://cloud.google.com/kubernetes-engine/docs/concepts/deployment
- https://kubernetes.io/docs/tutorials/stateless-application/expose-external-ip-address/
- https://cloud.google.com/kubernetes-engine/docs/how-to/exposing-apps
- https://v1-13.docs.kubernetes.io/docs/concepts/services-networking/service/
- https://kubernetes.io/docs/concepts/services-networking/service/
# Terraform
- https://cloud.google.com/community/tutorials/managing-gcp-projects-with-terraform
- https://console.cloud.google.com/apis/credentials
- https://www.terraform.io/docs/providers/google/guides/provider_reference.html
- https://www.hashicorp.com/blog/managing-kubernetes-applications-with-hashicorp-terraform/
- https://www.terraform.io/docs/providers/kubernetes/r/pod.html