
Monitoring Flink with Prometheus and Grafana

Pushgateway

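Pushgateway is an important tool in the Prometheus ecosystem. Prometheus works on a pull model, but for various reasons it sometimes cannot scrape each target directly, so the metrics have to be collected in one place first. Pushgateway is that place: jobs push their metrics to it, and Prometheus scrapes the gateway.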
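To make the push model concrete, here is a minimal Python sketch of a job pushing one gauge to a Pushgateway. The job name `demo_job`, the metric name, and the helper function names are illustrative assumptions; port 9091 is the Pushgateway port used later in this post, and `/metrics/job/<job>` is the Pushgateway push path.

```python
import urllib.parse
import urllib.request


def push_url(host: str, port: int, job: str) -> str:
    """Build the Pushgateway push endpoint for a job (grouping key: job only)."""
    return f"http://{host}:{port}/metrics/job/{urllib.parse.quote(job, safe='')}"


def render_gauge(name: str, value: float, help_text: str) -> bytes:
    """Render one gauge in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name} {value}",
    ]
    # The request body has to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_gauge(host: str, port: int, job: str, name: str, value: float) -> int:
    """PUT the metric to the Pushgateway and return the HTTP status code."""
    request = urllib.request.Request(
        push_url(host, port, job),
        data=render_gauge(name, value, "Example metric (hypothetical)."),
        headers={"Content-Type": "text/plain; version=0.0.4"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

With a Pushgateway running, a call such as `push_gauge("localhost", 9091, "demo_job", "demo_value", 1.0)` should make `demo_value` appear on the gateway's own `/metrics` page, which is what Prometheus scrapes.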

Download and install

mkdir -p /usr/local/prometheus
cd /usr/local/prometheus
wget https://github.com/prometheus/pushgateway/releases/download/v1.0.0/pushgateway-1.0.0.linux-amd64.tar.gz
tar -zxvf pushgateway-1.0.0.linux-amd64.tar.gz
cd pushgateway-1.0.0.linux-amd64
# Start Pushgateway in the background (it listens on port 9091 by default)
nohup /usr/local/prometheus/pushgateway-1.0.0.linux-amd64/pushgateway > /usr/local/prometheus/pushgateway-1.0.0.linux-amd64/nohup.out 2>&1 &

Installing node_exporter

Download and install

cd /usr/local/prometheus
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar -zxvf node_exporter-0.18.1.linux-amd64.tar.gz
nohup /usr/local/prometheus/node_exporter-0.18.1.linux-amd64/node_exporter > /usr/local/prometheus/node_exporter-0.18.1.linux-amd64/nohup.out 2>&1 &

Installing Prometheus

Download and install

# Create the /usr/local/prometheus directory if it does not exist yet
mkdir -p /usr/local/prometheus
cd /usr/local/prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
tar -zxvf prometheus-2.14.0.linux-amd64.tar.gz
cd prometheus-2.14.0.linux-amd64

Default configuration

Out of the box, Prometheus scrapes only its own runtime metrics.

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

Modified configuration

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configurations: Prometheus itself, node_exporter, and Pushgateway.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  # node_exporter (host metrics)
  - job_name: 'linux'
    static_configs:
    - targets: ['localhost:9100']
      labels:
        instance: 'localhost'
  # Pushgateway (receives the metrics pushed by Flink)
  - job_name: 'pushgateway'
    static_configs:
    - targets: ['localhost:9091']
      labels:
        instance: 'pushgateway'
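As a quick way to read this configuration, the following Python sketch derives the URL Prometheus would scrape for each job, applying the two defaults named in the config comments (`scheme` is `http`, `metrics_path` is `/metrics`). The function name and the Python representation of the YAML are illustrative assumptions, not part of Prometheus.

```python
from typing import Dict, List


def scrape_urls(scrape_configs: List[dict]) -> Dict[str, List[str]]:
    """Map each job name to the URLs Prometheus would scrape for it."""
    urls: Dict[str, List[str]] = {}
    for job in scrape_configs:
        scheme = job.get("scheme", "http")          # default per the config comments
        path = job.get("metrics_path", "/metrics")  # default per the config comments
        targets = [
            target
            for group in job.get("static_configs", [])
            for target in group.get("targets", [])
        ]
        urls[job["job_name"]] = [f"{scheme}://{target}{path}" for target in targets]
    return urls


# The three jobs from the modified configuration, written as Python data.
SCRAPE_CONFIGS = [
    {"job_name": "prometheus",
     "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "linux",
     "static_configs": [{"targets": ["localhost:9100"],
                         "labels": {"instance": "localhost"}}]},
    {"job_name": "pushgateway",
     "static_configs": [{"targets": ["localhost:9091"],
                         "labels": {"instance": "pushgateway"}}]},
]
```

Calling `scrape_urls(SCRAPE_CONFIGS)` lists one URL per job; opening any of them in a browser or with `curl` is a simple way to confirm that the exporter behind it is answering.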

Start

nohup /usr/local/prometheus/prometheus-2.14.0.linux-amd64/prometheus --config.file=/usr/local/prometheus/prometheus-2.14.0.linux-amd64/prometheus.yml >/usr/local/prometheus/prometheus-2.14.0.linux-amd64/nohup.out 2>&1 &

Check the ports

netstat -apn | grep -E '9091|3000|9090|9100'
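Where `netstat` is not available, the same check can be sketched in Python by trying to open a TCP connection to each port. The service-to-port mapping below follows the ports grepped for above (3000 being Grafana, which is installed later in this post); the function names are illustrative.

```python
import socket
from typing import Dict

# Ports used in this post.
SERVICE_PORTS = {
    "prometheus": 9090,
    "pushgateway": 9091,
    "node_exporter": 9100,
    "grafana": 3000,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def check_services(host: str = "127.0.0.1") -> Dict[str, bool]:
    """Report, per service, whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICE_PORTS.items()}
```

An open port only shows that something is listening there; it does not prove that the listener is the expected service.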


View the targets

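Besides the web UI, target health can be read from the Prometheus HTTP API at `/api/v1/targets`. The sketch below, with illustrative function names and assuming Prometheus runs on `localhost:9090` as configured above, reduces that response to a mapping from scrape URL to health.

```python
import json
import urllib.request
from typing import Dict


def summarize_targets(payload: dict) -> Dict[str, str]:
    """Map each active target's scrape URL to its reported health."""
    active = payload.get("data", {}).get("activeTargets", [])
    return {target["scrapeUrl"]: target["health"] for target in active}


def fetch_targets(prometheus_url: str = "http://localhost:9090") -> dict:
    """Fetch the raw /api/v1/targets response from a running Prometheus."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)
```

With all three jobs running, `summarize_targets(fetch_targets())` should report `up` for the Prometheus, node_exporter, and Pushgateway endpoints.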

Flink

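Edit the configuration file

Add the following settings to conf/flink-conf.yaml in the Flink installation directory. Replace host with the hostname of the machine where Pushgateway was installed above, and job with the job name to report under.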

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: host
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: job
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
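The following Python sketch shows how these six keys fit together: it collects every option under `metrics.reporter.promgateway.` and derives the Pushgateway URL the reporter would push to. The parser is a simplified assumption (plain `key: value` lines only) and the function names are illustrative. With `randomJobNameSuffix: true`, a random suffix follows the job name, which matches the URL in the log under Issue 2 (`fibodata` followed by a hexadecimal string).

```python
from typing import Dict


def reporter_options(conf_text: str, reporter: str) -> Dict[str, str]:
    """Collect the options of one metrics reporter from flink-conf.yaml text."""
    prefix = f"metrics.reporter.{reporter}."
    options: Dict[str, str] = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition(": ")
        if separator and key.startswith(prefix):
            options[key[len(prefix):]] = value.strip()
    return options


def push_url_prefix(options: Dict[str, str]) -> str:
    """URL the reporter pushes to, before any random job-name suffix."""
    return f"http://{options['host']}:{options['port']}/metrics/job/{options['jobName']}"


EXAMPLE_CONF = """\
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: host
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: job
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
"""
```

Running `push_url_prefix(reporter_options(EXAMPLE_CONF, "promgateway"))` on the placeholder values gives `http://host:9091/metrics/job/job`, which makes it easy to spot a `host` or `jobName` that was left unreplaced.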

Copy the JAR file

cd /usr/local/flink/current
cp opt/flink-metrics-prometheus-1.9.1.jar lib/

Grafana

Download and install

mkdir -p /usr/local/grafana
cd /usr/local/grafana
wget https://dl.grafana.com/oss/release/grafana-6.4.4.linux-amd64.tar.gz
tar -zxvf grafana-6.4.4.linux-amd64.tar.gz

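Start

cd /usr/local/grafana/grafana-6.4.4  # grafana-server looks for conf/ relative to the working directory
nohup /usr/local/grafana/grafana-6.4.4/bin/grafana-server web >/usr/local/grafana/grafana-6.4.4/nohup.out 2>&1 &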


Reporting with a custom Pushgateway job name

Reference
monitoring - How could I override configuration value in Apache Flink? - Stack Overflow

Issues

Issue 1

2019-11-12 16:07:48,899 ERROR org.apache.flink.runtime.metrics.ReporterSetup                - Could not instantiate metrics reporter promgateway. Metrics might not be exposed/reported.
java.lang.ClassNotFoundException: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.flink.runtime.metrics.ReporterSetup.loadViaReflection(ReporterSetup.java:242)
at org.apache.flink.runtime.metrics.ReporterSetup.loadReporter(ReporterSetup.java:210)
at org.apache.flink.runtime.metrics.ReporterSetup.fromConfiguration(ReporterSetup.java:162)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createMetricRegistry(ClusterEntrypoint.java:305)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:261)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:202)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:93)

Fix: the reporter JAR ships in opt/ and is not on the classpath until it is copied into lib/.

cp opt/flink-metrics-prometheus-1.9.1.jar lib/
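To confirm the fix before restarting Flink, one can check that some JAR under `lib/` really contains the reporter class named in the `ClassNotFoundException`. The Python sketch below does this by listing the entries of each JAR (a JAR is a ZIP archive); the function name is illustrative, and the `lib/` path in the usage note follows the Flink directory used earlier in this post.

```python
import zipfile
from pathlib import Path
from typing import List

REPORTER_CLASS = "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter"


def jars_containing(lib_dir: str, class_name: str = REPORTER_CLASS) -> List[str]:
    """Return the names of the JARs in lib_dir that contain class_name."""
    # Inside a JAR, a class is stored as a path with a .class suffix.
    entry = class_name.replace(".", "/") + ".class"
    matches: List[str] = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                matches.append(jar.name)
    return matches
```

An empty result from `jars_containing("/usr/local/flink/current/lib")` would mean the copy step has not taken effect and the same error is to be expected on the next start.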

Issue 2

java.io.IOException: Response code from http://server3:9091/metrics/job/fibodata5ab95bcaadf9b4c7d3a61220f0945f77 was 200
at org.apache.flink.shaded.io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:297)
at org.apache.flink.shaded.io.prometheus.client.exporter.PushGateway.push(PushGateway.java:105)
at org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter.report(PrometheusPushGatewayReporter.java:76)
at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:436)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2019-11-12 16:40:06,645 WARN org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter - Failed to push metrics to PushGateway with jobName fibodata5ab95bcaadf9b4c7d3a61220f0945f77.

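The cause has not been found yet; it may be a problem in the framework itself. Note that the reported response code is 200, so the push request did reach the Pushgateway.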
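One hypothesis, not verified in this post, is a mismatch in what counts as success: the Prometheus client bundled with this Flink version may accept only HTTP 202 as a successful push, while this Pushgateway version answers 200. The two hypothetical helper functions below only illustrate that difference; they are not the actual Flink or Pushgateway code.

```python
def strict_success(status: int) -> bool:
    """Treat only 202 Accepted as success (the hypothesised client behaviour)."""
    return status == 202


def lenient_success(status: int) -> bool:
    """Treat any 2xx status as success."""
    return 200 <= status < 300


# Under the strict rule, a 200 response is reported as a failure even
# though the server accepted the push.
for status in (200, 202, 500):
    print(status, strict_success(status), lenient_success(status))
```

If this hypothesis holds, the warning would be harmless noise as long as the metrics show up on the Pushgateway, and matching the Pushgateway and Flink client versions would be the place to look for a fix.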