github: release v2.0.11
renxiang committed Apr 26, 2024
1 parent a1430dc commit ecce8b2
Showing 41 changed files with 35,275 additions and 5,496 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@ mlu-exporter
image
.vscode
*.tar*
_build/
__pycache__/
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

image: 10.3.68.2:5001/cambricon/buildpack:20210507
image: yellow.hub.cambricon.com/caip/ci/buildpack:20230712
variables:
GOPROXY: http://10.3.68.2:8080

69 changes: 69 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,74 @@
# Changelog

## v2.0.11

- Remove mem-share

## v2.0.10

- Support sram/dram ecc err
- Add mlu nums
- Add heartbeat count

## v2.0.9

- Support dsmlu restore
- Bump cndev to 3.9.0
- Stop annoy metrics

## v2.0.8

- Support dynamic smlu monitoring
- Support new metrics to align with dcgm
- Support to print out version
- Refactor to use golden test
- Bump cndev to 3.8.0
- Support xid errors metrics

## v2.0.7

- Support smlu static
- Fix memory total and used

## v2.0.6

- Get rid of annoying metrics that cause long latency
- Add prometheus additional config

## v2.0.5

- Upgrade dependence to cndev 3.4.2
- Eliminate annoying printing
- Replace ioutil with os package
- Add metric:
  - parity_error

## v2.0.4

- Support mlu share mode
- Upgrade dependence to cndev 3.4.1
- Bump go to 1.19 and baseimage to ubuntu:20.04

## v2.0.3

- Add liveness/readiness probes
- Add log level config
- Remove bearer token in servicemonitor config
- Report vf metrics in env share mode as in sriov mode

## v2.0.1

- Upgrade dependence to cndev 3.0.1
- Deprecate and remove cnpapi dependencies
- Refactor collect test
- Merge vf metrics with pf metrics

## v2.0.0

- Upgrade dependence to cndev 3.0.0
- Add metric:
  - virtual_function_power_usage

## v1.6.7

- Upgrade dependence to cntoolkit 2.8.2
5 changes: 3 additions & 2 deletions Dockerfile
@@ -13,11 +13,12 @@
# limitations under the License.

ARG BUILDPLATFORM=linux/amd64
ARG BASE_IMAGE=ubuntu:18.04
FROM --platform=$BUILDPLATFORM golang:1.13 as build
ARG BASE_IMAGE=ubuntu:20.04
FROM --platform=$BUILDPLATFORM golang:1.19 as build
ARG APT_PROXY
ARG GOPROXY
ARG TARGETPLATFORM
ARG VERSION
RUN set -ex && export http_proxy=$APT_PROXY && \
apt-get update && \
apt-get install -y build-essential gcc-aarch64-linux-gnu ca-certificates make
4 changes: 2 additions & 2 deletions Makefile
@@ -13,6 +13,7 @@
# limitations under the License.

TARGETPLATFORM ?= linux/amd64
VERSION ?= v1.0.0
export GOOS := $(word 1, $(subst /, ,$(TARGETPLATFORM)))
export GOARCH := $(word 2, $(subst /, ,$(TARGETPLATFORM)))
export CGO_ENABLED := 1
@@ -23,14 +24,13 @@ endif
generate:
	mockgen -package mock -destination pkg/mock/cndev.go -mock_names=Cndev=Cndev github.com/Cambricon/mlu-exporter/pkg/cndev Cndev
	mockgen -package mock -destination pkg/mock/podrsources.go -mock_names=PodResources=PodResources github.com/Cambricon/mlu-exporter/pkg/podresources PodResources
	mockgen -package mock -destination pkg/mock/cnpapi.go -mock_names=Cnpapi=Cnpapi github.com/Cambricon/mlu-exporter/pkg/cnpapi Cnpapi
	mockgen -package mock -destination pkg/mock/host.go -mock_names=Host=Host github.com/Cambricon/mlu-exporter/pkg/host Host

lint:
	golangci-lint run -v

build:
	go build -trimpath -ldflags="-s -w" -o mlu-exporter .
	go build -trimpath -ldflags="-s -w" -ldflags="-X 'main.version=$(VERSION)'" -o mlu-exporter .

test:
	go test -v ./...
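
The new `-X 'main.version=$(VERSION)'` linker flag implies that package `main` holds a string variable named `version`, which fits the "Support to print out version" changelog entry. The following is a minimal sketch of how such a variable could be consumed, not the exporter's actual code: the variable name comes from the Makefile, while `versionString` and the `"unknown"` default are hypothetical.

```go
package main

import "fmt"

// version is meant to be overridden at link time, for example:
//
//	go build -ldflags="-X 'main.version=v2.0.11'" .
//
// The default value "unknown" is an assumption for builds without the flag.
var version = "unknown"

// versionString is a hypothetical helper that formats the linked-in version.
func versionString() string {
	return fmt.Sprintf("mlu-exporter version %s", version)
}

func main() {
	fmt.Println(versionString())
}
```

One caveat worth checking in the build rule: the go command documentation states that when `-ldflags` is repeated for the same packages, the latest one on the command line wins, so the `-s -w` from the first `-ldflags` is probably not applied. Merging both into a single flag, for example `-ldflags="-s -w -X 'main.version=$(VERSION)'"`, would keep them together.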
58 changes: 34 additions & 24 deletions README.md
@@ -7,10 +7,11 @@ Prometheus exporter for Cambricon MLU metrics, written in Go with pluggable metr
The prerequisites for running Cambricon MLU Exporter:

- MLU270, MLU270-X5K, MLU220, MLU290, MLU370 devices
- MLU driver >= 4.20.9
- cntoolkit >= 2.8.2 on your building machine
- MLU 2xx requires driver >= 4.9.13
- MLU 3xx requires driver >= 4.20.9
- MLU 2xx and 3xx require cntoolkit >= 2.8.2 on your building machine

For MLU driver version 4.9.x, please use [release v1.5.3].
For MLU driver versions earlier than 4.9.13, please use [release v1.5.3].

## Installation and Usage

Expand All @@ -27,14 +28,13 @@ cd mlu-exporter

Set the following environment variables if you need.

| env | description |
| ---------- | ------------------------------------------------------------------------------ |
| APT_PROXY | apt proxy address |
| GOPROXY | golang proxy address |
| ARCH | target platform architecture, amd64 or arm64, amd64 by default |
| LIBCNDEV | absolute path of the libcndev.so binary, neuware installation path by default |
| LIBCNPAPI | absolute path of the libcnpapi.so binary, neuware installation path by default |
| BASE_IMAGE | mlu exporter base image |
| env | description |
| ---------- | ----------------------------------------------------------------------------- |
| APT_PROXY | apt proxy address |
| GOPROXY | golang proxy address |
| ARCH | target platform architecture, amd64 or arm64, amd64 by default |
| LIBCNDEV | absolute path of the libcndev.so binary, neuware installation path by default |
| BASE_IMAGE | mlu exporter base image |

Docker should be >= 17.05.0 on your building machine. If you want to cross build, make sure docker version >= 19.03.

@@ -67,7 +67,7 @@ docker run -d \
--privileged=true \
--pid=host \
-e ENV_NODE_NAME={nodeName} \
cambricon-mlu-exporter:v1.6.7
cambricon-mlu-exporter:v2.0.11
```

Then use the following command to get the metrics.
@@ -84,10 +84,11 @@ docker run -d \
-v examples/metrics.yaml:/etc/mlu-exporter/metrics.yaml \
--privileged=true \
--pid=host \
cambricon-mlu-exporter:v1.6.7 \
cambricon-mlu-exporter:v2.0.11 \
mlu-exporter \
--metrics-config=/etc/mlu-exporter/metrics.yaml \
--metrics-path=/metrics \
--log-level=info \
--port=30108 \
--hostname=hostname \
--metrics-prefix=mlu \
@@ -97,20 +98,22 @@ mlu-exporter \

Command Args Description

| arg | description |
| -------------- | ------------------------------------------ |
| metrics-config | configuration file of MLU exporter metrics |
| metrics-path | metrics path of the exporter service |
| hostname | machine hostname, or env:"ENV_NODE_NAME" |
| port | exporter service port |
| metrics-prefix | prefix of all metric names |
| collector | collector names, cndev by default |
| arg | description |
| -------------- | ---------------------------------------------------------------------- |
| collector | collector names, cndev by default |
| env-share-num | vf numbers under env share mode, should set virtual-mode to env-share |
| hostname | machine hostname, or env:"ENV_NODE_NAME" |
| log-level | log level: trace/debug/info/warn/error/fatal/panic, info by default |
| metrics-config | configuration file of MLU exporter metrics |
| metrics-path | metrics path of the exporter service |
| metrics-prefix | prefix of all metric names |
| port | exporter service port |
| virtual-mode | virtual mode, default "", support dynamic-smlu, env-share |
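
As a rough illustration of the argument table above, the sketch below parses the same flag names with Go's standard `flag` package. It is an assumption-laden example rather than the exporter's real implementation: the `Options` struct, `parseOptions`, the field types, and the fallback from `--hostname` to `ENV_NODE_NAME` are hypothetical, and only the defaults stated in the table (`cndev`, `info`, empty `virtual-mode`) are taken from the source.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Options is a hypothetical container for the arguments in the table.
type Options struct {
	Collector     string
	EnvShareNum   int
	Hostname      string
	LogLevel      string
	MetricsConfig string
	MetricsPath   string
	MetricsPrefix string
	Port          int
	VirtualMode   string
}

// parseOptions parses args (without the program name) into Options.
// Defaults not stated in the README are left as zero values on purpose.
func parseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("mlu-exporter", flag.ContinueOnError)
	fs.StringVar(&o.Collector, "collector", "cndev", "collector names")
	fs.IntVar(&o.EnvShareNum, "env-share-num", 0, "vf numbers under env share mode")
	fs.StringVar(&o.Hostname, "hostname", "", "machine hostname")
	fs.StringVar(&o.LogLevel, "log-level", "info", "log level")
	fs.StringVar(&o.MetricsConfig, "metrics-config", "", "metrics configuration file")
	fs.StringVar(&o.MetricsPath, "metrics-path", "", "metrics path of the exporter service")
	fs.StringVar(&o.MetricsPrefix, "metrics-prefix", "", "prefix of all metric names")
	fs.IntVar(&o.Port, "port", 0, "exporter service port")
	fs.StringVar(&o.VirtualMode, "virtual-mode", "", "dynamic-smlu or env-share")
	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	// Assumed behavior: fall back to ENV_NODE_NAME when --hostname is not set.
	if o.Hostname == "" {
		o.Hostname = os.Getenv("ENV_NODE_NAME")
	}
	return o, nil
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```

Calling `parseOptions` with the arguments from the `docker run` example, such as `--port=30108 --metrics-prefix=mlu --log-level=info`, fills the matching fields and leaves the rest at their defaults.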

available collectors:

- cndev: collects basic MLU metrics
- podresources: collects MLU usage metrics in containers managed by Kubernetes. For Kubernetes lower than 1.15, make sure `KubeletPodResources` [feature gate] is enabled by setting the `feature-gates` [kubelet option] in your kubelet configuration.
- cnpapi: collects cnpapi pmu api metrics. It does not support SR-IOV. **Please note that cnpapi can only be used by one single process on a machine. Not recommended for production scenarios.**
- host: collects host machine metrics

And set the metrics configuration file passed by your metrics-config arg as you like, see examples/metrics.yaml for an example.
@@ -141,6 +144,12 @@ And if you want to create a Prometheus service monitor
kubectl apply -f examples/cambricon-mlu-exporter-sm.yaml
```

If you want to use Prometheus [additional scrape configs] instead:

```shell
kubectl create secret generic additional-scrape-configs --from-file=examples/cambricon-mlu-exporter-additional.yaml
```

Then check your Prometheus to get the MLU metrics.

##### Group Metrics
@@ -153,10 +162,10 @@ To attach namespace/pod/container info to another MLU metric, use `mlu_container
mlu_power_usage * on(uuid) group_right mlu_container
```

And for SR-IOV VFs:
And for env-share VFs:

```text
mlu_virtual_function_utilization * on(uuid,vf) group_right mlu_container
mlu_utilization * on(uuid,vf) group_right mlu_container
```

#### Metrics and Labels
@@ -174,3 +183,4 @@ For MLU370, mlu_temperature does not support cluster temperature, all cluster te
[release v1.5.3]: https://github.com/Cambricon/mlu-exporter/releases/tag/v1.5.3
[feature gate]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
[kubelet option]: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#options
[additional scrape configs]: https://github.com/prometheus-operator/prometheus-operator/tree/main/example/additional-scrape-configs
20 changes: 3 additions & 17 deletions build_image.sh
@@ -16,24 +16,22 @@
curpath=$(dirname "$0")
cd "$curpath" || exit 1

: "${TAG:=v1.6.7}"
: "${TAG:=v2.0.11}"
: "${ARCH:=amd64}"
: "${LIBCNDEV:=/usr/local/neuware/lib64/libcndev.so}"
: "${LIBCNPAPI:=/usr/local/neuware/lib64/libcnpapi.so}"

case $(awk -F= '/^NAME/{print $2}' /etc/os-release) in
"CentOS Linux")
BASE_IMAGE=centos:7
;;
esac

: "${BASE_IMAGE:=ubuntu:18.04}"
: "${BASE_IMAGE:=ubuntu:22.04}"

echo "Build environ (Can be overridden):"
echo "TAG = $TAG"
echo "ARCH = $ARCH"
echo "LIBCNDEV = $LIBCNDEV"
echo "LIBCNPAPI = $LIBCNPAPI"
echo "APT_PROXY = $APT_PROXY"
echo "GOPROXY = $GOPROXY"
echo "BASE_IMAGE = $BASE_IMAGE"
@@ -60,12 +58,6 @@ if [[ ! -f "$LIBCNDEV" ]]; then
exit 1
fi

if [[ ! -f "$LIBCNPAPI" ]]; then
echo "Can't find libcnpapi.so at $LIBCNPAPI."
echo "If you want to scrape cnpapi metrics, please install Cambricon neuware, or set LIBCNPAPI environ to path of libcnpapi.so"
echo "Else, ignore this message."
fi

case $ARCH in
amd64)
file_arch=x86-64
@@ -84,13 +76,7 @@ if ! file "$LIBCNDEV" --dereference | grep -q "$file_arch"; then
exit 1
fi

if [[ -f "$LIBCNPAPI" ]] && ! file "$LIBCNPAPI" --dereference | grep -q "$file_arch"; then
echo "$LIBCNPAPI is not for $ARCH"
exit 1
fi

cp "$LIBCNDEV" "$curpath/libs/linux/$ARCH/libcndev.so"
[[ -f "$LIBCNPAPI" ]] && cp "$LIBCNPAPI" "$curpath/libs/linux/$ARCH/libcnpapi.so"

echo "Building Cambricon MLU Exporter docker image."

@@ -100,6 +86,7 @@ echo "Building Cambricon MLU Exporter docker image."
--build-arg "GOPROXY=$GOPROXY" --build-arg "APT_PROXY=$APT_PROXY" \
--build-arg "BUILDPLATFORM=linux/$ARCH" \
--build-arg "BASE_IMAGE=$BASE_IMAGE" \
--build-arg "VERSION=$TAG" \
--build-arg "TARGETPLATFORM=linux/$ARCH" .

[[ "$ARCH" == "$build_arch" ]] && docker save -o "image/cambricon-mlu-exporter-$ARCH.tar" \
@@ -118,4 +105,3 @@ fi

echo "Image is saved at ./image/cambricon-mlu-exporter-$ARCH.tar"
rm -f "$curpath/libs/linux/$ARCH/libcndev.so"
rm -f "$curpath/libs/linux/$ARCH/libcnpapi.so"
17 changes: 17 additions & 0 deletions examples/cambricon-mlu-exporter-additional.yaml
@@ -0,0 +1,17 @@
- job_name: kube-prometheus-exporter-mlu-monitoring
  kubernetes_sd_configs:
    - namespaces:
        names:
          - kube-system
      role: endpoints
  metrics_path: /metrics
  relabel_configs:
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: keep
      regex: "kube-prometheus-exporter-mlu-monitoring"
    - action: replace
      source_labels:
        - __meta_kubernetes_endpoint_node_name
      target_label: node
  scheme: http
  scrape_interval: 30s
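
To make the two `relabel_configs` rules in this file easier to follow, here is a simplified Go sketch of their effect on one discovered target. It only imitates Prometheus relabeling for these two rules and is not Prometheus code: the `relabel` function is hypothetical, and the sketch assumes the documented behavior that relabel regexes are fully anchored and that `replace` with default settings copies the source label value into the target label.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRe mirrors the keep rule's regex. Prometheus anchors relabel regexes
// at both ends, so the anchors are written out explicitly here.
var keepRe = regexp.MustCompile(`^(?:kube-prometheus-exporter-mlu-monitoring)$`)

// relabel applies the two rules to a target's discovery labels. It returns
// the resulting labels and false when the target is dropped by the keep rule.
func relabel(labels map[string]string) (map[string]string, bool) {
	// Rule 1 (action: keep): drop targets whose endpoints name does not match.
	if !keepRe.MatchString(labels["__meta_kubernetes_endpoints_name"]) {
		return nil, false
	}
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	// Rule 2 (action: replace): copy the node name into the "node" label.
	// An empty value is skipped, since an empty label is treated as absent.
	if node := labels["__meta_kubernetes_endpoint_node_name"]; node != "" {
		out["node"] = node
	}
	return out, true
}

func main() {
	target := map[string]string{
		"__meta_kubernetes_endpoints_name":     "kube-prometheus-exporter-mlu-monitoring",
		"__meta_kubernetes_endpoint_node_name": "worker-1", // example value
	}
	if out, ok := relabel(target); ok {
		fmt.Println("kept, node =", out["node"])
	}
}
```

The resulting `node` label lets queries group MLU metrics by the Kubernetes node that hosts each exporter endpoint.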