
Docker install error: Requires: container-selinux >= 2:2.74

Installing Docker fails with the following dependency errors:

Error: Package: 3:docker-ce-20.10.8-3.el7.x86_64 (docker-ce-stable)

           Requires: container-selinux >= 2:2.74

Error: Package: docker-ce-rootless-extras-20.10.8-3.el7.x86_64 (docker-ce-stable)

           Requires: fuse-overlayfs >= 0.7

Error: Package: docker-ce-rootless-extras-20.10.8-3.el7.x86_64 (docker-ce-stable)

           Requires: slirp4netns >= 0.4

Error: Package: containerd.io-1.4.9-3.1.el7.x86_64 (docker-ce-stable)

           Requires: container-selinux >= 2:2.74

You could try using --skip-broken to work around the problem

You could try running: rpm -Va --nofiles --nodigest


Solution (point yum at the Aliyun CentOS 7 repo, enable EPEL, then install the latest container-selinux):

wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo

yum install epel-release -y

yum install container-selinux -y    # install the latest container-selinux
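When several dependencies are missing at once, the package names can be pulled straight out of yum's error output. A small sketch (the sample text is the abridged error block above):

```shell
# Extract the names of missing dependencies from yum's "Requires:" lines,
# so they can all be installed in one pass.
errors='Error: Package: 3:docker-ce-20.10.8-3.el7.x86_64 (docker-ce-stable)
           Requires: container-selinux >= 2:2.74
Error: Package: docker-ce-rootless-extras-20.10.8-3.el7.x86_64 (docker-ce-stable)
           Requires: fuse-overlayfs >= 0.7
Error: Package: docker-ce-rootless-extras-20.10.8-3.el7.x86_64 (docker-ce-stable)
           Requires: slirp4netns >= 0.4'

echo "$errors" | awk '/Requires:/ {print $2}' | sort -u
# prints: container-selinux, fuse-overlayfs, slirp4netns (one per line)
```

The resulting names can then be fed to `yum install` directly.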



Debugging Go in VS Code on Windows 10 throws an error

Set up Go debugging through VS Code on Windows 10; after installation, even a simple Hello World failed to run with:

Failed to continue: Check the debug console for details

After a lot of searching with no luck, the fix turned out to be editing launch.json:

{
    "version": "0.2.0",


    "configurations": [

        {

            "name": "Launch file",

            "type": "go",

            "request": "launch",

            "mode": "auto",

            "program": "${file}",

            "env": {

              "PATH": "<your GOPATH>"

            },

            "args": []

          }

    ]

}

Note: running go env in cmd will show your GOPATH.




GPU monitoring with Prometheus + Grafana, based on DCGM

DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments.

It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies such as power and clock management.

1. First, install NVIDIA Data Center GPU Manager (DCGM):

wget https://developer.download.nvidia.com/compute/DCGM/secure/2.0.13/RPMS/x86_64/datacenter-gpu-manager-2.0.13-1-x86_64.rpm

rpm -ivh datacenter-gpu-manager-2.0.13-1-x86_64.rpm


systemctl  start  dcgm.service

systemctl  status dcgm.service

systemctl  enable  dcgm.service


2. Install the NVIDIA DCGM exporter for Prometheus

Install Go first:

wget https://golang.org/dl/go1.15.2.linux-amd64.tar.gz

tar -C /usr/local -xzf go1.15.2.linux-amd64.tar.gz

vim /etc/profile    # append the line below, then run: source /etc/profile

export PATH=$PATH:/usr/local/go/bin


Install gpu-monitoring-tools:

# git clone https://github.com/NVIDIA/gpu-monitoring-tools.git

# cd /usr/server/gpu-monitoring-tools/

# make binary

# make install

go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg

install -m 755 dcgm-exporter /usr/bin/dcgm-exporter

install -m 755 -D ./etc/dcgm-exporter/default-counters.csv /etc/dcgm-exporter/default-counters.csv

install -m 755 -D ./etc/dcgm-exporter/dcp-metrics-included.csv /etc/dcgm-exporter/dcp-metrics-included.csv


Configure and run dcgm-exporter:

systemctl  start dcgm-exporter

systemctl  status  dcgm-exporter

systemctl  enable dcgm-exporter


# cat /usr/lib/systemd/system/dcgm-exporter.service

[Unit]

Description=DCGM Exporter

Wants=network-online.target

After=network-online.target


[Service]

User=root

ExecStart=/data/server/dcgm-exporter/dcgm-exporter


[Install]

WantedBy=default.target

3. Update the Prometheus configuration

  - job_name: 'GPU-112.23'

    static_configs:

      - targets: ['10.100.1.23:9400']


# systemctl  restart prometheus


4. Download the Grafana dashboard template

https://grafana.com/grafana/dashboards/12027



Fixing an Elasticsearch red status with a large number of UNASSIGNED shards

Yesterday the Elasticsearch cluster in our test environment suddenly went red. The cause: a new cluster had been brought up without its configuration being adjusted in time, so it merged into this cluster, leaving the original cluster with a large number of UNASSIGNED shards.

1. A quick summary of Elasticsearch's three health states:

green

All primary and replica shards are allocated. The cluster is 100% operational.

yellow

All primary shards are allocated, but at least one replica is missing. No data has been lost, so search results are still complete; however, high availability is degraded, and if more shards disappear you will lose data. Treat yellow as a warning that needs prompt investigation.

red

At least one primary shard (along with all of its replicas) is missing. Data is missing: searches return only partial results, and writes routed to that shard raise an exception.


2. A quick recap of the incident

When Elasticsearch is in a bad state, list the unassigned shards with:

# curl -s 'localhost:8200/_cat/shards' | fgrep UNASSIGNED


Note: the first column is the index name and the second is the shard number; in the third column, p marks a primary shard and r a replica.

unassigned_shards are shards that exist in the cluster state but cannot actually be found in the cluster. The usual source of unassigned shards is unassigned replicas: for example, an index with 5 shards and 1 replica on a single-node cluster will have 5 unassigned replica shards. If the cluster is red, it will also hold unassigned shards long-term, because primaries are missing.
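The reroute script below depends on pulling the index name and shard number out of each UNASSIGNED line, so it is worth checking the field positions against canned output first (the sample lines here are made up for illustration, using the index names from this incident):

```shell
# _cat/shards prints: index  shard  prirep  state ... -- filter the UNASSIGNED
# rows and pick out column 1 (index) and column 2 (shard number).
shards='user_re_3    2 p UNASSIGNED
user_re_3    2 r UNASSIGNED
user_error_2 0 p STARTED 123 45kb 10.0.0.1 node-1'

echo "$shards" | grep UNASSIGNED | awk '{print $1, $2}' | sort -u
# prints: user_re_3 2
```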


Option 1:

Manually delete the affected indices (be careful in production):

curl -XDELETE 'localhost:8200/user_re_3/'

curl -XDELETE 'localhost:8200/user_error_2/'



Option 2:

Re-allocate the unassigned shards:

#!/bin/bash

range=2

IFS=$'\n'

for line in $(curl -s 'localhost:8200/_cat/shards' | fgrep UNASSIGNED); do

  INDEX=$(echo $line | awk '{print $1}')

  SHARD=$(echo $line | awk '{print $2}')

  number=$RANDOM

  let "number %= ${range}"


  curl -H "Content-Type: application/json" -XPOST 'localhost:8200/_cluster/reroute' -d '{

  "commands" : [ {

  "allocate_empty_primary" :

  {

    "index" : '\"${INDEX}\"',

    "shard" : '\"${SHARD}\"',

    "node" : "master_node",

    "accept_data_loss" : true

  }

}

]

}'

done


Note: replace master_node with the name of your cluster's master node.



Option 3:

Bring up an additional node. This option is unverified (the first two are verified); option 2 is generally recommended.

Start a new node, let the cluster recover automatically, then shut it down.


Monitoring an Elasticsearch cluster with Prometheus + Grafana

1. Overview

The approach mirrors my previous note on monitoring Redis with Prometheus: elasticsearch_exporter scrapes metrics from the Elasticsearch cluster.


2. Download elasticsearch_exporter

https://github.com/justwatchcom/elasticsearch_exporter

3. Unpack, configure, and start

cd /usr/local/elasticsearch_exporter

nohup ./elasticsearch_exporter --es.uri http://localhost:9200 &

Note: if you have multiple clusters, run one exporter per cluster and give each its own port with --web.listen-address=":9115".


4. Configure Prometheus

vim prometheus.yml

  # scrape Elasticsearch cluster metrics

  - job_name: ES_cluster_9200

    static_configs:

      - targets: ['10.31.65.129:9114','10.31.65.130:9114','10.31.65.131:9114']

        labels:

          instance: elasticsearch_cluster_9200


Restart Prometheus:

systemctl  restart  prometheus


5. Download and import the Grafana dashboard

Grafana Dashboard:https://grafana.com/grafana/dashboards/2322



Import the JSON.



Wait a short while and data will start to appear.






Failed to read PID from file nginx.pid: Invalid argument

When starting nginx via systemctl on CentOS 7, it fails with: Failed to read PID from file nginx.pid: Invalid argument



The fix is a single extra line: the ExecStartPost sleep below gives nginx a moment to write its PID file before systemd tries to read it.


[Unit]

Description=nginx

After=network.target


[Service]

Type=forking

PIDFile=/usr/server/nginx/sbin/nginx.pid

ExecStartPost=/bin/sleep 0.1

ExecStartPre=/usr/local/nginx/sbin/nginx -t

ExecStart=/usr/local/nginx/sbin/nginx

ExecReload=/usr/local/nginx/sbin/nginx -s reload

ExecStop=/bin/kill -s QUIT $MAINPID

TimeoutStopSec=5

KillMode=mixed

Restart=always

LimitNOFILE=3205535


[Install]

WantedBy=multi-user.target


Switching pip to a domestic (China) mirror


Commonly used mirrors in China:

https://pypi.tuna.tsinghua.edu.cn/simple/   # Tsinghua University

https://mirrors.aliyun.com/pypi/simple/     # Alibaba Cloud

https://pypi.douban.com/simple/             # Douban

https://pypi.mirrors.ustc.edu.cn/simple/    # University of Science and Technology of China

https://pypi.hustunique.com/                # Huazhong University of Science and Technology



Usage

1. One-off use: pass the mirror URL with -i

For example, installing pip itself via the Tsinghua mirror:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ pip


2. Permanent use

Create a .pip directory under your home directory, then create a pip.conf file inside it with the mirror settings:

mkdir -p  ~/.pip/

vim  ~/.pip/pip.conf


[global]

index-url = https://pypi.tuna.tsinghua.edu.cn/simple

[install]

trusted-host = pypi.tuna.tsinghua.edu.cn
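The same file can be written non-interactively; a sketch that writes into a scratch directory (point PIPDIR at "$HOME/.pip" for real use):

```shell
# Write the pip mirror config in one step. PIPDIR is a temporary location
# for demonstration -- use "$HOME/.pip" in practice.
PIPDIR=$(mktemp -d)
cat > "$PIPDIR/pip.conf" <<'EOF'
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn
EOF

grep index-url "$PIPDIR/pip.conf"
# prints: index-url = https://pypi.tuna.tsinghua.edu.cn/simple
```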




Notes:

-r installs every package listed in requirements.txt

-U upgrades a package


Installing Redash on CentOS 7

I. Environment

OS: CentOS Linux release 7.x

Docker Server Version: 19.03.8

Redash: 8.0.0

II. Install docker-ce and docker-compose

1. Install docker-ce

First remove any older Docker packages (if present):

yum remove docker docker-common docker-selinux docker-engine-selinux docker-engine docker-ce

Next, install the required packages:

yum install -y yum-utils device-mapper-persistent-data lvm2

Configure the docker-ce repo:

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

Finally, install docker-ce:

yum install docker-ce

Change the default storage path (note: on newer Docker releases, --data-root replaces the now-deprecated --graph flag):

vim /usr/lib/systemd/system/docker.service 

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Change it to:

ExecStart=/usr/bin/dockerd --graph /data/docker  -H fd:// --containerd=/run/containerd/containerd.sock


2. Install docker-compose

sudo curl -L "https://github.com/docker/compose/releases/download/1.23.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

chmod +x /usr/local/bin/docker-compose


三、安装Redash

mkdir /opt/redash

cd /opt/redash

# Download the source from GitHub

git clone https://github.com/getredash/redash.git


# Create a docker-compose.production.yml file; the YAML below can be used as a reference

touch docker-compose.production.yml, with the following contents:

 

# This is an example configuration for Docker Compose. Make sure to at least update

# the cookie secret & postgres database password.

#

# Some other recommendations:

# 1. To persist Postgres data, assign it a volume host location.

# 2. Split the worker service to adhoc workers and scheduled queries workers.

version: '3.2'

services:

  server:

    image: redash/redash:8.0.0.b32245

    command: server

    depends_on:

      - postgres

      - redis

    ports:

      - "5000:5000"

    environment:

      PYTHONUNBUFFERED: 0

      REDASH_LOG_LEVEL: "INFO"

      REDASH_REDIS_URL: "redis://redis:6379/0"

      REDASH_DATABASE_URL: "postgresql://postgres@postgres/postgres"

      REDASH_COOKIE_SECRET: "XJ22k6vaXUk8"

      REDASH_WEB_WORKERS: 4   

      # mail settings

      REDASH_MAIL_SERVER: "mail.yourdomain.com"

      REDASH_MAIL_PORT: 25

      REDASH_MAIL_USE_TLS: "false"

      REDASH_MAIL_USE_SSL: "false"

      REDASH_MAIL_USERNAME: "report@yourdomain.com"

      REDASH_MAIL_PASSWORD: "YourPassword"

      REDASH_MAIL_DEFAULT_SENDER: "report@yourdomain.com"

      REDASH_HOST: "https://redash.yourdomain.com"

    restart: always

  worker:

    image: redash/redash:8.0.0.b32245

    command: scheduler

    environment:

      PYTHONUNBUFFERED: 0

      REDASH_LOG_LEVEL: "INFO"

      REDASH_REDIS_URL: "redis://redis:6379/0"

      REDASH_DATABASE_URL: "postgresql://postgres@postgres/postgres"

      QUEUES: "queries,scheduled_queries,celery"

      WORKERS_COUNT: 2

    restart: always

  redis:

    image: redis:3.0-alpine

    ports:

     - "6379:6379"

    volumes: 

      - ./data/redis:/data

    restart: always

  postgres:

    image: postgres:9.5.6-alpine

    ports:

     - "15432:5432"

    volumes:

      - ./data/postgresql_data:/var/lib/postgresql/data

    restart: always

  nginx:

    image: redash/nginx:latest

    ports:

      - "88:80"

    depends_on:

      - server

    links:

      - server:redash

    restart: always


 

# Create the database

docker-compose -f docker-compose.production.yml run --rm server create_db


 

# Run Redash in the background

docker-compose -f docker-compose.production.yml up -d


# If mail alerts are configured, the command below sends a test message; if it does not arrive, re-check your mail settings (the container name will differ on your host):

docker exec -it redash_server_1_5309d7faa1d5  python manage.py send_test_mail


Note: if mail cannot be sent, users cannot be created, because account creation requires an activation email.

Redash is now installed.


IV. Configure HTTPS access via a domain

upstream redash {

  server 172.17.0.1:5000;

}


server {

  listen 80;

  server_name  redash.yourdomain.com;


  # Allow accessing /ping without https. Useful when placing behind load balancer.

  location /ping {

    proxy_set_header Host $http_host;

    proxy_set_header X-Real-IP $remote_addr;

    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    proxy_pass       http://redash;

  }


  location / {

    # Enforce SSL.

    return 301 https://$host$request_uri;

  }

}



server {

  listen   443 ssl;

  server_name  redash.yourdomain.com;



  ssl_certificate     /usr/server/nginx/conf/key/2020_yourdomain.com.pem;

  ssl_certificate_key /usr/server/nginx/conf/key/2020_yourdomain.com.key;


  # Specifies that we don't want to use SSLv2 (insecure) or SSLv3 (exploitable)

  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

  # Uses the server's ciphers rather than the client's

  ssl_prefer_server_ciphers on;

  # Specifies which ciphers are okay and which are not okay. List taken from https://raymii.org/s/tutorials/Strong_SSL_Security_On_nginx.html

  ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:ECDHE-RSA-AES128-GCM-SHA256:AES256+EECDH:DHE-RSA-AES128-GCM-SHA256:AES256+EDH:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4";


  gzip on;

  gzip_types *;

  gzip_proxied any;


  location / {

    if ($whiteiplist = 0) {
        return 403;
    }

    proxy_set_header Host $http_host;

    proxy_set_header X-Real-IP $remote_addr;

    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;

    proxy_redirect   off;

    proxy_pass       http://redash;

  }

access_log  /data/logs/nginx/redash.yourdomain.access.log  main;

}



Troubleshooting container-to-host networking

Today I installed Redash in containers and put nginx in front of it. Accessing the host directly worked fine, but going through the nginx proxy did not. The host firewall was enabled, but the relevant ports had been opened; even so, the nginx container simply could not reach the host.

Out of other options, I captured packets and saw nothing but TCP Retransmission, which pointed at the container-to-host network path, most likely IP forwarding being disabled.



vim  /usr/lib/sysctl.d/50-default.conf

net.ipv4.ip_forward=1

vim /etc/sysctl.conf

net.ipv4.ip_forward = 1

sysctl  -p
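Whether forwarding is actually on can be confirmed from /proc at any time:

```shell
# 1 means the kernel forwards packets between interfaces (which Docker
# bridge networking relies on); 0 means forwarding is off.
# (sysctl net.ipv4.ip_forward shows the same value.)
cat /proc/sys/net/ipv4/ip_forward
```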


After the change, accessing through nginx worked normally again.


Troubleshooting java.net.SocketException: Too many open files

Problem: one of our business components was logging request failures with java.net.SocketException: Too many open files. It looked like a file-descriptor limit problem, but the system limits had already been raised. Had they not taken effect?

Environment: CentOS Linux release 7.4; the component is started via systemctl.


1. First check the system-wide file handle counters:

cat /proc/sys/fs/file-nr

# first number: handles allocated; second: allocated but unused; third: system-wide maximum

This looked normal; the ceiling had not been hit.

2. Then check ulimit -a:

open files                      (-n) 65536

Nothing wrong there either.

3. Check /etc/security/limits.d/20-nproc.conf:

*       soft    nproc   32000

*       hard    nproc   32000

4. Count the component's actual open file descriptors:

lsof -p PID |wc -l

It was over four thousand.

5. Check /proc/<PID>/limits


Only then did it become clear that the system-level configuration had not taken effect: the process's maximum was only 4096.

That was the root cause: services started via systemctl do not load the settings under /etc/security/limits.d/20-nproc.conf.
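The limit a process is actually running with can be read straight from /proc, which is what finally exposed the problem. An example against the current shell (substitute the service's PID for self in practice):

```shell
# "Max open files" in /proc/<pid>/limits is the effective per-process limit,
# regardless of what /etc/security/limits.d says -- for systemd-managed
# services the two can differ, as seen here.
grep 'Max open files' /proc/self/limits
ulimit -n    # the shell's own view of the same soft limit
```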


Solution:

1. Direct fix (verified)

Add a LimitNOFILE setting to the component's service file:

[Unit]

Description=<your component name>

After=syslog.target


[Service]

.....................

LimitNOFILE=1024000

[Install]

WantedBy=multi-user.target

Then reload systemd and restart the service (systemctl daemon-reload is required after editing a unit file).




2. Edit /etc/systemd/system.conf (not tested here, as it requires a reboot)

Add these two settings to /etc/systemd/system.conf:

DefaultLimitNOFILE=1024000

DefaultLimitNPROC=1024000


The above summarizes why file-handle limits configured via limits.d do not take effect for services started with systemctl, and how to fix it.