Preface
"Observability" has been talked to death over the past couple of years, but for many teams the reality looks like this: Prometheus handles metrics, ELK handles logs, Jaeger handles traces. Three systems, each in its own silo, and troubleshooting means bouncing between three different UIs.
Last year we started rolling out OpenTelemetry (OTel for short), with the goal of unifying our data-collection standard. After more than half a year of work, we finally got the three pillars (Metrics, Logs, Traces) wired together.
This post shares our adoption experience: the architecture, the pitfalls we hit, and the end result.
Why OpenTelemetry
First, the status quo we started from:
```
+-------------+      +-------------+      +-------------+
| Prometheus  |      |  ELK Stack  |      |   Jaeger    |
+------+------+      +------+------+      +------+------+
       |                    |                    |
       v                    v                    v
  Metrics SDKs         Log agents          Tracing SDK
(various exporters)  (Filebeat/Fluentd)  (Jaeger client)
```

The problems are obvious:
- Fragmented stacks: three collection schemes, three data formats
- Broken context: when an alert fires, you can't find the matching logs and traces
- High maintenance cost: every language has to integrate three separate SDKs
This is exactly the problem OpenTelemetry solves: one unified collection standard:
```
+------------------+
|  OpenTelemetry   |
|    Collector     |
+--------+---------+
         |
  Unified wire format
    (OTLP protocol)
         |
+--------+---------+
|     OTel SDK     |
| (Metrics + Logs  |
|  + Traces in one)|
+------------------+
```

Architecture Design
Our final architecture:
```
                    +-----------------+
                    |     Grafana     |
                    | (unified views) |
                    +--------+--------+
                             |
     +---------------+---------------+---------------+
     |               |               |               |
     v               v               v               v
+-----------+  +-----------+  +-----------+  +-----------+
| Prometheus|  |   Loki    |  |   Tempo   |  |  Jaeger   |
| (metrics) |  |  (logs)   |  | (traces)  |  | (fallback)|
+-----------+  +-----------+  +-----------+  +-----------+
     ^               ^               ^               ^
     |               |               |               |
     +---------------+-------+-------+---------------+
                             |
                   +---------+---------+
                   |  OTel Collector   |
                   |  (gateway mode)   |
                   +---------+---------+
                             ^
                             | OTLP
             +---------------+---------------+
             |               |               |
       +-----+-----+   +-----+-----+   +-----+-----+
       | Service A |   | Service B |   | Service C |
       | (OTel SDK)|   | (OTel SDK)|   | (OTel SDK)|
       +-----------+   +-----------+   +-----------+
```
The core ideas:
- Applications integrate the OTel SDK and report data over the OTLP protocol
- The Collector acts as a gateway: it receives, processes, and fans out all telemetry
- Backend storage stays swappable, with no vendor lock-in (see the sketch after this list)
- Grafana is the single pane of glass, with Metrics/Logs/Traces cross-linked
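A concrete consequence of the third point: the application only ever speaks OTLP to the Collector, so swapping Tempo for Jaeger (or anything else) is a Collector config change, not an app change. The Go OTLP exporters also honor the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, so even the Collector address can stay out of the code. A minimal sketch of that wiring:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// With no WithEndpoint option, the exporter reads the standard
	// OTEL_EXPORTER_OTLP_ENDPOINT env var (defaulting to localhost:4317).
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}

	tp := trace.NewTracerProvider(trace.WithBatcher(exporter))
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)
}
```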
Collector Deployment
The OpenTelemetry Collector is the core component; it receives, processes, and exports all telemetry.
Docker deployment
```yaml
# docker-compose.yml
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.92.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector's own metrics
      - "8889:8889"   # Prometheus exporter
    restart: unless-stopped
```

Collector configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Also accept Prometheus format (compatible with existing monitoring)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8888']

processors:
  # Batch to reduce network overhead
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Memory limit to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  # Add common attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # Metrics -> Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  # Traces -> Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Logs -> Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      attributes:
        service.name: "service_name"
        level: "severity"
  # For debugging
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]
```

Application Integration
Go service integration
```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	otelmetric "go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res := resource.NewWithAttributes(semconv.SchemaURL,
		semconv.ServiceName("user-service"),
		semconv.ServiceVersion("1.0.0"),
		attribute.String("environment", "production"),
	)

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(res),
		trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func initMeter() (*metric.MeterProvider, error) {
	exporter, err := otlpmetricgrpc.New(context.Background(),
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	mp := metric.NewMeterProvider(
		metric.WithReader(metric.NewPeriodicReader(exporter,
			metric.WithInterval(15*time.Second))),
	)
	otel.SetMeterProvider(mp)
	return mp, nil
}

func main() {
	tp, _ := initTracer()
	defer tp.Shutdown(context.Background())
	mp, _ := initMeter()
	defer mp.Shutdown(context.Background())

	tracer := otel.Tracer("user-service")
	meter := otel.Meter("user-service")

	// Create the instruments
	requestCounter, _ := meter.Int64Counter("http_requests_total")
	requestDuration, _ := meter.Float64Histogram("http_request_duration_seconds")

	http.HandleFunc("/api/user", func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(), "GetUser")
		defer span.End()

		start := time.Now()

		// Business logic
		span.SetAttributes(attribute.String("user.id", r.URL.Query().Get("id")))

		// Simulate a database query
		_, dbSpan := tracer.Start(ctx, "DB.Query")
		time.Sleep(50 * time.Millisecond)
		dbSpan.End()

		// Record the metrics (attributes go through the metric API's AddOption)
		requestCounter.Add(ctx, 1,
			otelmetric.WithAttributes(attribute.String("method", r.Method)))
		requestDuration.Record(ctx, time.Since(start).Seconds())

		w.Write([]byte(`{"name": "test"}`))
	})

	log.Println("Server starting on :8080")
	http.ListenAndServe(":8080", nil)
}
```
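The snippet above instruments a single service; it does not show how trace context crosses service boundaries. Below is a minimal sketch using the otelhttp contrib package (the service names and the service-b address are illustrative): the instrumented transport injects the W3C traceparent header into outgoing requests, and the handler wrapper extracts it on the receiving side.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Outgoing: inject trace context into every request this client sends,
	// so the downstream service's spans join the same trace.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	// Incoming: extract trace context and open a server span per request.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// r.Context() now carries the caller's span; pass it along downstream.
		req, _ := http.NewRequestWithContext(r.Context(),
			"GET", "http://service-b:8080/api", nil)
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
		}
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", otelhttp.NewHandler(handler, "ServiceA"))
}
```

With both sides wrapped, Service A's and Service B's spans land in the same trace without any manual header handling.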
Java service integration

For Java, the agent approach is more convenient since it needs no code changes:
```bash
# Download the agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.1.0/opentelemetry-javaagent.jar

# Add the agent at startup
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.sampler=traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar order-service.jar
```

Auto-instrumentation covers HTTP requests, database calls, Redis, Kafka, and more, out of the box.
Python service integration
```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "payment-service"})

# Configure the tracer
trace_provider = TracerProvider(resource=resource)
trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# Configure the meter
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=15000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("payment-service")
meter = metrics.get_meter("payment-service")

# Usage
@tracer.start_as_current_span("process_payment")
def process_payment(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Business logic...
```

Correlating Metrics, Logs, and Traces
This is where OTel delivers the most value: correlating the three pillars.
Injecting the TraceID into logs
import("go.opentelemetry.io/otel/trace""go.uber.org/zap")funcLogWithTrace(ctx context.Context,logger*zap.Logger)*zap.Logger{span:=trace.SpanFromContext(ctx)ifspan.SpanContext().IsValid(){returnlogger.With(zap.String("trace_id",span.SpanContext().TraceID().String()),zap.String("span_id",span.SpanContext().SpanID().String()),)}returnlogger}// 使用funchandleRequest(ctx context.Context){logger:=LogWithTrace(ctx,zap.L())logger.Info("Processing request",zap.String("user_id","123"))}日志里带上trace_id后,在Grafana里可以直接从日志跳转到对应的链路。
Exemplar correlation
Prometheus 2.25+ supports exemplars, which link metrics to TraceIDs:
```go
// Record the metric with the TraceID attached (via the active span in ctx)
requestDuration.Record(ctx, duration,
	otelmetric.WithAttributes(attribute.String("method", "GET")),
)
```

When Grafana shows an anomalous metric, you can jump straight to the trace behind it. Two prerequisites worth knowing: the ctx passed to Record must carry a sampled span, and Prometheus must be started with --enable-feature=exemplar-storage.
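To make the ctx requirement concrete, here is a minimal sketch (the helper timedOperation is illustrative, not from our codebase); note that depending on your OTel Go SDK version, exemplar support may also need to be enabled explicitly:

```go
package obs

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	otelmetric "go.opentelemetry.io/otel/metric"
)

// timedOperation records a duration while a sampled span is active in ctx.
// The span context is what lets the SDK attach a TraceID exemplar to the
// histogram data point; calling Record with a bare context yields no exemplar.
func timedOperation(ctx context.Context, hist otelmetric.Float64Histogram) {
	ctx, span := otel.Tracer("user-service").Start(ctx, "Operation")
	defer span.End()

	start := time.Now()
	// ... business logic ...
	hist.Record(ctx, time.Since(start).Seconds(),
		otelmetric.WithAttributes(attribute.String("method", "GET")))
}
```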
Grafana Configuration
Data source configuration
```yaml
# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
        mappedTags: [{ key: 'service.name', value: 'service_name' }]
        mapTagNamesEnabled: true
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'
```

Results
Once this is wired up, troubleshooting looks like this:
- A Prometheus alert fires: some service's P99 latency has spiked
- Click the exemplar: jump to the trace of an actual slow request
- Inspect the trace in Tempo: a DB query is taking abnormally long
- Jump from the trace to the logs: see the exact SQL and error message
With the whole chain connected end to end, the efficiency gain is dramatic.
Production Experience
Sampling strategy
Capturing 100% of traces is unrealistic, so set a sampling rate:
```go
// Head sampling: the keep/drop decision is made when the root span starts,
// so at a 10% ratio most error traces are dropped along with the rest.
trace.NewTracerProvider(
	trace.WithSampler(trace.ParentBased(
		trace.TraceIDRatioBased(0.1), // sample 10% of new traces
	)),
)
```

A smarter approach is tail sampling in the Collector, which decides after seeing the whole trace and can therefore always keep errors and slow requests:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      # Randomly sample everything else
      - name: randomized
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

One caveat: tail sampling only works if every span of a trace reaches the same Collector instance, so with multiple gateway replicas you need a trace-ID-aware load-balancing layer in front.

Resource control
The Collector itself also needs monitoring and limits:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

extensions:
  health_check:
    endpoint: :13133
  zpages:
    endpoint: :55679   # debug pages
```

Multi-cluster management
We run three Kubernetes clusters, each with its own Collector deployment. To manage them, I used 星空组网 (a mesh-networking tool) to bridge the three clusters' private networks, so a single Grafana can query data from all of them; running a separate Grafana per cluster would be far too much operational overhead.
Pitfalls
Pitfall 1: Collector memory blowing up

Shortly after launch, the Collector kept getting OOM-killed. The cause: the batch processor was buffering too much data.

Fix: add memory_limiter and shrink the batch size.
Pitfall 2: Inconsistent SDK versions

Different services were on different OTel SDK versions, which led to subtle differences in the data format.

Fix: standardize the SDK version across services, and use the transform processor in the Collector to smooth over the remaining differences.
Pitfall 3: Log volume too high

OTel log collection defaults to collecting everything, and Loki couldn't keep up.

Fix: filter at the application layer so only ERROR and above are collected, or use the filter processor in the Collector; both are shown below:
```yaml
processors:
  filter:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "INFO"]
```
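For the application-layer option, raising the logger's minimum level is usually enough. A minimal zap sketch (the function name newErrorOnlyLogger is illustrative):

```go
package obs

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newErrorOnlyLogger builds a logger that emits only ERROR and above,
// so DEBUG/INFO lines never enter the collection pipeline at all.
func newErrorOnlyLogger() (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	cfg.Level = zap.NewAtomicLevelAt(zapcore.ErrorLevel)
	return cfg.Build()
}
```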
Summary

What OpenTelemetry changed for us:
- Unified standard: one SDK covers all three pillars
- Correlated data: one-click jumps from metrics to traces to logs
- Vendor neutrality: backend storage can be swapped at any time
- Active community: official support for all mainstream languages and frameworks
The adoption cost is real, but the long-term payoff is clear. Especially when debugging production issues, being able to pinpoint the offending code quickly is a tangible efficiency win.
My advice: start new projects on OTel directly; migrate existing ones gradually, wiring up the Collector first and then swapping out each service's SDK over time.