分布式链路追踪实践(二)

基于 OpenTracing 设计通用的链路追踪设计

Posted by ethan.luo on December 5, 2020

上一章中有对分布式追踪链路方案 (Jaeger, Zipkin, SkyWalking 等) 进行了介绍,不同的方案适用不同场景、团队、开发语言。在接入分布式链路追踪过程中,会有不同的团队使用不同的方案实现,如 A 团队使用的是 Zipkin, 而 B 团队使用的 Jaeger, 在不同的系统中使用存在兼容问题, 我们看看 OpenTracing 如何解决这个问题。

What is OpenTracing?
It is probably easier to start with what OpenTracing is NOT.

OpenTracing is not a download or a program. Distributed tracing requires that software developers add instrumentation to the code of an application, or to the frameworks used in the application.

OpenTracing is not a standard. The Cloud Native Computing Foundation (CNCF) is not an official standards body. The OpenTracing API project is working towards creating more standardized APIs and instrumentation for distributed tracing.

OpenTracing is comprised of an API specification, frameworks and libraries that have implemented the specification, and documentation for the project. OpenTracing allows developers to add instrumentation to their application code using APIs that do not lock them into any one particular product or vendor.

For more information about where OpenTracing has already been implemented, see the list of languages and the list of tracers that support the OpenTracing specification.

Supported tracers

  • CNCF Jaeger / Zikpin
  • LightStep
  • Instana
  • Apache SkyWalking
  • inspectIT
  • stagemonitor
  • Datadog
  • Wavefront by VMware
  • Elastic APM

OpenTracing 规范

OpenTracing 是⼀个轻量级的标准化层,它位于应⽤程序/类库和追踪或⽇志分析程序之间,是⼀套分布式追踪协议,与平台,语⾔⽆关,统⼀接⼝,⽅便开发接⼊不同的分布式追踪系统。

  • 语义规范 : 描述定义的数据模型 Tracer,Sapn 和 SpanContext 等;

  • 语义惯例 : 罗列出 tag 和 logging 操作时,标准的key值;

OpenTracing API

Trace

OpenTracing 中的 Trace(调⽤链)通过归属此链的 Span 来隐性定义。⼀条 Trace 可以认为⼀个有多个 Span 组成的有向⽆环图(DAG图),Span 是⼀个逻辑执⾏单元,Span 与 Span 的因果关系命名为 References。

OpenTracing 定义两种关系:

  • Childof:如下例⼦中, SpanC 是 childof SpanA
  • FollowsFrom:如下例⼦中,SpanG 是 followsFrom SpanF

ltm.png

  • SpanC 是 childof SpanA, SpanG 是 FollowsFrom SpanF

  • childof指的是垂直链路, FollowsFrom 指的是横向链路

Span

Span封装了如下状态

  • 操作名称

  • 开始时间戳

  • 结束时间戳

  • ⼀组零或多个键:值结构的 Span标签 (Tags)。键必须是字符串。值可以是字符串,布尔或数值类型.

  • ⼀组零或多个 Span⽇志 (Logs),其中每个都是⼀个键:值映射并与⼀个时间戳配对。键必须是字符串,值可以是任何类型。 并⾮所有的 OpenTracing 实现都必须⽀持每种值 类型。
  • ⼀个 SpanContext

  • 零或多个因果相关的 Span 间的 References (通过那些相关的 Span 的 SpanContext )

SpanContext 封装了如下状态

  • 任何需要跟跨进程 Span 关联的,依赖于 OpenTracing 实现的状态(例如 Trace 和 Span 的 id)

  • 键:值 结构的跨进程的 Baggage Items(区别于 span tag,baggage 是全局范围,在 span 间保持传递,⽽tag 是 span 内部,不会被⼦ span 继承使⽤。)

Trace/Span Identity

Trace/Span Identity 
Key 
uber-trace-id
▪ Case-insensitive in HTTP
▪ Lower-case in protocols that preserve header case
Value 
{trace-id}:{span-id}:{parent-span-id}:{flags}
▪ {trace-id} ▪ 64-bit or 128-bit random number in base16 format
▪ Can be variable length, shorter values are 0-padded on the left
▪ Clients in some languages support 128-bit, migration pending
▪ Value of 0 is not valid
▪ {span-id} ▪ 64-bit random number in base16 format
▪ Value of 0 is not valid
▪ {parent-span-id} ▪ 64-bit value in base16 format representing parent span id
▪ Deprecated, most Jaeger clients ignore on the receiving side, but still include it on the sending side
▪ 0 value is valid and means “root span” (when not ignored)
▪ {flags} ▪ One byte bitmap, as two hex digits
▪ Bit 1 (right-most, least significant, bit mask 0x01) is “sampled” flag
▪ 1 means the trace is sampled and all downstream services are advised to respect that
▪ 0 means the trace is not sampled and all downstream services are advised to respect that
▪ We’re considering a new feature that allows downstream services to upsample if they find their tracing level is too low
▪ Bit 2 (bit mask 0x02 ) is “debug” flag
▪ Debug flag should only be set when the sampled flag is set
▪ Instructs the backend to try really hard not to drop this trace
▪ Bit 3 (bit mask 0x04 ) is not used
▪ Bit 4 (bit mask 0x08 ) is “firehose” flag
▪ Spans tagged as “firehose” are excluded from being indexed in the storage
▪ The traces can only be retrieved by trace ID (usually available from other sources, like logs)
▪ Other bits are unused

Inject 和 Extract 操作

  1. 跨进程,机器通讯,通过传递 Spancontext 来提供⾜够的信息建⽴ span 间的关系。

  2. SpanContext 通过 Inject 操作向 Carrier 中增加,传递后通过 Extracted 从 Carrier 中取出。

Inject

// TracerWrapper tracer wrapper
func AddTracer(ctx context.Context, r *http.Request, tracer opentracing.Tracer, tags map[string]string) context.Context {
	//初始化 tracer
	opentracing.InitGlobalTracer(tracer)
	var sp opentracing.Span
	//从 header 中获取 span
	spanCtx, _ := opentracing.GlobalTracer().Extract(opentracing.HTTPHeaders, 
		opentracing.HTTPHeadersCarrier(r.Header))
	if spanCtx != nil {
		sp = opentracing.GlobalTracer().StartSpan(r.URL.Path, opentracing.ChildOf(spanCtx))
	}else{
	//如果 header 中没有携带 context, 则新建 span
		sp = tracer.StartSpan(r.URL.Path)
	}
	//写入 tag 或者 日志
	for k, v := range tags {
		// sp.LogKV(k, v)
		// sp.SetTag(k, v)
		sp.LogFields(
		    spanLog.String(k, v),
		)
	}
	//注入span (用于传递)
	if err := opentracing.GlobalTracer().Inject(
		sp.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header)); err != nil {
        logger.Fatalln("inject failed", err)
	}
	...
}

Extracted

// TracerWrapper tracer wrapper
 func AddTracer(ctx context.Context, r *http.Request, tracer opentracing.Tracer, tags map[string]string) context.Context {
 	//初始化 tracer
 	opentracing.InitGlobalTracer(tracer)
 	var sp opentracing.Span
 	//从 header 中获取 span
 	spanCtx, _ := opentracing.GlobalTracer().Extract(opentracing.HTTPHeaders, 
 		opentracing.HTTPHeadersCarrier(r.Header))
 	if spanCtx != nil {
 		sp = opentracing.GlobalTracer().StartSpan(r.URL.Path, opentracing.ChildOf(spanCtx))
 	}else{
 	//如果 header 中没有携带 context, 则新建 span
 		sp = tracer.StartSpan(r.URL.Path)
 	}
 	//写入 tag 或者 日志
 	for k, v := range tags {
 		// sp.LogKV(k, v)
 		// sp.SetTag(k, v)
 		sp.LogFields(
 		    spanLog.String(k, v),
 		)
 	}
 	...
 }
  1. 分布式跟踪(包扩跨服务,进程,主机等)需要对 trace 上下⽂ (spanContext) 进⾏传递,通过 Inject ⽅法, 在 io.Writer 中注⼊上下⽂信息,服务端通过 Extracted 取出,通过对上下⽂和业务逻辑判断 span 之间的关系(childof or followsfrom)。

Sampling,采样

OpenTracing API 不强调采样的概念,但是⼤多数追踪系统通过不同⽅式实现采样。有些情况下,应⽤系统需要通知追踪程序,这条特定的调⽤需要被记录,即使根据默认采样规则,它不需要被记录。sampling.priority tag 提供这样的⽅式。追踪系统不保证⼀定采纳这个参数,但是会尽可能的保留这条调⽤。

sampling.priority - integer

  • 如果⼤于 0, 追踪系统尽可能保存这条调⽤链
  • 等于 0, 追踪系统不保存这条调⽤链
  • 如果此tag没有提供,追踪系统使⽤⾃⼰的默认采样规则

采样速率

⽣产环境系统性能很重要,所以对于所有的请求都开启 Trace 显然会带来⽐较⼤的压⼒,另外,⼤量的数据也会带来很⼤存储压⼒。为此,jaeger ⽀持设置采样速率,根据系统实际情况设置合适的采样频率。

Jaeger 官⽅提供了多种采集策略,使⽤者可以按需选择使⽤:

  • const,全量采集,采样率设置0,1 分别对应打开和关闭;
  • probabilistic,概率采集,默认万份之⼀,0~1之间取值;
  • rateLimiting,限速采集,每秒只能采集⼀定量的数据;
  • remote,⼀种动态采集策略,根据当前系统的访问量调节采集策略;

总结

  1. OpenTracing 设计类似于 Interface 抽象层设计, 很适合不同方案兼容的场景, 并且具有很好的扩展性, 后期改动成本较小;
  2. 适合生产方案迁移 ;

在生产中使用 OpenTracing 需要考虑兼容不同协议的链路追踪问题,下一章和大家分享 支持 HTTP 和 gRPC 分布式链路追踪 SDK 设计 。