4.3 deployment controller 01
The Deployment Controller is one of the most commonly used controllers in kube-controller-manager; it manages Deployment resources. In essence, a Deployment deploys stateless workloads in a Kubernetes cluster by managing ReplicaSets and Pods.
Deployment and the controller pattern
In Kubernetes, the Pod is the smallest unit of resource, and Pod replicas are managed by the ReplicaSet (RS); a Deployment is essentially a higher-level layer built on top of RS.
The Deployment controller sits on top of ReplicaSets and can manage several of them: every time the Pod template changes (for example, the image version), a new RS is generated and gradually replaces the old one. Multiple RSs can coexist, but in steady state only one of them has running replicas; the others are scaled to 0 and kept for rollback.
Through a Deployment object you can easily:
- create ReplicaSets and Pods
- roll out upgrades (upgrade without taking the old version down) and roll back (return the application to a previous revision)
- scale up and down smoothly
- pause and resume a rollout
Working with Deployments
1. Deployment resource definition (spec)
2. Deployment example (a reconstructed manifest is shown right after this list)
3. Apply the manifest
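The deployment-demo.yaml applied in the next step is not reproduced in these notes; a minimal manifest consistent with the kubectl describe output shown later would look roughly like this (a reconstruction, not the author's exact file):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-demo
spec:
  replicas: 4
  selector:
    matchLabels:
      app: demoapp
      release: stable
  template:
    metadata:
      labels:
        app: demoapp
        release: stable
    spec:
      containers:
      - name: demoapp
        image: ikubernetes/demoapp:v1.0
        ports:
        - containerPort: 80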
root@k8s-master01: kubectl apply -f deployment-demo.yaml
deployment.apps/deployment-demo created
# check the deployment
root@k8s-master01:~/yaml/chapter08# kubectl get deployments.apps
NAME READY UP-TO-DATE AVAILABLE AGE
deployment-demo 4/4 4 4 12s
# check the pods
root@k8s-master01:~/yaml/chapter08# kubectl get pods -l 'app=demoapp,release=stable'
NAME READY STATUS RESTARTS AGE
deployment-demo-fb544c5d8-2687q 1/1 Running 0 2m16s
deployment-demo-fb544c5d8-2t6q4 1/1 Running 0 2m16s
deployment-demo-fb544c5d8-pkgzn 1/1 Running 0 2m16s
deployment-demo-fb544c5d8-w52qp 1/1 Running 0 2m16s
# The first segment is the deployment name and the last segment is a random suffix; the middle segment, fb544c5d8, is the hash of the ReplicaSet's Pod template, i.e. the hash of the template field
# check the replicasets
root@k8s-master01:~/yaml/chapter08# kubectl get replicasets.apps
NAME DESIRED CURRENT READY AGE
deployment-demo-fb544c5d8 4 4 4 4m5s
# Once the Pod template changes, the pod-template-hash changes too, which produces a new ReplicaSet and triggers a Deployment rollout
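To see where the fb544c5d8 value lives, you can inspect the pod-template-hash label that the controller stamps on the ReplicaSet and its Pods (a quick check against the objects above):
root@k8s-master01:~/yaml/chapter08# kubectl get rs deployment-demo-fb544c5d8 --show-labels   # shows pod-template-hash=fb544c5d8
root@k8s-master01:~/yaml/chapter08# kubectl get pods -l app=demoapp --show-labels            # the Pods carry the same label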
4. View the deployment's description
root@k8s-master01:~/yaml/chapter08# kubectl describe deployments.apps deployment-demo
Name: deployment-demo
Namespace: default
CreationTimestamp: Wed, 21 Apr 2024 13:23:13 +0000
Labels: <none>
Annotations: deployment.kubernetes.io/revision: 1
Selector: app=demoapp,release=stable
Replicas: 4 desired | 4 updated | 4 total | 4 available | 0 unavailable
StrategyType: RollingUpdate # a change to the Pod template triggers a rolling update
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge # the rolling update parameters
Pod Template:
Labels: app=demoapp
release=stable
Containers:
demoapp:
Image: ikubernetes/demoapp:v1.0
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: deployment-demo-fb544c5d8 (4/4 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 19m deployment-controller Scaled up replica set deployment-demo-fb544c5d8 to 4
5. Update
Scale the Deployment to 5 replicas with kubectl scale:
kubectl scale deployments/deployment-demo --replicas=5
To update the application image to v2.0, use the set image subcommand, giving the Deployment name and a container=image pair (the container in this demo is named demoapp):
kubectl set image deployments/deployment-demo demoapp=ikubernetes/demoapp:v2.0
Rollback:
# view the deployment's rollout history
root@k8s-master01:~# kubectl rollout history deployment deployment-demo
deployment.apps/deployment-demo
REVISION CHANGE-CAUSE
1 <none>
2 <none> # <- the current revision
# quickly roll back to the previous revision:
root@k8s-master01:~# kubectl rollout undo deployment deployment-demo
deployment.apps/deployment-demo rolled back
# roll back to a specific revision
kubectl rollout undo deployment <deployment-name> --to-revision=<revision-number>
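The CHANGE-CAUSE column above shows <none> because nothing recorded a cause. A common (optional) convention is to set the kubernetes.io/change-cause annotation on the Deployment when you make a change, so the revision created by that change shows it in the history, for example:
kubectl annotate deployment deployment-demo kubernetes.io/change-cause="set image to ikubernetes/demoapp:v2.0"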
Workflow
The Deployment controller's informers watch three kinds of resources: Deployments, ReplicaSets, and Pods. For Deployments and ReplicaSets it handles Add, Update, and Delete events; for Pods it only handles Delete events (see the sketch below).
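A simplified sketch of how NewDeploymentController wires these handlers (condensed from the upstream source; in recent versions each closure also passes a logger):
dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { dc.addDeployment(logger, obj) },
	UpdateFunc: func(oldObj, newObj interface{}) { dc.updateDeployment(logger, oldObj, newObj) },
	DeleteFunc: func(obj interface{}) { dc.deleteDeployment(logger, obj) },
})
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { dc.addReplicaSet(logger, obj) },
	UpdateFunc: func(oldObj, newObj interface{}) { dc.updateReplicaSet(logger, oldObj, newObj) },
	DeleteFunc: func(obj interface{}) { dc.deleteReplicaSet(logger, obj) },
})
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	DeleteFunc: func(obj interface{}) { dc.deletePod(logger, obj) },
})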
// DeploymentController is responsible for synchronizing Deployment objects stored
// in the system with actual running replica sets and pods.
type DeploymentController struct {
// rsControl is used for adopting/releasing replica sets.
rsControl controller.RSControlInterface
client clientset.Interface
eventBroadcaster record.EventBroadcaster
eventRecorder record.EventRecorder
// To allow injection of syncDeployment for testing.
syncHandler func(ctx context.Context, dKey string) error
// used for unit testing
enqueueDeployment func(deployment *apps.Deployment)
// dLister can list/get deployments from the shared informer's store
dLister appslisters.DeploymentLister
// rsLister can list/get replica sets from the shared informer's store
rsLister appslisters.ReplicaSetLister
// podLister can list/get pods from the shared informer's store
podLister corelisters.PodLister
// dListerSynced returns true if the Deployment store has been synced at least once.
// Added as a member to the struct to allow injection for testing.
dListerSynced cache.InformerSynced
// rsListerSynced returns true if the ReplicaSet store has been synced at least once.
// Added as a member to the struct to allow injection for testing.
rsListerSynced cache.InformerSynced
// podListerSynced returns true if the pod store has been synced at least once.
// Added as a member to the struct to allow injection for testing.
podListerSynced cache.InformerSynced
// Deployments that need to be synced
queue workqueue.TypedRateLimitingInterface[string]
}
func startDeploymentController(ctx context.Context, controllerContext ControllerContext, controllerName string) (controller.Interface, bool, error) {
dc, err := deployment.NewDeploymentController(
ctx,
controllerContext.InformerFactory.Apps().V1().Deployments(),
controllerContext.InformerFactory.Apps().V1().ReplicaSets(),
controllerContext.InformerFactory.Core().V1().Pods(),
controllerContext.ClientBuilder.ClientOrDie("deployment-controller"),
)
if err != nil {
return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
}
go dc.Run(ctx, int(controllerContext.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs))
return nil, true, nil
}
Next, let's look at the core processing logic of the DeploymentController:
dc.syncHandler = dc.syncDeployment
dc.enqueueDeployment = dc.enqueue
The core of dc is a Deployment work queue fed through enqueueDeployment and a Deployment reconciler, syncHandler.
Following enqueueDeployment, you can see that in the registered informers every event that can be related to a Deployment ends up calling enqueueDeployment with the Deployment object. Because enqueueDeployment is initialized to enqueue, the object is effectively handed to enqueue, which turns the Deployment's namespace and name into a key string and adds it to queue, a rate-limiting work queue of type workqueue.TypedRateLimitingInterface[string].
func (dc *DeploymentController) enqueue(deployment *apps.Deployment) {
key, err := controller.KeyFunc(deployment)
if err != nil {
utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", deployment, err))
return
}
dc.queue.Add(key)
}
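For intuition, controller.KeyFunc is cache.DeletionHandlingMetaNamespaceKeyFunc under the hood, so the item placed on the queue is just the "namespace/name" string. A minimal, self-contained check (object values are only for illustration):
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	d := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "deployment-demo"}}
	key, _ := cache.DeletionHandlingMetaNamespaceKeyFunc(d)
	fmt.Println(key) // prints: default/deployment-demo
}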
Next let's follow syncHandler. From the source you can see that when the controller manager starts the controller it calls the Run method, and Run (via worker) ends up invoking syncHandler:
// Run begins watching and syncing.
func (dc *DeploymentController) Run(ctx context.Context, workers int) {
defer utilruntime.HandleCrash()
// Start events processing pipeline.
dc.eventBroadcaster.StartStructuredLogging(3)
dc.eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: dc.client.CoreV1().Events("")})
defer dc.eventBroadcaster.Shutdown()
defer dc.queue.ShutDown()
logger := klog.FromContext(ctx)
logger.Info("Starting controller", "controller", "deployment")
defer logger.Info("Shutting down controller", "controller", "deployment")
if !cache.WaitForNamedCacheSync("deployment", ctx.Done(), dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
return
}
for i := 0; i < workers; i++ {
go wait.UntilWithContext(ctx, dc.worker, time.Second)
}
<-ctx.Done()
}
// worker runs a worker thread that just dequeues items, processes them, and marks them done.
// It enforces that the syncHandler is never invoked concurrently with the same key.
func (dc *DeploymentController) worker(ctx context.Context) {
for dc.processNextWorkItem(ctx) {
}
}
func (dc *DeploymentController) processNextWorkItem(ctx context.Context) bool {
key, quit := dc.queue.Get()
if quit {
return false
}
defer dc.queue.Done(key)
err := dc.syncHandler(ctx, key)
dc.handleErr(ctx, err, key)
return true
}
dc.worker is simple: it calls processNextWorkItem in a loop. processNextWorkItem takes an item (a deployment key) from the queue and hands it to syncHandler, which is the dc.syncDeployment method registered above. If the queue is empty, processNextWorkItem blocks on dc.queue.Get(). Once an item is retrieved, syncDeployment processes it. The flow so far can be summarized as: informer event → enqueue key → worker dequeues the key → syncDeployment reconciles.
Let's now focus on the syncDeployment function:
// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(ctx context.Context, key string) error {
logger := klog.FromContext(ctx)
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
logger.Error(err, "Failed to split meta namespace cache key", "cacheKey", key)
return err
}
startTime := time.Now()
logger.V(4).Info("Started syncing deployment", "deployment", klog.KRef(namespace, name), "startTime", startTime)
defer func() {
logger.V(4).Info("Finished syncing deployment", "deployment", klog.KRef(namespace, name), "duration", time.Since(startTime))
}()
deployment, err := dc.dLister.Deployments(namespace).Get(name)
if errors.IsNotFound(err) {
logger.V(2).Info("Deployment has been deleted", "deployment", klog.KRef(namespace, name))
return nil
}
if err != nil {
return err
}
// Deep-copy otherwise we are mutating our cache.
// TODO: Deep-copy only when needed.
d := deployment.DeepCopy()
everything := metav1.LabelSelector{}
if reflect.DeepEqual(d.Spec.Selector, &everything) {
dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
if d.Status.ObservedGeneration < d.Generation {
d.Status.ObservedGeneration = d.Generation
dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(ctx, d, metav1.UpdateOptions{})
}
return nil
}
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(ctx, d)
if err != nil {
return err
}
// List all Pods owned by this Deployment, grouped by their ReplicaSet.
// Current uses of the podMap are:
//
// * check if a Pod is labeled correctly with the pod-template-hash label.
// * check that no old Pods are running in the middle of Recreate Deployments.
podMap, err := dc.getPodMapForDeployment(d, rsList)
if err != nil {
return err
}
if d.DeletionTimestamp != nil {
return dc.syncStatusOnly(ctx, d, rsList)
}
// Update deployment conditions with an Unknown condition when pausing/resuming
// a deployment. In this way, we can be sure that we won't timeout when a user
// resumes a Deployment with a set progressDeadlineSeconds.
if err = dc.checkPausedConditions(ctx, d); err != nil {
return err
}
if d.Spec.Paused {
return dc.sync(ctx, d, rsList)
}
// rollback is not re-entrant in case the underlying replica sets are updated with a new
// revision so we should ensure that we won't proceed to update replica sets until we
// make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
if getRollbackTo(d) != nil {
return dc.rollback(ctx, d, rsList)
}
scalingEvent, err := dc.isScalingEvent(ctx, d, rsList)
if err != nil {
return err
}
if scalingEvent {
return dc.sync(ctx, d, rsList)
}
switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(ctx, d, rsList, podMap)
case apps.RollingUpdateDeploymentStrategyType:
return dc.rolloutRolling(ctx, d, rsList)
}
return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}
Fetching the deployment object
On entering the function, the key taken from the queue (in namespace/name format, e.g. default/deployment-demo) is split into the namespace and the deployment name, and the deployment object (the full Go object behind the deployment YAML you actually see) is then fetched from the local indexer by namespace and name.
The indexer is an indexed store: when the informer observes a resource change, it stores the fetched resource (such as the deployment) as key/value pairs in the indexer (a thread-safe map).
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
logger.Error(err, "Failed to split meta namespace cache key", "cacheKey", key)
return err
}
If the deployment no longer exists in the local store, it has already been deleted, so the function does nothing and returns.
Note that when a user deletes a deployment through the API or kubectl delete, the deployment is not removed right away: the API server sets the deletionTimestamp field in its metadata and returns HTTP 202 (Accepted). The real deletion is carried out by the garbage collector (see https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/). Once the garbage collector removes the deployment, the informer registered in dc observes the change and deletes it from the local cache, which is when the IsNotFound branch here is taken.
if errors.IsNotFound(err) {
klog.V(2).InfoS("Deployment has been deleted", "deployment", klog.KRef(namespace, name))
return nil
}
Then the deployment's selector field is checked. If it is an empty selector, nothing is done and the function returns: an empty selector would select all Pods, and a non-empty selector is required.
everything := metav1.LabelSelector{}
if reflect.DeepEqual(d.Spec.Selector, &everything) {
dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
if d.Status.ObservedGeneration < d.Generation {
d.Status.ObservedGeneration = d.Generation
dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(ctx, d, metav1.UpdateOptions{})
}
return nil
}
Next, the ReplicaSets belonging to this deployment are collected from the local cache based on labels. There are three cases:
- The RS already belongs to the deployment (its labels match the deployment's selector) and its ownerReference points to this deployment: it is added to the list.
- The RS's labels match the deployment's selector but it has no controller ownerReference yet: such an RS needs to be adopted by the deployment, and it is added to the list.
- The RS's labels no longer match the deployment's selector but its ownerReference still points to this deployment: the deployment must release the RS (remove the ownerReference), turning it into an orphan.
Which everyday operations lead to these three cases?
- Case 1: routine changes such as scaling the deployment (modifying the replicas field); the ownership between the RS and the deployment stays as it is.
- Case 2: modifying the deployment's template, for example upgrading the image; a new RS gets created.
- Case 3: changing the deployment's selector, so that an RS that used to match no longer does.
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(ctx, d)
if err != nil {
return err
}
Checking for a deletion marker
The function then checks whether the deployment carries a deletionTimestamp. As discussed above, deleting a deployment through the API or kubectl does not remove it immediately; deletionTimestamp is first set in its metadata. If deletionTimestamp is set, syncStatusOnly is called to synchronize status only. What does that mean?
In the deletionTimestamp case only the metadata has changed, so this call does not modify the deployment's spec or its replicasets; it merely recomputes and updates the deployment's status.
if d.DeletionTimestamp != nil {
return dc.syncStatusOnly(ctx, d, rsList)
}
Checking whether the deployment is paused
if err = dc.checkPausedConditions(ctx, d); err != nil {
return err
}
if d.Spec.Paused {
return dc.sync(ctx, d, rsList)
}
If the user pauses the deployment, for example with
kubectl rollout pause deploy/nginx
then paused: true is set in the deployment's spec.
At that point checkPausedConditions adds a condition like the following to the deployment's status:
- lastTransitionTime: "2023-03-21T14:03:52Z"
lastUpdateTime: "2023-03-21T14:03:52Z"
message: Deployment is paused
reason: DeploymentPaused
status: Unknown
type: Progressing
Checking for a rollback
Next it checks whether the change to the deployment was caused by a rollback request; if so, rollback is executed to start rolling back.
// rollback is not re-entrant in case the underlying replica sets are updated with a new
// revision so we should ensure that we won't proceed to update replica sets until we
// make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
if getRollbackTo(d) != nil {
return dc.rollback(ctx, d, rsList)
}
Rollback logic: check whether the requested revision is 0. A revision of 0 means "roll back to the previous revision"; if the previous revision cannot be found, the rollback is given up. For a non-zero revision, find the RS with that revision, copy its template into the deployment, and update the deployment through the API (if the target RS's template is identical to the deployment's current Pod template, no rollout happens).
Checking for a scaling event
The check compares the deployment's current replicas with the deployment.kubernetes.io/desired-replicas annotation on the active ReplicaSets (those that still have Pod replicas). If they differ, dc.sync is called to reconcile the scale (a simplified sketch of isScalingEvent follows the snippet below):
scalingEvent, err := dc.isScalingEvent(ctx, d, rsList)
if err != nil {
return err
}
if scalingEvent {
return dc.sync(ctx, d, rsList)
}
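A simplified sketch of what dc.isScalingEvent checks (an approximation of the upstream function; details vary between Kubernetes versions): it walks the active ReplicaSets and compares their desired-replicas annotation with the Deployment's spec.
func (dc *DeploymentController) isScalingEvent(ctx context.Context, d *apps.Deployment, rsList []*apps.ReplicaSet) (bool, error) {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList, false)
	if err != nil {
		return false, err
	}
	allRSs := append(oldRSs, newRS)
	for _, rs := range controller.FilterActiveReplicaSets(allRSs) {
		// desired comes from the deployment.kubernetes.io/desired-replicas annotation
		desired, ok := deploymentutil.GetDesiredReplicasAnnotation(klog.FromContext(ctx), rs)
		if !ok {
			continue
		}
		if desired != *(d.Spec.Replicas) {
			return true, nil
		}
	}
	return false, nil
}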
Choosing the update strategy
Finally we reach the last branch of dc, depending on whether the deployment's update strategy is Recreate or RollingUpdate:
switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(ctx, d, rsList, podMap)
case apps.RollingUpdateDeploymentStrategyType:
return dc.rolloutRolling(ctx, d, rsList)
}
RollingUpdate is the common case for deployments. Before walking through the rolling update flow, we need to understand the two parameters it uses, maxUnavailable and maxSurge:
- maxUnavailable is the maximum number of Pods that may be unavailable during the update;
- maxSurge is the maximum number of extra Pods that may be created beyond the desired count;
- both can be expressed as an absolute number or a percentage; a percentage is resolved against .spec.replicas into an absolute count (see the small calculation sketch below).
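A tiny self-contained sketch of how the percentages resolve into absolute counts (assuming the rounding rules described later in these notes: maxSurge rounds up, maxUnavailable rounds down). For the 4-replica deployment-demo with the default 25%/25%, both resolve to 1:
package main

import (
	"fmt"
	"math"
)

// resolveRollingUpdateBounds converts percentage-based maxSurge / maxUnavailable
// into absolute Pod counts for a given desired replica count:
// maxSurge is rounded up, maxUnavailable is rounded down.
func resolveRollingUpdateBounds(desired int, surgePct, unavailPct float64) (maxSurge, maxUnavailable int) {
	maxSurge = int(math.Ceil(float64(desired) * surgePct))
	maxUnavailable = int(math.Floor(float64(desired) * unavailPct))
	return maxSurge, maxUnavailable
}

func main() {
	s, u := resolveRollingUpdateBounds(4, 0.25, 0.25)
	fmt.Printf("maxSurge=%d, maxUnavailable=%d\n", s, u) // maxSurge=1, maxUnavailable=1
}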
// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(ctx context.Context, d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList, true)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
// Scale up, if we can.
scaledUp, err := dc.reconcileNewReplicaSet(ctx, allRSs, newRS, d)
if err != nil {
return err
}
if scaledUp {
// Update DeploymentStatus
return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}
// Scale down, if we can.
scaledDown, err := dc.reconcileOldReplicaSets(ctx, allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
if err != nil {
return err
}
if scaledDown {
// Update DeploymentStatus
return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}
if deploymentutil.DeploymentComplete(d, &d.Status) {
if err := dc.cleanupDeployment(ctx, oldRSs, d); err != nil {
return err
}
}
// Sync deployment status
return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}
- First, all ReplicaSets belonging to the Deployment are fetched;
- reconcileNewReplicaSet reconciles the new ReplicaSet's replica count, creating new Pods while keeping the extra replicas within maxSurge;
- reconcileOldReplicaSets reconciles the old ReplicaSets' replica counts, deleting old Pods while keeping the number of unavailable Pods within maxUnavailable;
- finally, obsolete ReplicaSets are cleaned up and the Deployment's status is updated.
The other update strategy is Recreate, which is rather blunt: the old RS's replica count is set to 0 first and only then is the new RS brought up, so all old Pods terminate at once. This makes the service unavailable for a period of time, which is why the strategy is rarely used.
Overall flow of dc: informer events → work queue → syncDeployment → (being deleted: syncStatusOnly; paused or scaling: sync; rollback annotation present: rollback; otherwise rolloutRecreate or rolloutRolling, depending on the strategy).
4.4 deployment controller 02
Review from the previous lesson: syncDeployment()
Main logic:
(1) Record the current time at the start of the method and define a defer function to measure the total execution time, i.e. how long one reconciliation of the deployment takes;
(2) Get the deployment object by its namespace and name;
(3) Call dc.getReplicaSetsForDeployment: process all replicaset objects in the same namespace as the deployment; a replicaset that matches the selector but is not yet associated with the deployment is associated by setting its ownerReferences, while an associated one that no longer matches has its ownerReferences removed; finally return the list of ReplicaSet objects associated with and matching the Deployment;
(4) Call dc.getPodMapForDeployment: using the deployment's selector, get the Pods associated with the deployment, grouped by the UID of the owning replicaset; the return type is map[types.UID][]*v1.Pod;
(5) If the deployment object's DeletionTimestamp is not nil, call dc.syncStatusOnly to recompute and update the deployment's status from its replicasets, then return without going any further;
(6) Call dc.checkPausedConditions: check whether the deployment is paused, and if so update the deployment's status by adding the pause-related condition;
(7) If the deployment's .Spec.Paused is true, call dc.sync and return;
(8) Call getRollbackTo to check whether the deployment's annotations contain the key deprecated.deployment.rollback.to; if the key exists and its value is not empty, call dc.rollback to perform the rollback;
(9) Call dc.isScalingEvent: check whether the deployment is in a scaling state, and if so call dc.sync to handle the scaling, then return;
(10) Check the deployment's update strategy: when it is Recreate, call dc.rolloutRecreate, i.e. handle the update by recreating; when it is RollingUpdate, call dc.rolloutRolling, i.e. handle it as a rolling update.
getReplicaSetsForDeployment
Main logic
The job of dc.getReplicaSetsForDeployment is to fetch the ReplicaSets in the cluster that are related to the Deployment: ReplicaSets that match the selector but are not yet associated with the deployment are adopted by setting their ownerReferences, and associated ReplicaSets that no longer match have their ownerReferences removed.
The main logic is:
(1) List all replicaset objects in the deployment's namespace;
(2) Call cm.ClaimReplicaSets to process them and return the list of replicasets that match and are associated with the deployment.
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(ctx, d)
if err != nil {
return err
}
// getReplicaSetsForDeployment uses ControllerRefManager to reconcile
// ControllerRef by adopting and orphaning.
// It returns the list of ReplicaSets that this Deployment should manage.
func (dc *DeploymentController) getReplicaSetsForDeployment(ctx context.Context, d *apps.Deployment) ([]*apps.ReplicaSet, error) {
// List all ReplicaSets to find those we own but that no longer match our
// selector. They will be orphaned by ClaimReplicaSets().
rsList, err := dc.rsLister.ReplicaSets(d.Namespace).List(labels.Everything())
if err != nil {
return nil, err
}
deploymentSelector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
if err != nil {
return nil, fmt.Errorf("deployment %s/%s has invalid label selector: %v", d.Namespace, d.Name, err)
}
// If any adoptions are attempted, we should first recheck for deletion with
// an uncached quorum read sometime after listing ReplicaSets (see #42639).
canAdoptFunc := controller.RecheckDeletionTimestamp(func(ctx context.Context) (metav1.Object, error) {
fresh, err := dc.client.AppsV1().Deployments(d.Namespace).Get(ctx, d.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
if fresh.UID != d.UID {
return nil, fmt.Errorf("original Deployment %v/%v is gone: got uid %v, wanted %v", d.Namespace, d.Name, fresh.UID, d.UID)
}
return fresh, nil
})
cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, d, deploymentSelector, controllerKind, canAdoptFunc)
return cm.ClaimReplicaSets(ctx, rsList)
}
cm.ClaimReplicaSets
It iterates over all replicaset objects in the same namespace as the deployment and calls m.ClaimObject on each of them. m.ClaimObject associates matching-but-unowned replicasets with the deployment by setting ownerReferences, and removes the ownerReferences of owned-but-no-longer-matching replicasets.
// ClaimReplicaSets tries to take ownership of a list of ReplicaSets.
//
// It will reconcile the following:
// - Adopt orphans if the selector matches.
// - Release owned objects if the selector no longer matches.
//
// A non-nil error is returned if some form of reconciliation was attempted and
// failed. Usually, controllers should try again later in case reconciliation
// is still needed.
//
// If the error is nil, either the reconciliation succeeded, or no
// reconciliation was necessary. The list of ReplicaSets that you now own is
// returned.
func (m *ReplicaSetControllerRefManager) ClaimReplicaSets(ctx context.Context, sets []*apps.ReplicaSet) ([]*apps.ReplicaSet, error) {
var claimed []*apps.ReplicaSet
var errlist []error
match := func(obj metav1.Object) bool {
return m.Selector.Matches(labels.Set(obj.GetLabels()))
}
adopt := func(ctx context.Context, obj metav1.Object) error {
return m.AdoptReplicaSet(ctx, obj.(*apps.ReplicaSet))
}
release := func(ctx context.Context, obj metav1.Object) error {
return m.ReleaseReplicaSet(ctx, obj.(*apps.ReplicaSet))
}
for _, rs := range sets {
ok, err := m.ClaimObject(ctx, rs, match, adopt, release)
if err != nil {
errlist = append(errlist, err)
continue
}
if ok {
claimed = append(claimed, rs)
}
}
return claimed, utilerrors.NewAggregate(errlist)
}
How should we understand adopt and release?
// AdoptReplicaSet sends a patch to take control of the ReplicaSet. It returns
// the error if the patching fails.
func (m *ReplicaSetControllerRefManager) AdoptReplicaSet(ctx context.Context, rs *apps.ReplicaSet) error {
if err := m.CanAdopt(ctx); err != nil {
return fmt.Errorf("can't adopt ReplicaSet %v/%v (%v): %v", rs.Namespace, rs.Name, rs.UID, err)
}
// Note that ValidateOwnerReferences() will reject this patch if another
// OwnerReference exists with controller=true.
patchBytes, err := ownerRefControllerPatch(m.Controller, m.controllerKind, rs.UID)
if err != nil {
return err
}
return m.rsControl.PatchReplicaSet(ctx, rs.Namespace, rs.Name, patchBytes)
}
func ownerRefControllerPatch(controller metav1.Object, controllerKind schema.GroupVersionKind, uid types.UID, finalizers ...string) ([]byte, error) {
blockOwnerDeletion := true
isController := true
addControllerPatch := objectForAddOwnerRefPatch{
Metadata: objectMetaForPatch{
UID: uid,
OwnerReferences: []metav1.OwnerReference{
{
APIVersion: controllerKind.GroupVersion().String(),
Kind: controllerKind.Kind,
Name: controller.GetName(),
UID: controller.GetUID(),
Controller: &isController,
BlockOwnerDeletion: &blockOwnerDeletion,
},
},
Finalizers: finalizers,
},
}
patchBytes, err := json.Marshal(&addControllerPatch)
if err != nil {
return nil, err
}
return patchBytes, nil
}
// how the patch is sent to the API server
func (r RealRSControl) PatchReplicaSet(ctx context.Context, namespace, name string, data []byte) error {
_, err := r.KubeClient.AppsV1().ReplicaSets(namespace).Patch(ctx, name, types.StrategicMergePatchType, data, metav1.PatchOptions{})
return err
}
// release
func GenerateDeleteOwnerRefStrategicMergeBytes(dependentUID types.UID, ownerUIDs []types.UID, finalizers ...string) ([]byte, error) {
var ownerReferences []map[string]string
for _, ownerUID := range ownerUIDs {
ownerReferences = append(ownerReferences, ownerReference(ownerUID, "delete"))
}
patch := objectForDeleteOwnerRefStrategicMergePatch{
Metadata: objectMetaForMergePatch{
UID: dependentUID,
OwnerReferences: ownerReferences,
DeleteFinalizers: finalizers,
},
}
patchBytes, err := json.Marshal(&patch)
if err != nil {
return nil, err
}
return patchBytes, nil
}
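To make adopt and release concrete, the two patches above serialize to JSON roughly as follows (an illustration with hypothetical placeholder UIDs, not captured API traffic):
package main

import "fmt"

func main() {
	// Adopt: add a controller ownerReference pointing at the Deployment.
	adoptPatch := `{"metadata":{"uid":"<rs-uid>","ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"deployment-demo","uid":"<deployment-uid>","controller":true,"blockOwnerDeletion":true}]}}`
	// Release: remove the Deployment's ownerReference via a strategic-merge "$patch": "delete" directive.
	releasePatch := `{"metadata":{"uid":"<rs-uid>","ownerReferences":[{"$patch":"delete","uid":"<deployment-uid>"}]}}`
	fmt.Println(adoptPatch)
	fmt.Println(releasePatch)
}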
Why do we RecheckDeletionTimestamp before adopting?
{% embed url="https://github.com/kubernetes/kubernetes/issues/42639" %}
According to the log of the first linked failure, and a local reproduction, the RS was not orphaned because the deployment controller re-adopted the RS that was just orphaned by the garbage collector.
I propose this fix:
To prevent the race, the deployment controller should get the latest deployment from API server and checks its deletionTimestamp, instead of checking its local cache, before adopting the RS. This can completely prevent re-adoption. Because if the deployment tries to adopt the RS, it means the RS lacks the controllerRef, which means the garbage collector has orphaned the RS, which means the deletionTimestamp must have already been set, so the proposed check will prevent the re-adoption. (but race for new-adoption is still possible)
The GC expects that once it sees a controller with a non-nil
DeletionTimestamp, that controller will not attempt any adoption.
There was a known race condition that could cause a controller to
re-adopt something orphaned by the GC, because the controller is using a
cached value of its own spec from before DeletionTimestamp was set.
This fixes that race by doing an uncached quorum read of the controller
spec just before the first adoption attempt. It’s important that this
read occurs after listing potential orphans. Note that this uncached
read is skipped if no adoptions are attempted (i.e. at steady state).
dc.rollback
getRollbackTo is called first to check whether the deployment's annotations contain the key deprecated.deployment.rollback.to; if the key exists and its value is not empty, dc.rollback is called to perform the rollback.
// TODO: Remove this when extensions/v1beta1 and apps/v1beta1 Deployment are dropped.
func getRollbackTo(d *apps.Deployment) *extensions.RollbackConfig {
// Extract the annotation used for round-tripping the deprecated RollbackTo field.
revision := d.Annotations[apps.DeprecatedRollbackTo]
if revision == "" {
return nil
}
revision64, err := strconv.ParseInt(revision, 10, 64)
if err != nil {
// If it's invalid, ignore it.
return nil
}
return &extensions.RollbackConfig{
Revision: revision64,
}
}
The main logic of dc.rollback:
(1) Get all of the deployment's matching and associated replicaset objects;
(2) Determine the revision to roll back to;
(3) Iterate over the replicasets obtained above and compare each one's revision with the target revision; on a match, call dc.rollbackToTemplate to perform the rollback (essentially copying .Spec.Template from the replicaset at that revision into the deployment);
(4) Finally, whether or not the rollback succeeded, clear the deployment's rollback spec (the deprecated rollback annotation) and update the deployment object.
// rollback the deployment to the specified revision. In any case cleanup the rollback spec.
func (dc *DeploymentController) rollback(ctx context.Context, d *apps.Deployment, rsList []*apps.ReplicaSet) error {
logger := klog.FromContext(ctx)
newRS, allOldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList, true)
if err != nil {
return err
}
allRSs := append(allOldRSs, newRS)
rollbackTo := getRollbackTo(d)
// If rollback revision is 0, rollback to the last revision
if rollbackTo.Revision == 0 {
if rollbackTo.Revision = deploymentutil.LastRevision(logger, allRSs); rollbackTo.Revision == 0 {
// If we still can't find the last revision, gives up rollback
dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find last revision.")
// Gives up rollback
return dc.updateDeploymentAndClearRollbackTo(ctx, d)
}
}
for _, rs := range allRSs {
v, err := deploymentutil.Revision(rs)
if err != nil {
logger.V(4).Info("Unable to extract revision from deployment's replica set", "replicaSet", klog.KObj(rs), "err", err)
continue
}
if v == rollbackTo.Revision {
logger.V(4).Info("Found replica set with desired revision", "replicaSet", klog.KObj(rs), "revision", v)
// rollback by copying podTemplate.Spec from the replica set
// revision number will be incremented during the next getAllReplicaSetsAndSyncRevision call
// no-op if the spec matches current deployment's podTemplate.Spec
performedRollback, err := dc.rollbackToTemplate(ctx, d, rs)
if performedRollback && err == nil {
dc.emitRollbackNormalEvent(d, fmt.Sprintf("Rolled back deployment %q to revision %d", d.Name, rollbackTo.Revision))
}
return err
}
}
dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find the revision to rollback to.")
// Gives up rollback
return dc.updateDeploymentAndClearRollbackTo(ctx, d)
}
4.5 deployment controller 03
dc.sync
// sync is responsible for reconciling deployments on scaling events or when they
// are paused.
func (dc *DeploymentController) sync(ctx context.Context, d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList, false)
if err != nil {
return err
}
if err := dc.scale(ctx, d, newRS, oldRSs); err != nil {
// If we get an error while trying to scale, the deployment will be requeued
// so we can abort this resync
return err
}
// Clean up the deployment when it's paused and no rollback is in flight.
if d.Spec.Paused && getRollbackTo(d) == nil {
if err := dc.cleanupDeployment(ctx, oldRSs, d); err != nil {
return err
}
}
allRSs := append(oldRSs, newRS)
return dc.syncDeploymentStatus(ctx, allRSs, newRS, d)
}
Now let's analyze dc.sync. It is called, with the sync returning immediately afterwards, in the following two cases:
(1) when the deployment's .Spec.Paused is true, dc.sync handles it and the sync returns;
(2) when dc.isScalingEvent reports that the deployment is in a scaling state, dc.sync handles the scaling and the sync returns.
About the Paused field
A deployment whose .Spec.Paused is true is paused; false means it behaves normally. While a deployment is paused, no change to its PodTemplateSpec triggers a rollout; a rollout is triggered again only after .Spec.Paused is set back to false.
Main logic of dc.sync:
(1) Call dc.getAllReplicaSetsAndSyncRevision to get the newest replicaset object and the list of old replicaset objects;
(2) Call dc.scale to decide whether a scaling operation is needed and, if so, perform it;
(3) When the deployment's .Spec.Paused is true and no rollback is in flight, call dc.cleanupDeployment to delete the oldest of the old replicasets, according to the configured number of retained revisions (.Spec.RevisionHistoryLimit) and the replicasets' creation times;
(4) Call dc.syncDeploymentStatus to compute and update the deployment's status field.
dc.scale
dc.scale handles scaling the deployment. Its main logic:
(1) Call deploymentutil.FindActiveOrLatest to check whether only the newest replicaset has a non-zero replica count. If so, compare its replica count with the deployment's desired replicas: if they are equal, return directly; otherwise call dc.scaleReplicaSetAndRecordEvent to scale it to the deployment's desired replica count;
(2) When the newest replicaset already matches the deployment's desired replicas but some old replicasets still have non-zero replicas, find those old replicasets and call dc.scaleReplicaSetAndRecordEvent to scale them down to 0, then return;
(3) When the newest replicaset does not match the deployment's desired replicas, some old replicasets still have replicas, and the update strategy is RollingUpdate, the deployment is most likely in the middle of a rolling update; in that case the new and old replicasets are scaled proportionally to keep the rollout stable.
// There are old replica sets with pods and the new replica set is not saturated.
// We need to proportionally scale all replica sets (new and old) in case of a
// rolling deployment.
if deploymentutil.IsRollingUpdate(deployment) {
allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS))
allRSsReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
allowedSize := int32(0)
if *(deployment.Spec.Replicas) > 0 {
allowedSize = *(deployment.Spec.Replicas) + deploymentutil.MaxSurge(*deployment)
}
// Number of additional replicas that can be either added or removed from the total
// replicas count. These replicas should be distributed proportionally to the active
// replica sets.
deploymentReplicasToAdd := allowedSize - allRSsReplicas
// The additional replicas should be distributed proportionally amongst the active
// replica sets from the larger to the smaller in size replica set. Scaling direction
// drives what happens in case we are trying to scale replica sets of the same size.
// In such a case when scaling up, we should scale up newer replica sets first, and
// when scaling down, we should scale down older replica sets first.
var scalingOperation string
switch {
case deploymentReplicasToAdd > 0:
sort.Sort(controller.ReplicaSetsBySizeNewer(allRSs))
scalingOperation = "up"
case deploymentReplicasToAdd < 0:
sort.Sort(controller.ReplicaSetsBySizeOlder(allRSs))
scalingOperation = "down"
}
// Iterate over all active replica sets and estimate proportions for each of them.
// The absolute value of deploymentReplicasAdded should never exceed the absolute
// value of deploymentReplicasToAdd.
deploymentReplicasAdded := int32(0)
nameToSize := make(map[string]int32)
logger := klog.FromContext(ctx)
for i := range allRSs {
rs := allRSs[i]
// Estimate proportions if we have replicas to add, otherwise simply populate
// nameToSize with the current sizes for each replica set.
if deploymentReplicasToAdd != 0 {
proportion := deploymentutil.GetProportion(logger, rs, *deployment, deploymentReplicasToAdd, deploymentReplicasAdded)
nameToSize[rs.Name] = *(rs.Spec.Replicas) + proportion
deploymentReplicasAdded += proportion
} else {
nameToSize[rs.Name] = *(rs.Spec.Replicas)
}
}
// Update all replica sets
for i := range allRSs {
rs := allRSs[i]
// Add/remove any leftovers to the largest replica set.
if i == 0 && deploymentReplicasToAdd != 0 {
leftover := deploymentReplicasToAdd - deploymentReplicasAdded
nameToSize[rs.Name] = nameToSize[rs.Name] + leftover
if nameToSize[rs.Name] < 0 {
nameToSize[rs.Name] = 0
}
}
// TODO: Use transactions when we have them.
if _, _, err := dc.scaleReplicaSet(ctx, rs, nameToSize[rs.Name], deployment, scalingOperation); err != nil {
// Return as soon as we fail, the deployment is requeued
return err
}
}
}
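The loop above distributes the replica delta proportionally. A self-contained sketch of the idea (a simplified stand-in for deploymentutil.GetProportion, not its exact math):
package main

import (
	"fmt"
	"sort"
)

// distributeProportionally spreads toAdd across ReplicaSets in proportion to their
// current sizes and hands any rounding leftover to the largest one.
func distributeProportionally(sizes []int32, toAdd int32) []int32 {
	out := make([]int32, len(sizes))
	copy(out, sizes)
	var total int32
	for _, s := range sizes {
		total += s
	}
	if total == 0 || toAdd == 0 {
		return out
	}
	// Order indexes by size, largest first (mirrors ReplicaSetsBySizeNewer/Older).
	idx := make([]int, len(sizes))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return sizes[idx[a]] > sizes[idx[b]] })

	var added int32
	for _, i := range idx {
		p := toAdd * sizes[i] / total // integer proportion of the delta
		out[i] += p
		added += p
	}
	out[idx[0]] += toAdd - added // leftover goes to the largest ReplicaSet
	if out[idx[0]] < 0 {
		out[idx[0]] = 0
	}
	return out
}

func main() {
	// Example: the old RS has 7 replicas, the new RS has 3, and the Deployment was
	// just scaled so that 5 more replicas are allowed in total.
	fmt.Println(distributeProportionally([]int32{7, 3}, 5)) // [11 4]
}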
When .spec.strategy.type==RollingUpdate, the Deployment updates Pods in a rolling fashion. You can specify maxUnavailable and maxSurge to control the rolling update process.
Max unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). Percentage values are converted to an absolute number by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created above the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if maxUnavailable is 0. Percentage values are converted to an absolute number by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
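In a manifest, these two knobs sit under the strategy field, for instance (illustrative values matching the defaults):
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%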
dc.rolloutRolling
switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(ctx, d, rsList, podMap)
case apps.RollingUpdateDeploymentStrategyType:
return dc.rolloutRolling(ctx, d, rsList)
}
return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(ctx context.Context, d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList, true)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
// Scale up, if we can.
scaledUp, err := dc.reconcileNewReplicaSet(ctx, allRSs, newRS, d)
if err != nil {
return err
}
if scaledUp {
// Update DeploymentStatus
return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}
// Scale down, if we can.
scaledDown, err := dc.reconcileOldReplicaSets(ctx, allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
if err != nil {
return err
}
if scaledDown {
// Update DeploymentStatus
return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}
if deploymentutil.DeploymentComplete(d, &d.Status) {
if err := dc.cleanupDeployment(ctx, oldRSs, d); err != nil {
return err
}
}
// Sync deployment status
return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}