
Introducing Single Pod Access Mode for PersistentVolumes

Author: Chris Henzie (Google)

Last month’s release of Kubernetes v1.22 introduced a new ReadWriteOncePod access mode for PersistentVolumes and PersistentVolumeClaims.
With this alpha feature, Kubernetes allows you to restrict volume access to a single pod in the cluster.

What are access modes and why are they important?

When using storage, there are different ways to model how that storage is consumed.

For example, a storage system like a network file share can have many users all reading and writing data simultaneously.
In other cases maybe everyone is allowed to read data but not write it.
For highly sensitive data, maybe only one user is allowed to read and write data but nobody else.

In the world of Kubernetes, access modes are the way you can define how durable storage is consumed.
These access modes are a part of the spec for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs).

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: shared-cache
spec:
  accessModes:
  - ReadWriteMany # Allow many pods to access shared-cache simultaneously.
  resources:
    requests:
      storage: 1Gi

Before v1.22, Kubernetes offered three access modes for PVs and PVCs:

  • ReadWriteOnce – the volume can be mounted as read-write by a single node
  • ReadOnlyMany – the volume can be mounted read-only by many nodes
  • ReadWriteMany – the volume can be mounted as read-write by many nodes

These access modes are enforced by Kubernetes components like the kube-controller-manager and kubelet to ensure only certain pods are allowed to access a given PersistentVolume.

What is this new access mode and how does it work?

Kubernetes v1.22 introduced a fourth access mode for PVs and PVCs, which you can use for CSI volumes:

  • ReadWriteOncePod – the volume can be mounted as read-write by a single pod

If you create a pod with a PVC that uses the ReadWriteOncePod access mode, Kubernetes ensures that pod is the only pod across your whole cluster that can read that PVC or write to it.
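
For illustration, here is a minimal sketch of a pod that consumes such a PVC (the pod name, image, and claim name below are hypothetical):

kind: Pod
apiVersion: v1
metadata:
  name: important-writer # hypothetical pod name
spec:
  containers:
  - name: writer
    image: busybox:1.28 # hypothetical image
    command: ["sh", "-c", "echo hello > /data/out.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: single-writer-only # a PVC created with the ReadWriteOncePod access mode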

If you create another pod that references the same PVC with this access mode, the pod will fail to start because the PVC is already in use by another pod.
For example:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  1s    default-scheduler  0/1 nodes are available: 1 node has pod using PersistentVolumeClaim with the same name and ReadWriteOncePod access mode.

How is this different than the ReadWriteOnce access mode?

The ReadWriteOnce access mode restricts volume access to a single node, which means it is possible for multiple pods on the same node to read from and write to the same volume.
This could be a major problem for some applications, especially if they require at most one writer for data safety guarantees.

With ReadWriteOncePod these issues go away.
Set this access mode on your PVC, and Kubernetes guarantees that only a single pod has access.

How do I use it?

The ReadWriteOncePod access mode is in alpha for Kubernetes v1.22 and is only supported for CSI volumes.
As a first step you need to enable the ReadWriteOncePod feature gate for kube-apiserver, kube-scheduler, and kubelet.
You can enable the feature by setting command line arguments:

--feature-gates="...,ReadWriteOncePod=true"
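
If you are testing locally, one way to enable the gate is through your cluster tooling. As a sketch, assuming you use kind, a cluster configuration file can set the gate on every component:

# kind-config.yaml -- create the cluster with:
#   kind create cluster --config kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  ReadWriteOncePod: true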

You also need to update the following CSI sidecars to these versions or greater:

Creating a PersistentVolumeClaim

In order to use the ReadWriteOncePod access mode for your PVs and PVCs, you will need to create a new PVC with the access mode:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: single-writer-only
spec:
  accessModes:
  - ReadWriteOncePod # Allow only a single pod to access single-writer-only.
  resources:
    requests:
      storage: 1Gi

If your storage plugin supports dynamic provisioning, new PersistentVolumes will be created with the ReadWriteOncePod access mode applied.
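
As a sketch of what that might look like end to end (the StorageClass name and provisioner below are hypothetical; substitute your own CSI driver):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast-csi # hypothetical StorageClass name
provisioner: csi.example.com # hypothetical CSI driver
volumeBindingMode: WaitForFirstConsumer
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: single-writer-dynamic
spec:
  storageClassName: fast-csi
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi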

Migrating existing PersistentVolumes

If you have existing PersistentVolumes, they can be migrated to use ReadWriteOncePod.

In this example, we already have a “cat-pictures-pvc” PersistentVolumeClaim that is bound to a “cat-pictures-pv” PersistentVolume, and a “cat-pictures-writer” Deployment that uses this PersistentVolumeClaim.

As a first step, you need to edit your PersistentVolume’s spec.persistentVolumeReclaimPolicy and set it to Retain.
This ensures your PersistentVolume will not be deleted when we delete the corresponding PersistentVolumeClaim:

kubectl patch pv cat-pictures-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
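
You can verify the patch took effect before moving on (a quick sanity check, not part of the original steps):

kubectl get pv cat-pictures-pv -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
# Should print: Retain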

Next you need to stop any workloads that are using the PersistentVolumeClaim bound to the PersistentVolume you want to migrate, and then delete the PersistentVolumeClaim.

Once that is done, you need to clear your PersistentVolume’s spec.claimRef.uid to ensure PersistentVolumeClaims can bind to it upon recreation:

kubectl scale --replicas=0 deployment cat-pictures-writer
kubectl delete pvc cat-pictures-pvc
kubectl patch pv cat-pictures-pv -p '{"spec":{"claimRef":{"uid":""}}}'

After that you need to replace the PersistentVolume’s access modes with ReadWriteOncePod:

kubectl patch pv cat-pictures-pv -p '{"spec":{"accessModes":["ReadWriteOncePod"]}}'

Note: The ReadWriteOncePod access mode cannot be combined with other access modes.
Make sure ReadWriteOncePod is the only access mode on the PersistentVolume when updating, otherwise the request will fail.

Next you need to modify your PersistentVolumeClaim to set ReadWriteOncePod as the only access mode.
You should also set your PersistentVolumeClaim’s spec.volumeName to the name of your PersistentVolume.
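
For reference, the edited cat-pictures-pvc.yaml would look roughly like this (the 1Gi request is an assumption; keep whatever size your PVC originally used):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: cat-pictures-pvc
spec:
  accessModes:
  - ReadWriteOncePod # Must be the only access mode.
  volumeName: cat-pictures-pv # Bind directly to the existing PersistentVolume.
  resources:
    requests:
      storage: 1Gi # Assumed size; match your original request.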

Once this is done, you can recreate your PersistentVolumeClaim and start up your workloads:

# IMPORTANT: Make sure to edit your PVC in cat-pictures-pvc.yaml before applying. You need to:
# - Set ReadWriteOncePod as the only access mode
# - Set spec.volumeName to "cat-pictures-pv"

kubectl apply -f cat-pictures-pvc.yaml
kubectl apply -f cat-pictures-writer-deployment.yaml

Lastly, you may edit your PersistentVolume’s spec.persistentVolumeReclaimPolicy and set it back to Delete if you previously changed it:

kubectl patch pv cat-pictures-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'

You can read Configure a Pod to Use a PersistentVolume for Storage for more details on working with PersistentVolumes and PersistentVolumeClaims.

What volume plugins support this?

The only volume plugins that support this are CSI drivers.
SIG Storage does not plan to support this for in-tree plugins because they are being deprecated as part of CSI migration.
Support may be considered for beta for users that prefer to use the legacy in-tree volume APIs with CSI migration enabled.

As a storage vendor, how do I add support for this access mode to my CSI driver?

The ReadWriteOncePod access mode will work out of the box without any required updates to CSI drivers, but does require updates to CSI sidecars.
With that being said, if you would like to stay up to date with the latest changes to the CSI specification (v1.5.0+), read on.

Two new access modes were introduced to the CSI specification in order to disambiguate the legacy SINGLE_NODE_WRITER access mode.
They are SINGLE_NODE_SINGLE_WRITER and SINGLE_NODE_MULTI_WRITER.
In order to communicate to sidecars (like the external-provisioner) that your driver understands and accepts these two new CSI access modes, your driver will also need to advertise the SINGLE_NODE_MULTI_WRITER capability for the controller service and node service.

If you’d like to read up on the motivation for these access modes and capability bits, you can also read the CSI Specification Changes, Volume Capabilities section of KEP-2485 (ReadWriteOncePod PersistentVolume Access Mode).

Update your CSI driver to use the new interface

As a first step you will need to update your driver’s container-storage-interface dependency to v1.5.0+, which contains support for these new access modes and capabilities.
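
For a Go-based driver this is typically a one-liner, assuming Go modules:

go get github.com/container-storage-interface/spec@v1.5.0
go mod tidy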

Accept new CSI access modes

If your CSI driver contains logic for validating CSI access modes for requests, it may need updating.
If it currently accepts SINGLE_NODE_WRITER, it should be updated to also accept SINGLE_NODE_SINGLE_WRITER and SINGLE_NODE_MULTI_WRITER.

Using the GCP PD CSI driver validation logic as an example, here is how it can be extended:

diff --git a/pkg/gce-pd-csi-driver/utils.go b/pkg/gce-pd-csi-driver/utils.go
index 281242c..b6c5229 100644
--- a/pkg/gce-pd-csi-driver/utils.go
+++ b/pkg/gce-pd-csi-driver/utils.go
@@ -123,6 +123,8 @@ func validateAccessMode(am *csi.VolumeCapability_AccessMode) error {
        case csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY:
        case csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY:
        case csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER:
+       case csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER:
+       case csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER:
        default:
                return fmt.Errorf("%v access mode is not supported for for PD", am.GetMode())
        }

Your CSI driver will also need to return the new SINGLE_NODE_MULTI_WRITER capability as part of the ControllerGetCapabilities and NodeGetCapabilities RPCs.

Using the GCP PD CSI driver capability advertisement logic as an example, here is how it can be extended:

diff --git a/pkg/gce-pd-csi-driver/gce-pd-driver.go b/pkg/gce-pd-csi-driver/gce-pd-driver.go
index 45903f3..0d7ea26 100644
--- a/pkg/gce-pd-csi-driver/gce-pd-driver.go
+++ b/pkg/gce-pd-csi-driver/gce-pd-driver.go
@@ -56,6 +56,8 @@ func (gceDriver *GCEDriver) SetupGCEDriver(name, vendorVersion string, extraVolu
                csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
                csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY,
                csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
+               csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER,
+               csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER,
        }
        gceDriver.AddVolumeCapabilityAccessModes(vcam)
        csc := []csi.ControllerServiceCapability_RPC_Type{
@@ -67,12 +69,14 @@ func (gceDriver *GCEDriver) SetupGCEDriver(name, vendorVersion string, extraVolu
                csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
                csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
                csi.ControllerServiceCapability_RPC_LIST_VOLUMES_PUBLISHED_NODES,
+               csi.ControllerServiceCapability_RPC_SINGLE_NODE_MULTI_WRITER,
        }
        gceDriver.AddControllerServiceCapabilities(csc)
        ns := []csi.NodeServiceCapability_RPC_Type{
                csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
                csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
                csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
+               csi.NodeServiceCapability_RPC_SINGLE_NODE_MULTI_WRITER,
        }
        gceDriver.AddNodeServiceCapabilities(ns)

Implement NodePublishVolume behavior

The CSI spec outlines expected behavior for the NodePublishVolume RPC when called more than once for the same volume but with different arguments (like the target path).
Please refer to the second table in the NodePublishVolume section of the CSI spec for more details on expected behavior when implementing this in your driver.
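
As a rough illustration only (not taken from any real driver; the in-memory bookkeeping and names below are assumptions made for brevity, since real drivers inspect mounts on disk), the decision logic might look like:

package driver

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// nodeServer is a hypothetical node service implementation.
type nodeServer struct {
	// published tracks publishes per volume ID and target path (a simplification).
	published map[string]map[string]*csi.NodePublishVolumeRequest
}

func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	volID, target := req.GetVolumeId(), req.GetTargetPath()
	if volID == "" || target == "" {
		return nil, status.Error(codes.InvalidArgument, "volume ID and target path are required")
	}
	if prev, ok := ns.published[volID][target]; ok {
		// Repeat call for the same volume and target path: succeed if the
		// arguments are compatible (idempotency), fail otherwise.
		if prev.GetReadonly() != req.GetReadonly() {
			return nil, status.Error(codes.AlreadyExists, "volume already published at this path with incompatible arguments")
		}
		return &csi.NodePublishVolumeResponse{}, nil
	}
	if len(ns.published[volID]) > 0 {
		// Same volume, different target path: only allowed when the access
		// mode permits publishing to multiple targets.
		mode := req.GetVolumeCapability().GetAccessMode().GetMode()
		if mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER {
			return nil, status.Error(codes.FailedPrecondition, "volume already published at another target path")
		}
	}
	// ... perform the actual mount here, then record the publish.
	if ns.published[volID] == nil {
		ns.published[volID] = map[string]*csi.NodePublishVolumeRequest{}
	}
	ns.published[volID][target] = req
	return &csi.NodePublishVolumeResponse{}, nil
}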

Update your CSI sidecars

When deploying your CSI drivers, you should update the following CSI sidecars to versions that depend on CSI spec v1.5.0+ and the Kubernetes v1.22 API.
The minimal required versions are:

What’s next?

As part of the beta graduation for this feature, SIG Storage plans to update the Kubernetes scheduler to support pod preemption in relation to ReadWriteOncePod storage.
This means if two pods request a PersistentVolumeClaim with ReadWriteOncePod, the pod with the highest priority will gain access to the PersistentVolumeClaim and any pod with lower priority will be preempted from the node and be unable to access the PersistentVolumeClaim.
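
For example, priority could be expressed with a PriorityClass (a sketch; the names and value below are arbitrary):

kind: PriorityClass
apiVersion: scheduling.k8s.io/v1
metadata:
  name: critical-writer # hypothetical name
value: 1000000
---
kind: Pod
apiVersion: v1
metadata:
  name: preferred-writer
spec:
  priorityClassName: critical-writer # this pod would win access to the PVC
  containers:
  - name: writer
    image: busybox:1.28 # hypothetical image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: single-writer-only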

How can I learn more?

Please see KEP-2485 for more details on the ReadWriteOncePod access mode and motivations for CSI spec changes.

How do I get involved?

The Kubernetes #csi Slack channel and any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and CSI teams.

Special thanks to the following people for their insightful reviews and design considerations:

  • Abdullah Gharaibeh (ahg-g)
  • Aldo Culquicondor (alculquicondor)
  • Ben Swartzlander (bswartz)
  • Deep Debroy (ddebroy)
  • Hemant Kumar (gnufied)
  • Humble Devassy Chirammal (humblec)
  • James DeFelice (jdef)
  • Jan Šafránek (jsafrane)
  • Jing Xu (jingxu97)
  • Jordan Liggitt (liggitt)
  • Michelle Au (msau42)
  • Saad Ali (saad-ali)
  • Tim Hockin (thockin)
  • Xing Yang (xing-yang)

If you’re interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG).
We’re rapidly growing and always welcome new contributors.
