Master TensorFlow Distributed Training: MirroredStrategy, TPUStrategy, and More

Content Overview

- Overview
- Set up TensorFlow
- Types of strategies
- MirroredStrategy
- TPUStrategy
- MultiWorkerMirroredStrategy
- ParameterServerStrategy
- CentralStorageStrategy
- Other strategies
- Use tf.distribute.Strategy with Keras Model.fit
- What's supported now?
- Examples and tutorials

Overview

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.

tf.distribute.Strategy has been designed with these key goals in mind:

- Easy to use and support multiple user segments, including researchers, machine learning engineers, etc.
- Provide good performance out of the box.
- Easy switching between strategies.

You can distribute training using tf.distribute.Strategy with a high-level API like Keras Model.fit, as well as custom training loops (and, in general, any computation using TensorFlow).

In TensorFlow 2.x, you can execute your programs eagerly, or in a graph using tf.function. tf.distribute.Strategy intends to support both these modes of execution, but works best with tf.function. Eager mode is only recommended for debugging purposes and is not supported for tf.distribute.TPUStrategy. Although training is the focus of this guide, this API can also be used for distributing evaluation and prediction on different platforms.

You can use tf.distribute.Strategy with very few changes to your code, because the underlying components of TensorFlow have been changed to become strategy-aware. This includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.

In this guide, you will learn about various types of strategies and how you can use them in different situations. To learn how to debug performance issues, check out the Optimize TensorFlow GPU performance guide.

:::tip
Note: For a deeper understanding of the concepts, watch the deep-dive presentation, Inside TensorFlow: tf.distribute.Strategy.
This is especially recommended if you plan to write your own training loop.
:::

Set up TensorFlow

    import tensorflow as tf

Output:

    2024-10-25 03:10:09.809713: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
    E0000 00:00:1729825809.832772 192915 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    E0000 00:00:1729825809.839425 192915 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

Types of strategies

tf.distribute.Strategy intends to cover a number of use cases along different axes. Some of these combinations are currently supported and others will be added in the future. Some of these axes are:

- Synchronous vs asynchronous training: These are two common ways of distributing training with data parallelism. In sync training, all workers train over different slices of input data in sync and aggregate gradients at each step. In async training, all workers train independently over the input data and update variables asynchronously. Typically, sync training is supported via all-reduce and async training through a parameter server architecture.
- Hardware platform: You may want to scale your training onto multiple GPUs on one machine, or multiple machines in a network (with 0 or more GPUs each), or on Cloud TPUs.

In order to support these use cases, TensorFlow has MirroredStrategy, TPUStrategy, MultiWorkerMirroredStrategy, ParameterServerStrategy, CentralStorageStrategy, as well as other strategies available. The next section explains which of these are supported in which scenarios in TensorFlow.
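
To make the synchronous versus asynchronous distinction concrete, here is a minimal pure-Python sketch on a toy one-parameter model. It is not TensorFlow code: the helper names (`gradient`, `sync_step`, `async_step`) and the data are hypothetical, and the asynchronous case is simulated sequentially, whereas real workers would run concurrently and may read stale values.

```python
# Toy model: predict y = w * x with a squared-error loss.
def gradient(w, shard):
    """Mean gradient of (w*x - y)**2 with respect to w over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, lr):
    # Synchronous: every worker uses the same w, and the gradients are
    # aggregated (here: averaged) before a single shared update.
    grads = [gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


def async_step(w, shards, lr):
    # Asynchronous (simplified): each worker applies its own update as soon
    # as it is ready, so later workers see an already-modified w.
    for shard in shards:
        w = w - lr * gradient(w, shard)
    return w


# Two workers, each holding one shard of data generated by y = 2 * x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w_sync = w_async = 0.0
for _ in range(50):
    w_sync = sync_step(w_sync, shards, lr=0.05)
    w_async = async_step(w_async, shards, lr=0.05)

print(w_sync, w_async)
```

Both variants approach w = 2 on this toy data; what differs is when each worker's gradient takes effect.
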
Here is a quick overview:

| Training API | MirroredStrategy | TPUStrategy | MultiWorkerMirroredStrategy | CentralStorageStrategy | ParameterServerStrategy |
|----|----|----|----|----|----|
| Keras Model.fit | Supported | Supported | Supported | Experimental support | Experimental support |
| Custom training loop | Supported | Supported | Supported | Experimental support | Experimental support |
| Estimator API | Limited Support | Not supported | Limited Support | Limited Support | Limited Support |

:::tip
Note: Experimental support means the APIs are not covered by any compatibility guarantees.
:::

:::warning
Warning: Estimator support is limited. Basic training and evaluation are experimental, and advanced features, such as scaffold, are not implemented. You should be using Keras or custom training loops if a use case is not covered. Estimators are not recommended for new code. Estimators run v1.Session-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our compatibility guarantees, but will receive no fixes other than security vulnerabilities. Go to the migration guide for details.
:::

MirroredStrategy

tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.

Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up, and makes them available on each device. It's a fused algorithm that is very efficient and can reduce the overhead of synchronization significantly.
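
As a rough illustration of the all-reduce semantics described above (sum across devices, with the result available on every device), consider this pure-Python sketch. The function name `all_reduce_sum` and the list-of-lists representation are assumptions made for illustration only; real implementations such as NCCL use communication patterns that avoid gathering everything in one place.

```python
def all_reduce_sum(per_device_values):
    """Sum values element-wise across devices and hand every device the total.

    per_device_values: one list of numbers per device, all the same length.
    Returns one list per device, each holding the same summed values.
    """
    total = [sum(column) for column in zip(*per_device_values)]
    # Every device receives its own copy of the aggregated result.
    return [list(total) for _ in per_device_values]


# Hypothetical gradients computed on two devices for a two-parameter model.
grads_per_device = [[1.0, 2.0], [3.0, 4.0]]
reduced = all_reduce_sum(grads_per_device)
print(reduced)
```

Because each replica ends up with the same aggregated values, applying them keeps the mirrored copies of a variable identical.
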
There are many all-reduce algorithms and implementations available, depending on the type of communication available between devices. By default, MirroredStrategy uses the NVIDIA Collective Communication Library (NCCL) as the all-reduce implementation. You can choose from a few other options or write your own.

Here is the simplest way of creating MirroredStrategy:

    mirrored_strategy = tf.distribute.MirroredStrategy()

Output:

    INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
    W0000 00:00:1729825812.490898 192915 gpu_device.cc:2344] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    Skipping registering GPU devices...

This will create a MirroredStrategy instance, which will use all the GPUs that are visible to TensorFlow, and NCCL as the cross-device communication.

If you wish to use only some of the GPUs on your machine, you can do so like this:

    mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

Output:

    INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

If you wish to override the cross-device communication, you can do so using the cross_device_ops argument by supplying an instance of tf.distribute.CrossDeviceOps. Currently, tf.distribute.HierarchicalCopyAllReduce and tf.distribute.ReductionToOneDevice are two options other than tf.distribute.NcclAllReduce, which is the default.

    mirrored_strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

Output:

    INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)

TPUStrategy

tf.distribute.TPUStrategy lets you run your TensorFlow training on Tensor Processing Units (TPUs).
TPUs are Google's specialized ASICs designed to dramatically accelerate machine learning workloads. They are available on Google Colab, the TPU Research Cloud, and Cloud TPU.

In terms of distributed training architecture, TPUStrategy is the same as MirroredStrategy: it implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in TPUStrategy.

Here is how you would instantiate TPUStrategy:

:::tip
Note: To run any TPU code in Colab, you should select TPU as the Colab runtime. Refer to the Use TPUs guide for a complete example.
:::

    # tpu_address is not defined in this snippet; set it to your TPU's name or address.
    cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(cluster_resolver)
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
    tpu_strategy = tf.distribute.TPUStrategy(cluster_resolver)

The TPUClusterResolver instance helps locate the TPUs. In Colab, you don't need to specify any arguments to it.

If you want to use this for Cloud TPUs:

- You must specify the name of your TPU resource in the tpu argument.
- You must initialize the TPU system explicitly at the start of the program. This is required before TPUs can be used for computation. Initializing the TPU system also wipes out the TPU memory, so it's important to complete this step first in order to avoid losing state.

MultiWorkerMirroredStrategy

tf.distribute.MultiWorkerMirroredStrategy is very similar to MirroredStrategy. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to tf.distribute.MirroredStrategy, it creates copies of all variables in the model on each device across all workers.

Here is the simplest way of creating MultiWorkerMirroredStrategy:

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

Output:

    WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
    INFO:tensorflow:Using MirroredStrategy with devices ('/device:CPU:0',)
    INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO

MultiWorkerMirroredStrategy has two implementations for cross-device communications. CommunicationImplementation.RING is RPC-based and supports both CPUs and GPUs. CommunicationImplementation.NCCL uses NCCL and provides state-of-the-art performance on GPUs, but it doesn't support CPUs. CommunicationImplementation.AUTO defers the choice to TensorFlow. You can specify them in the following way:

    communication_options = tf.distribute.experimental.CommunicationOptions(
        implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
    strategy = tf.distribute.MultiWorkerMirroredStrategy(
        communication_options=communication_options)

Output:

    WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
    INFO:tensorflow:Using MirroredStrategy with devices ('/device:CPU:0',)
    WARNING:tensorflow:Enabled NCCL communication but no GPUs detected/specified.
    INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.NCCL

One of the key differences in getting multi-worker training going, as compared to multi-GPU training, is the multi-worker setup. The 'TF_CONFIG' environment variable is the standard way in TensorFlow to specify the cluster configuration to each worker that is part of the cluster. Learn more in the setting up TF_CONFIG section of this document.

For more details about MultiWorkerMirroredStrategy, consider the following tutorials:

- Multi-worker training with Keras Model.fit
- Multi-worker training with a custom training loop

ParameterServerStrategy

Parameter server training is a common data-parallel method to scale up model training on multiple machines.
A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and they are read and updated by workers in each step. Check out the Parameter server training tutorial for details.

In TensorFlow 2, parameter server training uses a central coordinator-based architecture via the tf.distribute.experimental.coordinator.ClusterCoordinator class.

In this implementation, the worker and parameter server tasks run tf.distribute.Server instances that listen for tasks from the coordinator. The coordinator creates resources, dispatches training tasks, writes checkpoints, and deals with task failures.

In the program running on the coordinator, you will use a ParameterServerStrategy object to define a training step and use a ClusterCoordinator to dispatch training steps to remote workers. Here is the simplest way to create them:

    # variable_partitioner is not defined in this snippet; supply your own partitioner.
    strategy = tf.distribute.experimental.ParameterServerStrategy(
        tf.distribute.cluster_resolver.TFConfigClusterResolver(),
        variable_partitioner=variable_partitioner)
    coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
        strategy)

To learn more about ParameterServerStrategy, check out the Parameter server training with Keras Model.fit and a custom training loop tutorial.

:::tip
Note: You will need to configure the 'TF_CONFIG' environment variable if you use TFConfigClusterResolver. It is similar to 'TF_CONFIG' in MultiWorkerMirroredStrategy but has additional caveats.
:::

In TensorFlow 1, ParameterServerStrategy is available only with an Estimator via the tf.compat.v1.distribute.experimental.ParameterServerStrategy symbol.

:::tip
Note: This strategy is experimental as it is currently under active development.
:::

CentralStorageStrategy

tf.distribute.experimental.CentralStorageStrategy does synchronous training as well. Variables are not mirrored; instead, they are placed on the CPU and operations are replicated across all local GPUs.
If there is only one GPU, all variables and operations will be placed on that GPU.

Create an instance of CentralStorageStrategy by:

    central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()

Output:

    INFO:tensorflow:ParameterServerStrategy (CentralStorageStrategy if you are using a single machine) with compute_devices = ['/job:localhost/replica:0/task:0/device:CPU:0'], variable_device = '/job:localhost/replica:0/task:0/device:CPU:0'

This will create a CentralStorageStrategy instance which will use all visible GPUs and CPU. Updates to variables on replicas will be aggregated before being applied to variables.

:::tip
Note: This strategy is experimental, as it is currently a work in progress.
:::

Other strategies

In addition to the above strategies, there are two other strategies which might be useful for prototyping and debugging when using tf.distribute APIs.

Default Strategy

The Default Strategy is a distribution strategy which is present when no explicit distribution strategy is in scope. It implements the tf.distribute.Strategy interface but is a pass-through and provides no actual distribution. For instance, Strategy.run(fn) will simply call fn. Code written using this strategy should behave exactly as code written without any strategy. You can think of it as a "no-op" strategy.

The Default Strategy is a singleton, and one cannot create more instances of it. It can be obtained using tf.distribute.get_strategy outside any explicit strategy's scope (the same API that can be used to get the current strategy inside an explicit strategy's scope).

    default_strategy = tf.distribute.get_strategy()

This strategy serves two main purposes:

- It allows writing distribution-aware library code unconditionally.
  For example, in tf.keras.optimizers you can use tf.distribute.get_strategy and use that strategy for reducing gradients: it will always return a strategy object on which you can call the Strategy.reduce API.

      # In optimizer or other library code
      # Get currently active strategy
      strategy = tf.distribute.get_strategy()
      strategy.reduce("SUM", 1., axis=None)  # reduce some values

  Output:

      1.0

- Similar to library code, it can be used to write end users' programs to work with and without distribution strategy, without requiring conditional logic. Here's a sample code snippet illustrating this:

      if tf.config.list_physical_devices('GPU'):
          strategy = tf.distribute.MirroredStrategy()
      else:  # Use the Default Strategy
          strategy = tf.distribute.get_strategy()

      with strategy.scope():
          # Do something interesting
          print(tf.Variable(1.))

OneDeviceStrategy

tf.distribute.OneDeviceStrategy is a strategy to place all variables and computation on a single specified device.

    strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

This strategy is distinct from the Default Strategy in a number of ways. In the Default Strategy, the variable placement logic remains unchanged when compared to running TensorFlow without any distribution strategy. But when using OneDeviceStrategy, all variables created in its scope are explicitly placed on the specified device. Moreover, any functions called via OneDeviceStrategy.run will also be placed on the specified device.

Input distributed through this strategy will be prefetched to the specified device. In the Default Strategy, there is no input distribution.

Similar to the Default Strategy, this strategy could also be used to test your code before switching to other strategies which actually distribute to multiple devices/machines. This will exercise the distribution strategy machinery somewhat more than the Default Strategy, but not to the full extent of using, for example, MirroredStrategy or TPUStrategy.
If you want code that behaves as if there is no strategy, then use the Default Strategy.

So far you've learned about different strategies and how you can instantiate them. The next few sections show the different ways in which you can use them to distribute your training.

Use tf.distribute.Strategy with Keras Model.fit

tf.distribute.Strategy is integrated into tf.keras, which is TensorFlow's implementation of the Keras API specification. tf.keras is a high-level API to build and train models. By integrating into the tf.keras backend, it's seamless for you to distribute your training written in the Keras training framework using Model.fit.

Here's what you need to change in your code:

- Create an instance of the appropriate tf.distribute.Strategy.
- Move the creation of Keras model, optimizer and metrics inside strategy.scope. Thus the code in the model's call(), train_step(), and test_step() methods will all be distributed and executed on the accelerator(s).

TensorFlow distribution strategies support all types of Keras models: Sequential, Functional, and subclassed.

Here is a snippet of code to do this for a very simple Keras model with one Dense layer:

    mirrored_strategy = tf.distribute.MirroredStrategy()

    with mirrored_strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(
                1, input_shape=(1,),
                kernel_regularizer=tf.keras.regularizers.L2(1e-4))])
        model.compile(loss='mse', optimizer='sgd')

Output:

    INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
    /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
      super().__init__(activity_regularizer=activity_regularizer, **kwargs)

This example uses MirroredStrategy, so you can run this on a machine with multiple GPUs.
strategy.scope() indicates to Keras which strategy to use to distribute the training. Creating models/optimizers/metrics inside this scope allows you to create distributed variables instead of regular variables. Once this is set up, you can fit your model like you would normally. MirroredStrategy takes care of replicating the model's training on the available GPUs, aggregating gradients, and more.

    dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
    model.fit(dataset, epochs=2)
    model.evaluate(dataset)

Output:

    2024-10-25 03:10:12.686928: W tensorflow/core/framework/dataset.cc:993] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
    Epoch 1/2
     1/10 ━━━━━━━━━━━━━━━━━━━━ 2s 287ms/step - loss: 0.6492
    10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.5412
    Epoch 2/2
     1/10 ━━━━━━━━━━━━━━━━━━━━ 0s 61ms/step - loss: 0.2869
    10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.2392
     1/10 ━━━━━━━━━━━━━━━━━━━━ 1s 168ms/step - loss: 0.1268
    10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1268
    0.1268472820520401

Here a tf.data.Dataset provides the training and eval input. You can also use NumPy arrays:

    import numpy as np

    inputs, targets = np.ones((100, 1)), np.ones((100, 1))
    model.fit(inputs, targets, epochs=2, batch_size=10)

Output:

    Epoch 1/2
     2/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1244
    10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1058
    Epoch 2/2
     3/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.0539
    10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.0468

In both cases, with Dataset or NumPy, each batch of the given input is divided equally among the multiple replicas. For instance, if you are using the MirroredStrategy with 2 GPUs, each batch of size 10 will be divided among the 2 GPUs, with each receiving 5 input examples in each step. Each epoch will then train faster as you add more GPUs.
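
The arithmetic in the example above can be sketched in plain Python. `split_batch` is a hypothetical helper, not a TensorFlow API, and it assumes the global batch size is divisible by the number of replicas; how TensorFlow actually shards input is more involved.

```python
def split_batch(batch, num_replicas):
    """Divide one global batch into equal, contiguous per-replica slices."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    per_replica = len(batch) // num_replicas
    return [batch[i * per_replica:(i + 1) * per_replica]
            for i in range(num_replicas)]


# A global batch of 10 examples shared between 2 replicas (for example, 2 GPUs).
global_batch = list(range(10))
per_replica_batches = split_batch(global_batch, num_replicas=2)

for replica_id, examples in enumerate(per_replica_batches):
    print(f"replica {replica_id}: {len(examples)} examples")
```

With the global batch size held fixed, adding replicas shrinks each replica's share, which is why the batch size is usually scaled up together with the number of accelerators.
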
Typically, you would want to increase your batch size as you add more accelerators, so as to make effective use of the extra computing power. You will also need to re-tune your learning rate, depending on the model. You can use strategy.num_replicas_in_sync to get the number of replicas.

    mirrored_strategy.num_replicas_in_sync

Output:

    1

The replica count can then be used to compute a global batch size and to pick a learning rate:

    # Compute a global batch size using a number of replicas.
    BATCH_SIZE_PER_REPLICA = 5
    global_batch_size = (BATCH_SIZE_PER_REPLICA *
                         mirrored_strategy.num_replicas_in_sync)
    dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100)
    dataset = dataset.batch(global_batch_size)

    LEARNING_RATES_BY_BATCH_SIZE = {5: 0.1, 10: 0.15, 20: 0.175}
    learning_rate = LEARNING_RATES_BY_BATCH_SIZE[global_batch_size]

What's supported now?

| Training API | MirroredStrategy | TPUStrategy | MultiWorkerMirroredStrategy | ParameterServerStrategy | CentralStorageStrategy |
|----|----|----|----|----|----|
| Keras Model.fit | Supported | Supported | Supported | Experimental support | Experimental support |

Examples and tutorials

Here is a list of tutorials and examples that illustrate the above integration end-to-end with Keras Model.fit:

- Tutorial: Training with Model.fit and MirroredStrategy.
- Tutorial: Training with Model.fit and MultiWorkerMirroredStrategy.
- Guide: Contains an example of using Model.fit and TPUStrategy.
- Tutorial: Parameter server training with Model.fit and ParameterServerStrategy.
- Tutorial: Fine-tuning BERT for many tasks from the GLUE benchmark with Model.fit and TPUStrategy.
- TensorFlow Model Garden repository containing collections of state-of-the-art models implemented using various strategies.

:::info
Originally published on the TensorFlow website, this article appears here under a new headline and is licensed under CC BY 4.0. Code samples shared under the Apache 2.0 License.
:::