Embodied views of language comprehension argue that the body’s perceptual and motor systems ground linguistic meaning. One source of support for this view is the sentence-picture verification (SPV) shape match effect, in which observers read a sentence that implies – but does not explicitly describe – an object’s shape and then report whether a pictured object was mentioned in the sentence. Participants are typically faster to verify that a sentence mentioned an object when the pictured object’s shape matches the shape implied by the sentence’s context. However, several high-profile studies have failed to replicate this and other key results supporting the sensorimotor simulation view, raising questions about the extent to which language comprehension relies upon modal representations. One explanation for these conflicting findings is that individuals do not obligatorily tap perceptual and motor systems to understand language but instead use sensorimotor simulation flexibly, depending upon contextual factors such as task demands. I investigated this possibility by asking participants to perform the SPV task while concurrently holding either auditory or visual information in working memory. I found that higher working memory loads in both the auditory and the visual domains attenuated the SPV match advantage. This finding suggests that sensorimotor simulation is not necessary for language comprehension but may instead depend in part upon the availability of working memory resources.