It Takes Two – Text & Image in Comics

So, comics as an art form! A truly legitimate art form, unique and self-actualized, with debacles and triumphs all its own. Not the bastard child of film and literature, and not just for kids, male adolescents, or lonely 30-year-old men! Comics can be for anyone! Comics have come a long way. But like any art form, along the way, comics has stopped and asked, “What am I?”

Scott McCloud puts it forth in his Understanding Comics. His definition is technically satisfactory, and manages to be inclusive of all things we consider comics. But even after digesting his book and tons of comics afterwards, I remain unsatisfied with this definition. Why?

Text. Despite lack of text not excluding any particular work from fulfilling the definition of a comic, there is a great supply of text amassed within comics. A multitude of printed words take up sizable chunks of space on the vast majority of comic pages. Clearly they play an important role, or they wouldn’t be there! What is that role?

McCloud spends a whole chapter talking about it, but nothing seems to get specific. He lists the various ways pictures and words can be combined, illustrates their differences, and leaves us with a final quote, “However much we chart these things, they’re all ultimately best left to the creator’s instincts. The mixing of words and pictures is more alchemy than science.”

When I first read the book, I took that as the final word on the matter – that mixing word and image was at best a foggy science or a mystic storm. But as I’ve continued my forays into the world of comics, as both aspiring creator and reader, that has come to seem less and less true. Yes, words and pictures can be combined in innumerable ways – but some ways are better than others. What ways are those, and why?

That is the question I seek to answer today.

As an introductory note, for the purposes of this article, I will be treating comics purely as a storytelling medium. It certainly has the capacity to take on other topics, but comics predominantly exist to tell stories, much like modern film, and it is in that context that I continue this quest.

Visual Storytelling & the Image

To understand why the written word is in comics, we must first understand why the image is in comics. What purpose does the image serve?

To tell the story in the most complete and engaging manner that it can. The audience should be able to tell what is generally occurring to the characters – where, how, and when, through the pictures.

But basic legibility, while a requirement, is not the only concern in imagery. Pictures can also convey an astonishing range of emotional impacts, suggest doom or happiness, tell us who the main character is, give us clues as to the emotional states of that character (or any character, really), and when placed into a sequence of images, build up tension or release it. The power of the image in sequence is monumental, and its power is what makes comics, films, and storyboards tick – whether in a drawing or on the camera, the principles of visual storytelling remain constant.

Here is an example:

This is from Barry Windsor-Smith and Roy Thomas’s Savage Tales #2, an adaptation of Robert E. Howard’s “Red Nails.” I have removed the text in order to illustrate what I mean by visual storytelling.

In the first panel, we see a man running towards the corpse of another man – he’s staggering, so he appears to be concerned, or at least tired. The second panel confirms this, as the man stoops down and examines the corpse, cradling its head. This establishes that the two men were allies or, at the very least, acquaintances. The background disappears in the third panel and the man looks up from the corpse – the omission of the surrounding space and his pose clue us in – he’s surprised, and looking at something. In the next panel, we see something coming out of the blackness, and we are down low, on the ground plane, at even level with the kneeling man. By placing us looking up at the new threat, the artist is invoking a sense of awe and fear at the dark shape. The next panel zooms in, heightening the tension and revealing the dreaded assailant, as well as our hero reaching for his weapon. For in this panel, he is the hero – we, the viewers, are placed on his level, looking at what he sees, fearing what he fears. We are subconsciously his ally, if not the man himself, and we sympathize – visually.

The next panel pans up to a woman. We know it pans up because of a few factors – the main one being she’d been established as above them on the last page, but the image itself is lit from below, and the woman is looking down – that’s enough visual clue for us to place her in context. If those cues were not there, even had it been established that she was above them on the last page, it would be visually confusing, and subconsciously, we’d be wondering where in the heck she is in relation to what’s occurring. The simple cues given are elementary, yet essential to clarity.

In any case, the woman is also clearly horrified by the sight of the glowing skeletal figure. In the next panel, we zoom in on the enemy, heightening the tension we’ve experienced from watching him emerge. Again we are placed behind the man as his ally, but this time he is inert – paralyzed in fear. The final panel zooms in once again, building our tension to a breaking point as we witness the dread killer raising its sword to strike.

On the next page, the tension breaks with a sweeping down-shot (looking down, that is) as our heroine leaps over the balustrade to combat the evildoer! We are now placed behind her shoulder and assume the position of her ally or advocate. (Placing the viewer shoulder-to-shoulder with a character is not the only way to evoke sympathy visually, but it’s a good one.) Moving forward, the second panel on the page re-establishes the location of all parties with the enemy now turning to face our intrepid heroine.

She strikes! We zoom in to witness a clean hit, the subsequent panel zooming out (releasing the tension) and showing the monster’s head falling from its body. The fifth panel zooms out still further, again re-establishing the location of all parties, and a view downward relieving the fear. (Basic principle: looking up at something, the object witnessed is powerful. Looking down, the viewer is powerful. This relates to security, and thus to safety.) The sixth panel shows the man and woman conversing – he seems quite energetic. She, by contrast, looks tense and worried. We are now standing “with” the man once again, but we can see the woman’s face – and this is another way of identifying who the main character is. So with the viewer behind the man and looking to the woman’s face, the net effect is that the reader sympathizes with both man and woman. To assist this more, the shot is also at an even level, placing us on equal terms with the protagonists.

The last panel zooms in dramatically for a slight up-shot (one looking up) of both our hero and heroine’s faces. Notice again that ever since the down-shot panel after the woman killed the skeleton, the panels have been steadily zooming in – tension has built again, and the surprised / worried expressions of the two protagonists are crystal clear – they’re not off the hook yet.

Wow. All that from pictures? Of course, we don’t consciously process all of that every time we read a comic – it works subconsciously, and that is half of its power.

But hang on, this article was about text, wasn’t it? If we can get that much of the story from pictures, what purpose does text serve?

The immediate answer that comes to mind is that text should clarify, make specific, and enrich the reading experience.

Boy, is that a slog.  All the text does is simply reiterate what the images are already saying!  Sure, “A fierce exultation sweeps her, as she feels it cleave solid flesh and mortal bone!” is poetic and all, but for an entire two pages, the only nugget of information given that was not eminently clear in the images is that the glowing skeleton was not in fact a glowing skeleton, but a man dressed like one. This could be ascertained by looking carefully at the images anyway!

This kind of redundancy slows the story down considerably, to the point where the comic actually gets in its own way. It is painful to read, and even insulting to any reader’s intelligence. Consequently, many people describe these stories as corny, cheesy, and melodramatic. After this analysis, it’s no wonder why!

So text clearly should not re-state what the images are already saying. What then is the solution?

1. The Cinematic Solution

The most obvious solution to this is to keep text to a minimum – using it only to explain that which needs to be expounded upon specifically, and to enrich the experience selectively. Here’s a good example, coming from the pages of Frank Miller and David Mazzucchelli’s Daredevil: Born Again.

The text is only dialogue and a small bit of what amount to thought balloons for Karen Page – expounding upon her distressing state of affairs and her decision to try for one last shot of heroin before an abusive gangster takes her life. It supplies only detail, subtext, and context. The images supply the rest, and “the rest” is a rather large bulk of the story.

So is that the only way to use text? Simply to reinforce the story already told through images? Seems underwhelming. Is comics not then simply glorified film storyboards?  I thought you said we were a legitimate and unique art form back at the beginning of this whole debacle!

Let’s step back to Scott McCloud for a minute.  In his chapter on text, he notes, “In comics at its best, words and pictures are like partners in a dance and each one takes turns leading.  When both partners try to lead, the competition can subvert the overall goals.” If we take this idea as true, then what happens when we switch the dominant role? What happens when we let text take over the bulk of the storytelling, and let image provide the detail and context?

2. The Literary Solution

This happens:

Despite being a page fragment from Alan Moore and Dave Gibbons’ Watchmen, this sequence of panels poignantly illustrates the idea of what I call the literary solution. The images alone tell us relatively little. Rorschach is writing, Rorschach is walking, and Rorschach is looking down a ladder. It’s worth noting that the change to upshot on the last panel as well as the color palette does create some sense of unease, but we have little precedent for what this might be. The scene then changes, to Rorschach’s shadow in front of Happy Harry’s bar. The final panel shows us the would-be superhero framed ominously in the doorway of what appears to be about the seediest bar on earth.

That’s it! That’s all we got. Why show that? What possible reason could it serve in a story? It’s just a transition, after all.

Now read the text. The story presents itself. We are instantly told of a man on a hunt – the investigation that drives his search, and the record that he keeps of it. In fact, if you disregarded the images entirely, the prose alone would suffice to get across the general idea of what is occurring. Instead of scanning over the text and focusing on the imagery as one normally does in a “cinematic” comic (and to be sure, comics of the “cinematic” variety are far more common than those of this type – not that both approaches cannot be in one story), the image takes on a secondary role, serving to provide the context of where Rorschach is, and the details of his surrounding environment. His quest, his emotions, and the urgency of the situation are displayed through text alone.

3. The “Middle Ground”

So we can run the gamut between word and image in terms of storytelling weight. But is there a middle ground? Can we have text and image be equally important without each becoming redundant as we established back in the first section? Is McCloud right that image and text cannot both lead?


Look at the images. They clearly tell of the Comedian’s changing emotions. He begins as distressed, then he gets angry, even threatening. Despite the fact that we are at a fixed point of view, Gibbons still manages to simulate a “zoom-in” by having the Comedian get closer as he becomes more and more angry. Then he walks back, clearly worried or confused. He takes a long, lovely drink, and then sits back down to break into tears. He seems to be reasoning and protesting until he falls to his knees, cradling a statue of a woman. Again he becomes threatening, the color palette dulling, and the tension builds to a climax as the Comedian’s tear-streaked face, full of doubt and even pleading, fills the panel. The tension breaks with a slight zoom-out into Rorschach, pulling us back to the current time-frame.

Though we have no specifics, we know essentially what happened. Now look at the text alone. It tells a complete story in and of itself, utterly independent of the images. We see the Comedian’s same arc of emotions simply through what he says and how he says it, but we also see exactly what he’s talking about – the “joke” that he doesn’t get and how it’s connected to everyone. The image and the text tell two independent, yet highly related stories. Neither one is bearing a larger brunt of the storytelling burden, for without the text we don’t really know what is happening and why it is happening, and without the image we don’t really know where it’s happening, or how and when.

Thus both image and text interweave to create a seamless story effect that puts equal emphasis on both players – it is the perfect balance, the climax of the dance, if you would. This is where comics shines.

That’s not to say that a comic can or even should be only this “middle ground” solution all the way through – Watchmen, after all, employs all three solutions that we’ve discussed, and to great effect. Each manner of storytelling has its strengths and weaknesses, and where and when to use each type is a choice that will greatly affect how an audience absorbs a story. So in the end, though I disagree on a few points, Scott’s right – it is sort of like a dance.



David Balan is a current student and aspiring comics creator, studying sequential art at the Savannah College of Art and Design in Savannah, Georgia. He is working on becoming both a writer and an artist, and he plans to eventually script and draw his own complete graphic novels. You can see his most current portfolio at



  1. I agree with what you’ve said here, and you’re right in condemning the redundant text that was once commonplace.

    Is it possible that, at least for most works, some combination of these approaches works best? In other words, the cinematic approach tends to work for action scenes, and the literary approach tends to work in the Watchmen example above. There’s nothing worse than action with very heavy captioning and dialogue, unless that’s to deliberately undercut the action, such as when it’s described later (e.g. in court). I can imagine a chase comic or something that could be entirely literary, but I tend to side with the idea that diversity is key, at least in long-form narrative, and that knowing when to use which approach is key to competence as a creator.

    Thanks for your thoughts! And I especially appreciate the careful walk-throughs of page layout and composition, which people don’t do enough of. :)

    • David Balan says:

      It’s not only possible, I would consider it necessary. The proportion of each one to the other would likely change depending on the story, and it’s not like there’s ONLY three ways to combine them – I’ve illustrated the extremes, but a lot of comics simply ‘lean’ one way or the other – image or text bearing the main brunt depending on the scene – rather than the actually quite distinctive approaches in Watchmen. One big rule of any type of storytelling is that any technique of presentation used too much loses its power – this applies to these kinds of combinations too – if you do a comic in ALL of one, it quickly becomes tiring.

      Using the Comedian page as an example, Moore and Gibbons saved that “middle ground” approach for that specific scene, which is high drama, delivers a lot of information, and amounts to the climax of that chapter. Saving it for that scene increased the power of it.

      I love breaking down pages – it’s immensely helpful to be able to do it as a creator, and actually it helps my own analyses to be required to communicate it to someone who potentially knows nothing about what I’m saying (anonymous Sequart reader, as it were) – it increases my own understanding.

      • I agree completely with what you’re saying, and I get that it’s a spectrum. It’s funny that people think of Watchmen as fairly text-heavy for a contemporary comic, but there are long passages with no dialogue or captions… maybe it’s just the 9-panel grid that throws people off, whereas the splash page or a more obvious panel composition draws attention to the image.

        Which brings up panel size and shape, which is a whole other dimension to this topic.

        I really do like the page breakdown approach. It doesn’t work for everything, but it’s sometimes so precise, able to nail things down so well. And yes, make sure that we’re all talking about the same thing.

      • David Balan says:

        Replying to message below!

        Well, Watchmen by the standards of the -comic industry- is text-heavy. One thing I didn’t even get into is the pure text ‘bookends’ in Watchmen which add a great deal of texture and power to the story itself – formally, Moore and Gibbons run the entire gamut of comics in that book – pure image, pure text, and everything in between. I think it’s ingenious, and a great way to utilize each tool at your disposal.

        As for panel size and shape, you’re right, it’s a whole other ballgame in terms of storytelling, and it’s one of the tools we have that no other visual medium does. You could easily film a sequence similar to the Comedian page I examined – and it would still be powerful. But if we consider the Barry Smith page up above, the artist is using the size of panels to his advantage very much – that becomes more and more apparent if you read, say, an entire issue or story. Watchmen does it too, just in a much more structured manner. It’s very appropriate for that story, but I wouldn’t say it’s a good way for every story.

        I may get into size & shape as the unique sticking point in terms of visual storytelling in another article – it’s a fascinating topic, one that I haven’t explored as fully as I’d like yet, either.

  2. Could you please upload or provide a link to view the pages/extracts that you use as examples in this article? Specifically, I would really enjoy seeing the images for Alan Moore and Dave Gibbons’ Watchmen and Barry Windsor-Smith and Roy Thomas’s Savage Tales #2 to be able to accurately apply what you have described from the comic books.
