Distant Viewing



The first and current phase of the project has four concrete outputs:

The remainder of this section gives a long-form description of where our project fits in the current research in computer science, digital humanities, and media studies. It is an edited version of the grant narrative submitted for an NEH Digital Advancement Grant.

Enhancing the Humanities

Distant Viewing TV applies computational methods to the study of television series, utilizing and developing cutting-edge techniques in computer vision to analyze moving image culture on a large scale. The project analyzes how visual space is used by characters over a set of fourteen sitcoms from the Network Era of American Television (1952-1985), modeling a new mode of cultural analysis within TV studies. Given that long-running television series broadcast hundreds of episodes, and the major networks run dozens of series each season, previous studies of network television have had to rely on a close analysis of a subset of series, episodes, and scenes (Baughman 1993; Dow 1996; Morley 2003; Spangler 2003). Distant Viewing TV builds on these studies with computational approaches that can analyze the contents of tens of thousands of hours of television programming. The analytical approaches offered by computer vision and what Lev Manovich has termed “cultural analytics” (2011) allow the project to compare and contrast within and between television series at unprecedented scale. To do this, we are building the DTV Toolkit, software that automatically extracts metadata features from a corpus of moving images. Specifically, it determines the placement and identities of faces in every shot and the identity of the current scene location. Finally, Distant Viewing TV makes visualizations and new research on copyrighted moving image culture accessible by establishing an interactive, public interface for the dissemination of TV scholarship.

The project is situated at the intersection of media and television studies and digital humanities, where television is garnering increased attention. Digital humanities’ focus on text and, increasingly, methodologies such as distant reading and macroanalysis has produced exciting interventions in digital humanities and fields such as literary studies (Jockers 2013; Moretti 2013). However, there is an increasing call to take seriously visual culture and moving images as objects of study (McPherson 2009; Posner 2013; Acland and Hoyt 2016). This shift has come thanks to emerging computer vision tools that are increasingly open-source. The need for and recent launch of Cultural Analytics, an open-access journal for the computational study of culture, is a testament to the growing field of scholars eager to use computational techniques to study a variety of cultural forms. Distant Viewing TV is a part of this growing body of work that is innovating within the digital humanities to ask new questions in media studies.

Like the digital humanities, media studies has privileged textual sources and feature-length film. As described by Fiske (1978): “Television suffered categorical disadvantage in repute ... its characteristic was oral [and visual] not literate, whereas ‘dominant culture’ ... was militantly committed to print-literacy and the values associated with that.” From the 1950s onward, however, television has arguably served as one of the dominant sources, if not the dominant source, of mass entertainment in the United States. By 1959, over 83% of households in the US owned their own television set (Baughman 1993). While there is a growing body of television scholarship in media studies, now known as TV studies, the leading scholarship in the field focuses on “prestige television,” a term used to describe TV that is designed to appeal to middle- and upper-class viewers, critics, and awards committees (Lotz 2014; Thompson and Mittell 2013; Williams 2014). The challenge with this increased attention to prestige television is that it reinforces a bias toward “cinematic” or “novelistic” television and deepens an ingrained view of Network Era comedy as formally simple, even simple-minded, without any artistry.

A handful of television scholars take form seriously. Jason Mittell and John Caldwell are interested in the sophistication or stylization of television form in a contemporary context, while Jonah Horwitz’s study of post-war television form focuses on live theatrical playhouse broadcasts (Caldwell 1995; Horwitz 2014; Mittell 2015). None of these scholars look at Network Era television as a specific formation with common formal patterns or use computer vision in the course of their argumentation. The gap in this literature reflects, first, a scholarly emphasis on the “close reading” of middle-brow media and, second, the challenges of studying the volume of television produced during these years. Bringing a “distant reading” approach, or what we term “distant viewing,” to these programs will shed light on their formal rigor while making possible a comprehensive study of American television comedies over a thirty-year span.

While the study of prestige television’s high production values and experimental techniques makes the genre ideally suited to close reading, the repetitiousness and genre conventions of studio filming are best served by a distant viewing practice in which computational tools uncover the subtleties of form and the evolution of a program’s style over time. For example, All in the Family is strongly associated with one set: Archie Bunker’s living room. The depth and quality of the show are produced by, not constrained by, the limited set. Computational tools are needed to uncover the complexities of the multi-camera set-up, in which the cameras are largely stationary and the action is filmed in a sound stage or studio. For example, political debates in All in the Family are staged in the Bunker living room. How are these scenes blocked? Where are wife Edith and daughter Gloria blocked in contrast to patriarch Archie and his son-in-law “Meathead”? How do their placements reveal shifting power dynamics and character relationships that evolve over the course of the series?

Looking at form in this way offers new ways of considering the relationship between television and U.S. culture, an interest of television scholars from the founding of the field (Newcomb 1974). Scholars who have studied network-era television have most often been interested in the ways television has produced or challenged race and gender hierarchies (Lipsitz 1990; Spiegel 1994; Douglas 1995; Acham 2005; Desjardins 2015). Often missing from cultural studies approaches are accounts of form and style. Methods of “distant viewing” allow us to return to these cultural questions with tools that provide new approaches, test old approaches, and develop new knowledge.

Distant Viewing TV makes three interventions in TV studies. First, the project uses computational techniques to expose the complexities of network sitcom form from its beginnings. Rather than assuming TV was simplistic during different periods, we argue that TV’s form changed over time in ways that cannot be reduced to judgments of simple v. complex, or wasteland v. prestige (Newman and Levine 2011). Second, the project analyzes, as never before, formal complexity in early television and situation comedy. Third, the project allows scholars to augment cultural studies approaches with new ways of viewing.

Distant Viewing TV’s initial corpus includes a diverse set of situational comedies from the Network Era of American Television (1952-1985). The corpus contains a total of 2310 episodes, with 947 hours of material, 77 starring characters, and approximately 61 unique scenes. We selected all sitcoms that were one of the top two sitcoms in a given season and in the top ten TV shows overall. This yields a set of fourteen series that include a mix of black and white, color, multiple-camera set-ups, single camera set-ups, film, and video formats. There is also good coverage over the decades and across the networks. A complete list of the shows is given as a table in the appendix.
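The selection rule described above can be expressed as a short filter. The following is only a sketch of that rule under our own assumptions: the field names (`season`, `title`, `overall_rank`, `is_sitcom`) and the placeholder rows are hypothetical and do not reproduce the project's ratings data.

```python
from collections import defaultdict

def select_sitcoms(ratings, per_season=2, overall_cutoff=10):
    """Return titles that were among the top `per_season` sitcoms of some
    season while also ranking within the top `overall_cutoff` shows overall."""
    by_season = defaultdict(list)
    for row in ratings:
        if row["is_sitcom"]:
            by_season[row["season"]].append(row)

    selected = set()
    for rows in by_season.values():
        # Rank the season's sitcoms by their overall position in the ratings.
        ranked = sorted(rows, key=lambda r: r["overall_rank"])
        for row in ranked[:per_season]:
            if row["overall_rank"] <= overall_cutoff:
                selected.add(row["title"])
    return sorted(selected)

# Placeholder rows for illustration only (hypothetical titles and ranks).
ratings = [
    {"season": 1, "title": "Sitcom A", "overall_rank": 1, "is_sitcom": True},
    {"season": 1, "title": "Drama B", "overall_rank": 2, "is_sitcom": False},
    {"season": 1, "title": "Sitcom C", "overall_rank": 4, "is_sitcom": True},
    {"season": 1, "title": "Sitcom D", "overall_rank": 7, "is_sitcom": True},
    {"season": 2, "title": "Sitcom A", "overall_rank": 3, "is_sitcom": True},
    {"season": 2, "title": "Sitcom E", "overall_rank": 12, "is_sitcom": True},
]

selected = select_sitcoms(ratings)
print(selected)
```

In this toy table, Sitcom D is excluded because it is only the third-ranked sitcom of its season, and Sitcom E is excluded because it falls outside the overall top ten.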

Our work attempts to address several types of questions of interest within television studies. These include: How do shot scale, editing, and other formal qualities make series distinctive or illustrate change across the span of a series? How can we best characterize the narrative arc of an episode? Does the typical sequence of locations change throughout the run of a show or across different shows? How is social difference constructed through form? Does formal dominance support or undercut forms of cultural dominance? How does the move from black and white to color film produce changes in form, style, or character dynamics? These questions address key issues of television studies: auteurism, formalism, narrative, and culture.

In order to study key issues in TV studies, the project will develop the DTV Toolkit, which will be built on top of three novel computer vision libraries. All of these libraries make use of deep convolutional neural networks (CNNs), a leading technique in machine learning. Specifically, the project will draw from OpenFace (Amos et al. 2016), Places-CNN (Zhou et al. 2016), and the Colorization library (Zhang et al. 2016).

These specific algorithms were chosen due to their open-source licenses, use of the most up-to-date techniques, and the institutional support behind all three at CMU, MIT, and Berkeley, respectively. OpenFace will be used directly to localize and identify faces in still images. Out of the box, Places-CNN estimates the location of an image from a pre-defined set of 1000 locations. The Colorization library takes a black and white image and produces an estimate of a colorized version. Part of our work in building the DTV Toolkit consists of modifying these three libraries for our specific humanities-centric needs. For example, we will colorize the black and white images and then apply the face detection algorithm on the colorized version of the image, because OpenFace only supports color images. Tweaks to tuning parameters will be needed for this process to run smoothly. The Places-CNN algorithm must also be modified; in place of a fixed set of pre-defined locations, we will have it adaptively learn the specific scene locations present in a given television series.
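The colorize-then-detect ordering can be sketched as a small wrapper. This is an illustration of the control flow only: `colorize` and `detect_faces` are hypothetical callables standing in for the colorization and face detection models, and the stand-in functions below do not use or imitate the actual OpenFace or Colorization APIs.

```python
def is_grayscale(image):
    """True when every pixel has one channel or identical channel values."""
    return all(len(set(pixel)) == 1 for row in image for pixel in row)

def detect_faces_any(image, colorize, detect_faces):
    """Run a color-only face detector, colorizing the frame first if needed."""
    if is_grayscale(image):
        image = colorize(image)
    return detect_faces(image)

# Hypothetical stand-ins so the sketch runs without any model weights.
def fake_colorize(image):
    # Placeholder tint; a real model would estimate plausible colors.
    return [[(px[0], px[0] // 2, px[0] // 3) for px in row] for row in image]

def fake_detector(image):
    # Mimics a detector that rejects black and white input.
    if is_grayscale(image):
        raise ValueError("detector expects a color image")
    height, width = len(image), len(image[0])
    return [(0, 0, width, height)]  # one box: (left, top, right, bottom)

# A 2x2 single-channel frame with made-up intensities.
bw_frame = [[(120,), (200,)], [(90,), (30,)]]
boxes = detect_faces_any(bw_frame, fake_colorize, fake_detector)
print(boxes)
```

Passing the models in as arguments keeps the ordering logic independent of any particular library, which is convenient while tuning parameters for each stage separately.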

An important benefit of convolutional neural networks is the ability to use new, domain-specific data to improve the performance of a generically trained algorithm, a process called transfer learning. A major part of building the DTV Toolkit will be applying transfer learning to tweak the open-source computer vision algorithms to better function on our corpus. This will be done by first hand-labeling a training set of 5000 images and then algorithmically “learning” new weights for the deep neural networks that better adapt to our training set. While developed from Network Era television sitcoms, the completed DTV Toolkit can be applied to any corpus of moving images.
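The idea behind transfer learning can be shown at toy scale: the pretrained layers are held fixed and only a final layer is refit on newly labeled examples. The sketch below is a deliberately simplified illustration, not the project's training procedure. The function `frozen_features` is a hypothetical stand-in for a pretrained network, and the six labeled points are invented for the example.

```python
import math

def frozen_features(x):
    """Stand-in for pretrained layers; these are not updated during refitting."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(features, weights, bias):
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)

def fit_last_layer(samples, labels, epochs=500, lr=0.5):
    """Learn new weights for a final logistic layer by gradient descent."""
    feats = [frozen_features(x) for x in samples]
    weights, bias = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = score(f, weights, bias) - y  # gradient of the log loss
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias

def predict(x, weights, bias):
    return int(score(frozen_features(x), weights, bias) >= 0.5)

# Invented hand-labeled examples: label 1 when the first value is larger.
samples = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7), (0.1, 0.9), (0.6, 0.2), (0.3, 0.8)]
labels = [1, 1, 0, 0, 1, 0]

weights, bias = fit_last_layer(samples, labels)
print([predict(x, weights, bias) for x in samples])
```

In practice the frozen part would be a deep network and the labeled set would be the hand-labeled images, but the division of labor is the same: generic features are reused, and only a small number of weights are relearned from domain-specific data.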

Finally, the project will produce a public website in order to make the findings accessible to a broader audience. While the raw visual material is still under copyright, the extracted time-coded metadata can be publicly shared under transformative use (Thompson 2010). The website will build on the approach taken by Jeremy Butler on the ShotLogger project, where he develops techniques for public scholarship with copyrighted data. Interactive visualizations will allow users to engage in exploratory analysis of the corpus. A detailed description of the website’s contents is laid out in the Work Plan section below.

Environmental Scan

Beginning with the early work of Barry Salt (Salt 1974), quantitative methods for analyzing moving images have focused predominantly on the distribution and patterns of shot lengths. Prominent examples include Yuri Tsivian’s Cinemetrics project (http://www.cinemetrics.lv/), the Arclight Guidebook to Media History and the Digital Humanities (Acland and Hoyt 2016), and Jeremy Butler’s ShotLogger (http://shotlogger.org/). These projects demonstrate the feasibility of distributing extracted metadata from copyrighted materials and the power of computational techniques to extract useful information over a large collection of moving images. However, there is much to be learned from other extractable metadata beyond shot detection.
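As a concrete illustration of the shot-length measurements this tradition relies on, the sketch below derives shot durations from a list of cut timestamps and summarizes them. The timestamps are invented for the example, and the function name `shot_lengths` is our own rather than part of any of the tools named above.

```python
from statistics import mean, median

def shot_lengths(cut_times, end_time):
    """Durations (in seconds) of the shots delimited by the given cuts."""
    boundaries = [0.0] + sorted(cut_times) + [end_time]
    return [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]

# Invented cut positions, in seconds, for a 30-second clip.
cuts = [4.0, 9.5, 12.0, 20.0]
lengths = shot_lengths(cuts, end_time=30.0)

print(lengths)          # the five shot durations
print(mean(lengths))    # average shot length
print(median(lengths))  # median shot length
```

Four cuts produce five shots, so the average shot length here is the clip duration divided by five.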

Relatively few studies and tools have been built for extending the quantitative approach of Salt to other parameters. Limited examples include the (now defunct) Videana: A Software Toolkit for Scientific Film Studies (Ewerth et al. 2009) and the language and color analysis of Burghardt, Kao & Wolff (2016). Distant Viewing TV expands on these by incorporating the significantly richer features of face detection, character disambiguation, and scene classification. Incorporating a much wider and more granular set of features, the project gives a more complete view of the formal decisions made by actors, writers, camera operators, directors, and editors.

Recent scholarship in computer vision and machine learning has developed libraries for extracting many types of metadata from images. As mentioned in the previous section, we plan to tune and incorporate three of these in our project. The need for our DTV Toolkit as a separate library is driven by several factors. Current tools are not directly applicable to black and white images, which leads to the need to combine the power of the colorization neural networks of Zhang et al. (2016) with those for face (Amos et al. 2016) and scene (Zhou et al. 2016) recognition. Also, the current scene classifiers are built to classify into a pre-defined set of categories. Our training process will use unsupervised learning so that the algorithm will automatically determine the scenes present in a given series. Finally, the DTV Toolkit will take explicit advantage of our corpus as a collection of moving images.
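One simple way to picture unsupervised scene discovery is to cluster per-frame feature vectors so that frames from the same set fall into the same group. The sketch below uses k-means as an illustrative technique; the text above does not name a specific algorithm, so this choice, the fixed number of clusters, and the two-dimensional toy features are all assumptions made for the example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(points, k):
    """Deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=20):
    centers = init_centers(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return assign(points, centers), centers

# Toy feature vectors standing in for frames filmed on three different sets.
frames = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.1),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
          (9.8, 0.1), (10.0, 0.0)]

labels, centers = kmeans(frames, k=3)
print(labels)
```

A real system would cluster high-dimensional features taken from a scene-recognition network and would need some way to choose the number of clusters, which this sketch simply takes as given.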


Acland, Charles R. and Eric Hoyt, Editors. The Arclight Guidebook to Media History and the Digital Humanities. REFRAME Books, 2016.

Ajmera, Jitendra, Iain McCowan, and Hervé Bourlard. “Speech/music segmentation using entropy and dynamism features in a HMM classification framework.” Speech Communication 40.3 (2003): 351-363.

Amos, Brandon, Bartosz Ludwiczuk and Mahadev Satyanarayanan. “OpenFace: A general-purpose face recognition library with mobile applications.” 2016.

Baraldi, Lorenzo, Costantino Grana, and Rita Cucchiara. “Measuring Scene Detection Performance.” Iberian Conference on Pattern Recognition and Image Analysis. Springer International Publishing, 2015.

Baughman, James L. “Television Comes To America, 1947-1957.” Illinois History 46.3, 1993.

Baughman, James L. The Republic of Mass Culture: Journalism, Filmmaking, and Broadcasting in America Since 1941. JHU Press, 2006.

Buckland, W. “What Does the Statistical Style Analysis of Film Involve? A Review of Moving into Pictures. More on Film History, Style, and Analysis.” Literary and Linguistic Computing, 23(2): 219-30.

Burghardt, M., Kao, M., Wolff, C. “Beyond Shot Lengths – Using Language Data and Color Information as Additional Parameters for Quantitative Movie Analysis.” In: Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 753-755.

Butler, Jeremy G. “Toward a Theory of Cinematic Style: The Remake.” Morrisville, NC: Lulu, 2003.

Butler, Jeremy G. Television: Critical Methods and Applications, 4th Edition. New York: Routledge, 2012.

Caldwell, John T. Televisuality: Style, Crisis, and Authority in American Television. Rutgers University Press, 1995.

Cervone, Alessandra, et al. “Towards Automatic Detection of Reported Speech in Dialogue Using Prosodic Cues.” Sixteenth Annual Conference of the International Speech Communication Association. 2015.

Cosentino, S., et al. “Automatic Discrimination of Laughter Using Distributed sEMG.” Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 2015.

Dow, Bonnie J. Prime-Time Feminism: Television, Media Culture, and the Women’s Movement Since 1970. University of Pennsylvania Press, 1996.

Ewerth, R., Mühling, M., Stadelmann, T., Gllavata, J., Grauer, M. and Freisleben, B. Videana: A Software Toolkit for Scientific Film Studies. In: Ross, M., Grauer, M. and Freisleben, B. (eds.), Digital Tools in Media Studies – Analysis and Research. An Overview. Bielefeld: transcript Verlag, 2009, pp. 100-16.

Fiske, John. Reading Television. Routledge, 1978.

Horwitz, Jonah. “Visual Style in the ‘Golden Age’ Anthology Drama: The Case of CBS.” Cinémas: Journal of Film Studies 23, no. 2–3 (2013): 39–68.

Kar, T., and P. Kanungo. “A Texture Based Method for Scene Change Detection.” 2015 IEEE Power, Communication and Information Technology Conference (PCITC). IEEE, 2015.

Kumar, Rupesh, Sumana Gupta, and K. S. Venkatesh. “Cut Scene Change Detection Using Spatio-temporal Video Frames.” 2015 Third International Conference on Image Information Processing (ICIIP). IEEE, 2015.

Pulver, Andrew, Ming-Ching Chang, and Siwei Lyu. “Shot Segmentation and Grouping for PTZ Camera Videos.” 10th Annual Symposium on Information Assurance (ASIA 2015). 2015.

Manovich, Lev. The Language of New Media. MIT Press, 2001.

Mittell, Jason. Complex TV: The Poetics of Contemporary Television Storytelling. NYU Press, 2015.

Moore, Barbara, Marvin R. Bensman, and Jim Van Dyke. Prime-Time Television: A Concise History. Greenwood Publishing Group, 2006.

Morley, David. Television, Audiences and Cultural Studies. Routledge, 2003.

Moretti, Franco. Distant Reading. Verso Books, 2013.

Newman, Michael Z. Video Revolutions: On the History of a Medium. Columbia University Press, 2014.

Sanders, Jason, Gabriel Taubman, and John J. Lee. “Background Audio Identification for Speech Disambiguation.” U.S. Patent No. 9,123,338. 1 Sep. 2015.

Salt, Barry. “Statistical Style Analysis of Motion Pictures.” Film Quarterly 28.1 (1974): 13-22.

Silverstone, Roger. Television and Everyday Life. Routledge, 1994.

Spangler, Lynn C. Television Women from Lucy to Friends: Fifty Years of Sitcoms and Feminism. Greenwood Publishing Group, 2003.

Sun, Yi, Ding Liang, Xiaogang Wang, and Xiaoou Tang. “DeepID3: Face Recognition with Very Deep Neural Networks.” arXiv preprint arXiv:1502.00873 (2015).

Thompson, Kristin. Report of the Ad Hoc Committee of the Society For Cinema Studies, “Fair Usage Publication of Film Stills.” 2010.

Tsivian, Yuri, and Gunars Civjans. “Cinemetrics: Movie Measurement and Study Tool Database.” (2011).

Xu, Peng, Lexing Xie, and Shih-Fu Chang. “Algorithms And System For Segmentation and Structure Analysis In Soccer Video.” ICME. Vol. 1. 2001.

Zhang, Richard, Phillip Isola, and Alexei A. Efros. “Colorful Image Colorization.” European Conference on Computer Vision (ECCV), 2016.

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. “Learning Deep Features for Discriminative Localization.” Computer Vision and Pattern Recognition (CVPR), 2016.