Though the TensorView tool itself is novel, it leverages existing technologies without which it could not exist. More specifically, it utilizes Google's TensorFlow library and Street View product, acting as an intermediary between these components to enable the efficient bulk processing of imagery.
Google Street View
Though it is best known as a purveyor of search engines, the Google corporation runs a number of other projects for both public goodwill and profit. One of these, under the brand name Google Street View, has involved the collection and public dissemination of an extremely large body of 360-degree street-level imagery, gathered over the course of a decade (Anguelov et al., 2010). This imagery has been utilized at a small scale for manual "deskside" surveying, and techniques for deriving cityscape information from it have been developed; however, large-scale automated surveying for features of interest using this dataset has not been conducted, and its value as a resource for doing so has gone unexplored (Rundle, Bader, Richards, Neckerman, & Teitler, 2011; Odgers, Caspi, Bates, Sampson, & Moffitt, 2012; Hara, Le, & Froehlich, 2013; Kurka et al., 2016; Torii, Havlena, et al., 2009). TensorView exploits Street View imagery through Google's Internet-facing Application Programming Interface (API) (Google Inc., n.d.), downloading the imagery to disk before performing further operations on it.
Google Street View imagery has certain properties that distinguish it from satellite imagery. Though the term "spatial resolution" does not apply in the typical sense in which it is applied to satellite images, Street View imagery does have a type of spatial resolution, defined by the density of panoramic images in a given area. This density is a function of the speed of the imagery-collection vehicle at the time the image was taken, as the Street View camera orb operates on a timer and produces a new panorama at a fixed interval. Additionally, because Google must perform a field survey to collect the imagery, Street View has a low temporal resolution, on the order of months to years - an example of this can be seen in Figure 3.
Figure 3: Subsequent Street View images of the same location on Clair Road East in Guelph. The left image was taken in 2011; the right, in 2016.
In the Street View API, the imagery consists of 640x640-pixel JPEG images, each corresponding to a "slice" of the panorama with a field of view between 10 and 120 degrees - the field of view and direction of each slice being parameters specified by the user (an example of this can be seen in Figure 4). By downloading an overlapping set of these slices, it is possible to reassemble the entire panorama. This, however, is not typically necessary - as buildings tend to appear at eye level, only the imagery at eye level and looking towards the horizon is downloaded, in order to save time and disk space. Which panoramas are to be downloaded is specified by an ESRI-format shapefile provided by the user.
Figure 4: A demonstration of the "slices" system, using 45-degree slices (Google Inc., n.d.).
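The slice-request mechanism described above can be sketched as a simple URL builder. This is a minimal sketch in Python: the endpoint and the size, pano, heading, fov, and key parameter names are those of the public Street View Static API, while the panorama ID and API key shown are placeholders.

```python
from urllib.parse import urlencode

STREETVIEW_URL = "https://maps.googleapis.com/maps/api/streetview"

def slice_url(pano_id, heading, fov=90, size="640x640", api_key="YOUR_KEY"):
    """Build a Street View Static API request URL for one panorama slice.

    `heading` is the compass direction of the slice centre (0-360 degrees)
    and `fov` its horizontal field of view (10-120 degrees).
    """
    params = {"size": size, "pano": pano_id, "heading": heading,
              "fov": fov, "key": api_key}
    return STREETVIEW_URL + "?" + urlencode(params)

# Four non-overlapping 90-degree slices would cover a full panorama:
urls = [slice_url("EXAMPLE_PANO_ID", h) for h in (0, 90, 180, 270)]
```

Fetching each URL (e.g. with an HTTP client) then yields one JPEG slice per request.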
InceptionV3, TensorFlow, and Neural Network-Based Image Classification
Street View imagery by itself is insufficient for surveying purposes, as it suffers from an issue common to all remotely-sensed data - there is an overwhelming quantity of it. In a typical city, there may be tens of thousands of panoramas, each consisting of as many images as the user cares to download. For a comprehensive survey of an urban area, this can result in hundreds of thousands of separate images. In order to effectively interpret this data, an automated image processing system of some variety must be employed. For the initial proof-of-concept of TensorView, we have utilized the state-of-the-art InceptionV3 image-classification neural network, which is built using Google's TensorFlow library for neural networks (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016; Abadi et al., 2016). It should be noted that InceptionV3 itself is not a Google product, even though it is included "out of the box" with TensorFlow installations.
Like all neural networks, this system utilizes weighted connections between nodes in a directed graph to emulate the connections between neurons in a brain, albeit on a much simpler scale, in order to quickly and repeatably convert an input into a set of outputs that can then be interpreted. In the case of the InceptionV3 network, the input takes the form of a single JPEG image, and the output is an array of values, each of which corresponds to how certain the network is that the image belongs to a certain category (an example of this can be seen in Figure 5). The categories that the network employs are determined when it is trained - a process in which example images (known as the training set) are provided to it along with the categories to which they belong, so that it can "learn" what features distinguish one category from another. How it does this is opaque to the user, meaning that it can learn to focus on the wrong features if the training set is weak or has unnoticed similarities between images (Yudkowsky, 2008).
Figure 5: An example input image with its top 5 categories. In this case, the classifier is approximately 54% sure that the image is of a fast food restaurant.
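The interpretation of the network's output array can be illustrated with a small sketch. The labels and scores below are hypothetical; the actual categories depend entirely on how the network was trained.

```python
def top_categories(scores, labels, k=5):
    """Return the k (label, score) pairs with the highest confidence.

    `scores` is the network's output array, one confidence value per
    category; `labels` gives the category name at each index.
    """
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1],
                    reverse=True)
    return ranked[:k]

# Hypothetical output for the image in Figure 5:
labels = ["fast food restaurant", "gas station", "house", "office", "park"]
scores = [0.54, 0.20, 0.12, 0.09, 0.05]
print(top_categories(scores, labels, k=2))
# → [('fast food restaurant', 0.54), ('gas station', 0.2)]
```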
Because the neural network operates on single images, a software wrapper must be built around it to provide it with a series of them in order to enable large-scale automated classification. This is one of the roles of TensorView - to download the Street View imagery, and to provide each image in turn to the classifier to operate on.
How TensorView Works
In its current implementation, the TensorView tool's operation falls into three distinct "phases": panorama querying, image retrieval, and image classification. These phases are not seen by the user, and instead operate using the provided shapefile and trained neural network. A simple outline of how TensorView functions, and some example outputs, can be seen in Figure 6.
At its simplest, panorama querying consists of taking every point in the input shapefile in turn and querying the Street View API as to whether or not there is a panorama at or near that point. This is necessary because the API does not provide a means of determining what panoramas are within a certain area - merely whether there is a panorama near a certain lat/long co-ordinate pair (Google Inc., n.d.). By saturating the search area with query points, it is possible to locate all of the panoramas within it. The query also returns the geographical co-ordinates and unique IDs of each panorama (panorama IDs), which are recorded to a file on disk. Future implementations will do away with the need for a points shapefile, instead generating the search points automatically from polygons or lines.
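The deduplication of query results by panorama ID can be sketched as follows. This is a minimal sketch that assumes the JSON fields (status, pano_id, location) of the public Street View metadata endpoint; the network requests themselves are omitted.

```python
def collect_panoramas(metadata_responses):
    """Deduplicate Street View metadata query results by panorama ID.

    Each response is the parsed JSON returned by the metadata endpoint
    for one query point. Because nearby query points often resolve to
    the same panorama, only the first occurrence of each ID is kept.
    Returns a dict mapping panorama ID to (lat, lng).
    """
    panoramas = {}
    for meta in metadata_responses:
        if meta.get("status") != "OK":
            continue  # no panorama near this query point
        pano_id = meta["pano_id"]
        if pano_id not in panoramas:
            loc = meta["location"]
            panoramas[pano_id] = (loc["lat"], loc["lng"])
    return panoramas
```

The resulting ID-to-co-ordinate mapping is what would be recorded to disk at the end of the querying phase.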
Figure 6: An overview of the inputs, outputs, and internal mechanics of TensorView.
In the image-retrieval phase, TensorView requests each image in turn from the Street View API and writes it to disk. By naming each image after its panorama ID, it is also possible to reference it to a geographical location. How many images are downloaded per panorama, and their particulars, are parameters left up to the user - our example survey, for instance, downloads two images per panorama, facing 90 degrees left and right of the road. Other surveys might choose to download a single image, a full 360-degree set, or overlapping images.
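The per-panorama download plan can be sketched as a pure function. This is a hypothetical helper, assuming the road's direction of travel at each panorama is known; the file-naming scheme shown is illustrative.

```python
def survey_requests(pano_id, road_bearing, offsets=(-90, 90)):
    """Plan the downloads for one panorama: one (filename, heading)
    pair per requested offset from the road's direction of travel.

    Naming each file after its panorama ID preserves the link back
    to the panorama's geographical co-ordinates.
    """
    requests = []
    for offset in offsets:
        heading = (road_bearing + offset) % 360  # wrap to 0-359
        filename = f"{pano_id}_{heading:03d}.jpg"
        requests.append((filename, heading))
    return requests

# A road heading due north (bearing 0): look left (270) and right (90).
print(survey_requests("EXAMPLE_ID", 0))
# → [('EXAMPLE_ID_270.jpg', 270), ('EXAMPLE_ID_090.jpg', 90)]
```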
Once all of the imagery has been downloaded, each image is run through the trained InceptionV3 classifier. If it meets the parameters specified by the user for "interesting", its filename, panorama ID, and geographical co-ordinates are reported back to the user for further manual analysis. At the end of this process, TensorView returns a full list of the images identified as being interesting. Future implementations will automatically write the co-ordinates and other important information to a points shapefile on disk.
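The filtering step can be sketched as follows. This is a minimal sketch in which the target category and threshold stand in for the user-specified "interesting" parameters; the mapping from filename to scores is assumed to come from the classification pass.

```python
def report_interesting(results, target, threshold=0.5):
    """Filter classifier outputs down to images worth manual review.

    `results` maps each image filename to its {category: score} dict;
    an image is "interesting" when the target category's score meets
    the user-specified threshold.
    """
    interesting = []
    for filename, scores in results.items():
        if scores.get(target, 0.0) >= threshold:
            interesting.append(filename)
    return interesting
```

Because each filename embeds its panorama ID, the reported images can be traced back to their geographical co-ordinates for manual analysis.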
In the present implementation, classification is done in a comparatively naïve fashion, with "raw" Street View imagery being provided directly to the image classifier. Though well-suited to some tasks, this approach lacks finesse and is likely to produce false positives. A future goal is the addition of object detection, enabling both "smarter" approaches and the examination of features that are not a prominent part of the image (e.g. mailboxes).