Network analysis, first steps
Since putting together the New York Times identity network, I’ve wanted to look more closely at a larger network of art identities and subjects. I reworked some of the OCLC pipes that pull related identities and associated subjects from an Identity page so that they output something a bit closer to a TouchGraph data file. I then wrapped the whole business in a Processing sketch and had it crawl 100 objects from the Met’s Modern Art department.
After some data cleaning, the network contains:
~3,500 nodes = 1,200 related identities + 2,200 associated subjects + 100 Modern Art records
~7,200 edges = 1,700 -> related identities + 5,500 -> associated subjects
The same terms can appear as both related identities and associated subjects. In the image above, Jasper Johns the associated subject is selected, while Jasper Johns the identity is in the upper left. I’ve color-coded the nodes in the graph (blue for identities, gray for subjects), and the two are distinct nodes in the data.
For a network this large, TouchGraph works well one node at a time, but extending the locality stressed out my machine, and I still wanted to see the whole network. Pajek to the rescue. Below is the 3D force-based layout of only the identities.
Directionality is missing from the images but the edges only go from the numbered nodes (the starting set of Met Modern Art records) to Identity records from OCLC.
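To make the directed structure concrete, here is a minimal sketch of writing record-to-identity arcs in Pajek's `.net` format (`*Vertices` followed by `*Arcs`). This is not the export code used for the images; the function name `to_pajek`, the record numbers, and the weights are all made up for illustration.

```python
def to_pajek(arcs):
    """Return Pajek .net text for a list of (source, target, weight) arcs."""
    labels = []
    index = {}
    for source, target, _ in arcs:
        for label in (source, target):
            if label not in index:
                labels.append(label)
                index[label] = len(labels)  # Pajek vertex ids start at 1

    lines = [f"*Vertices {len(labels)}"]
    for label in labels:
        lines.append(f'{index[label]} "{label}"')

    # "*Arcs" marks directed edges; "*Edges" would mark undirected ones.
    lines.append("*Arcs")
    for source, target, weight in arcs:
        lines.append(f"{index[source]} {index[target]} {weight}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: numbered Met records pointing to OCLC identities.
sample_arcs = [
    ("1", "Picasso, Pablo 1881-1973", 1),
    ("1", "Braque, Georges 1882-1963", 1),
    ("2", "Picasso, Pablo 1881-1973", 1),
]
pajek_text = to_pajek(sample_arcs)
print(pajek_text)
```

Because every arc starts at a numbered record, the direction is preserved in the file even when the rendered layout does not show arrowheads.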
The major nodes are those you might expect. Each associated subject is presented in a tag cloud on the Identity page with a variable font size. I’ve used those sizes as edge weights where appropriate and summed them across the network here.
|Node||Sum of Weights|
|Criticism, interpretation, etc.||377|
|Museum of Modern Art (New York, N.Y.)||
|De Kooning, Willem 1904-1997||
|Picasso, Pablo 1881-1973||
|Pollock, Jackson 1912-1956||
|Rothko, Mark 1903-1970||
|Marin, John 1870-1953||
|Matisse, Henri 1869-1954||
|Weber, Max 1881-1961||
|Braque, Georges 1882-1963||
|Stieglitz, Alfred 1864-1946||
(Ahem, Metropolitan Museum of Art appears only 3 times in the network.)
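The summing behind the table above can be sketched roughly as follows. It assumes each edge is a `(source, target, weight)` tuple, with the weight taken from the tag-cloud font size; the function name `sum_incoming_weights` and the sample values are hypothetical and are not the real data.

```python
from collections import defaultdict


def sum_incoming_weights(edges):
    """Sum edge weights per target node, largest total first."""
    totals = defaultdict(int)
    for _source, target, weight in edges:
        totals[target] += weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical edges: (identity, associated subject, font size used as weight).
sample_edges = [
    ("Identity A", "Criticism, interpretation, etc.", 5),
    ("Identity B", "Criticism, interpretation, etc.", 4),
    ("Identity B", "Exhibitions", 2),
]
ranking = sum_incoming_weights(sample_edges)
for node, total in ranking:
    print(f"{node}: {total}")
```

A subject that appears in many tag clouds at a large font size accumulates a large total, which is why generic headings tend to float to the top.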
I’ve started looking at the network metrics in UCINET and Pajek, but I think something has to be said about validity at this point. What we have is a two-mode network, i.e. a bipartite data set. That is not a problem in itself; there are plenty of ways to look at such data. But the two-mode structure is more an artifact of the data collection method than a reflection of reality. Object records don’t point to one another and, since I didn’t iterate, there are no connections between the collected nodes.
Of course, the validity of the whole data set is pretty dubious. My selection criteria were intentionally broad and uninformed: I picked the top 3 identities from OCLC and then pulled in everything, ignoring rank and weight in the data collection phase. The initial goal of the pipes was to find out more about the quality of the results from OCLC, to see if a simple query would suffice, so the pipe structure will need to change if we want validity. I don’t know nearly enough about how associated subjects are mapped to identities, how an identity is “related” to any others, or for that matter how complete the coverage of Modern Art is in OCLC. Ultimately, any analysis will say more about the OCLC data surrounding books than about the Met’s holdings. I’ll be sure to present the network analysis metrics on a more “complete” dataset.
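One common way to look at a two-mode network is to project it onto a single mode, for example linking two identities whenever the same object record points to both. The sketch below shows that idea under the assumption that edges are plain `(record, identity)` pairs; `project_onto_targets` and the sample data are hypothetical, and this is not an analysis I have run on the real network.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_targets(two_mode_edges):
    """Project (source, target) pairs onto targets.

    Returns {(target_a, target_b): number of sources they share}.
    """
    targets_by_source = defaultdict(set)
    for source, target in two_mode_edges:
        targets_by_source[source].add(target)

    shared = defaultdict(int)
    for targets in targets_by_source.values():
        # Sorting keeps each pair in one canonical order.
        for a, b in combinations(sorted(targets), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical sample: two object records and the identities they point to.
sample_pairs = [
    ("record 1", "A"),
    ("record 1", "B"),
    ("record 2", "A"),
    ("record 2", "B"),
    ("record 2", "C"),
]
projection = project_onto_targets(sample_pairs)
print(projection)
```

The projected weights only count co-occurrence within the crawl, so they inherit every sampling problem described above.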
With all of that criticism about lack of rigor out of the way: wow. With a large enough starting set, the resulting network gets rid of the noise pretty well. I think this network is a good place to start, with a clear direction for improvement.