The Programming Historian (Feb 2018)
Dealing with Big Data and Network Analysis Using Neo4j
Abstract
In this lesson we will learn how to use a graph database to store and analyze complex networked information. Networks are all around us. Social scientists use networks to better understand how people are connected. This information can be used to understand how things like rumors or even communicable diseases spread through a community.

The pattern of relationships that a person maintains with others, as captured in a network, can also be used to make inferences about that person’s position in society. For example, a person with many social ties is likely to receive information more quickly than someone who maintains very few connections with others. In common network terminology, a person with many ties is more central in a network, and a person with few ties is more peripheral. Having access to more information is generally believed to be advantageous. Similarly, if someone is well connected to many other people who are themselves well connected, then we might infer that this individual has a higher social status.

Network analysis is also useful for understanding the implications of ties between organizations. Before he was appointed to the Supreme Court of the United States, Louis Brandeis called attention to how anti-competitive activities were often organized through a web of appointments that had directors sitting on the boards of multiple ostensibly competing corporations. Since the 1970s sociologists have taken a more formal network-based approach to examining these so-called corporate interlocks, which exist when directors sit on the boards of multiple corporations. Often these ties are innocent, but in some cases they can indicate morally or legally questionable activities.
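The ideas of corporate interlocks and centrality described above can be made concrete with a very small sketch. The Python below is only an illustration in plain data structures, not the approach this lesson takes with Neo4j and Cypher. The director and company names, and the helper functions `find_interlocks` and `degree`, are all hypothetical and invented for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical board memberships (invented for illustration):
# each director is mapped to the companies whose boards they sit on.
boards = {
    "Director A": ["Acme Rail", "Union Steel"],
    "Director B": ["Union Steel", "First Bank"],
    "Director C": ["Union Steel", "Harbor Shipping"],
    "Director D": ["Acme Rail", "Union Steel"],
    "Director E": ["Prairie Grain"],
}


def find_interlocks(memberships):
    """Map each pair of companies to the set of directors they share."""
    interlocks = defaultdict(set)
    for director, companies in memberships.items():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(set(companies)), 2):
            interlocks[pair].add(director)
    return dict(interlocks)


def degree(interlocks):
    """Count how many distinct companies each company is tied to."""
    neighbors = defaultdict(set)
    for a, b in interlocks:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {company: len(others) for company, others in neighbors.items()}


interlocks = find_interlocks(boards)
degrees = degree(interlocks)

# List companies from most central to most peripheral by number of ties.
for company, ties in sorted(degrees.items(), key=lambda item: -item[1]):
    print(company, ties)
```

In this toy data, the company with the most interlock ties is the most central by this simple count, while a company whose directors sit on no other board has no ties and does not appear in `degrees` at all. Counting ties is only one of several ways to measure centrality.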
The recent release of the Paradise Papers by the International Consortium of Investigative Journalists, and the ensuing news scandals throughout the world, show how important understanding relationships between people and organizations can be.

This tutorial will focus on the Neo4j graph database and the Cypher query language that comes with it.

- Neo4j is a free, open-source graph database written in Java that is available for all major computing platforms.
- Cypher is the query language for the Neo4j database; it is designed to insert and select information from the database.

By the end of this lesson you will be able to construct, analyze and visualize networks based on big (or just inconveniently large) data. The final section of this lesson contains code and data to illustrate the key points of this lesson.