Data Journalism Part II

Gentle intro to stats, JavaScript and data

Contents

Readings

  • The Data Journalist
    • Chapter 5 (before Part 2.2)
    • Chapter 8 (before Part 2.3)
  • The Grammar of Graphics1

Section 1

This week in which we pick up from Part 1.3 and turn our analysis into a a visualization using the [g2] visualization package. Unless you speak Chinese, you should always open that link in Chrome because the good stuff is all in Chinese. (After all, English speakers have no monopoly on writing of great tools.)

Scaffolding and a bit about HTML

In order to make a visualization appear in the browser, we need to set up a web page. Now we could go about this in a number of ways, but for simplicity we’re going to set this all up in our project folder.

But before we do that, we’ll save our project folder as it currently is by going to the source control tab (third icon from the topj on the side bar) then entering a message (e.g. saving part 1), hitting enter and saying yes to the pop up.

This saves the current files in our project while preserving a history what those files were at the last ‘commit.’

Demo:

Now download this file then unzip it and move the contents into your project file.

Inside should be index.html file, a css folder, a js folder and an images folder. This is a not unusual structure for a web project.

Open the index.html file in your browser and in VS Code and take a look. We’ll come back to HTML in a week or two, but there are some key things to notice.

The name ‘J. Random Student’ appears five times in the HTML: in the title, the description, the author, the headline and the footer.

html marked up

Three of those (title, description and author) are part of the page’s <head>. The other two are inside the page’s <body>.

All about HTML

The <head> ... </head> tag to provides descriptive information about the page. Meanwhile, the <body> ... </body> tag provides the page’s content.

HTML have an open tag (<TAG>) and a close tag (</TAG>). Between those can come plain text, other tags or a mix. Some tags have attributes inside the open tag.

There are a few especially important tags you’ll want to know.

  • In the head
  • title provides the title that shows up in the browser
  • link provides links to key information such as stylesheets (more tk) and no closing tag
  • script contains JavaScript
  • In the body
  • p contains a paragraph
  • a contains the text of a link
    • an href attribute defines where the link goes e.g. <a href="https://google.com">this link</a> produces this link.
  • em puts text in italics
  • strong makes text bold
  • ul contains a bullet list
  • li contains a single bullet item
  • h1,h2,h3,h4,h5 and h6 create headlines and subheds in decreasing order of size and importance
  • img has a src tag with the URL of an image and no closing tag
  • div defines a rectangular area for other tags to populate
  • footer is like a div, but for a page footer content
  • headeris like a div, but for a page header content
  • articleis like a div, but for an article contained on page
  • script can also show up in the body

In addition to the src and href attributes referenced above, there are two attributes worth noting. Every tag can have class and id attributes. class describes a type of entity for use in styling and in JavaScript. id should be unique and allows you to identify a specific tag instance for styling and in JavaScript.

For now: just replace “J. Random Student” wherever it appears with your own name.

Then, before doing anything else, commit all this new work with the message “adding web files”.

New files

Now that we have some of what we need, let’s create a new HTML and a new JS file for our first visualization.

  • Open index.html and use ‘Save As…’ (or Shift Command-A) to create a copy called pay-disparities.html.
  • Create a new file (Command-N) and save it inside the js folder as pay-disparities.js.
  • Go into pay-disparities.html and add the following lines right before the closing body tag (</body>):
    <script src="https://unpkg.com/datalib@1.9.2/datalib.min.js" type="text/javascript"></script>
    <script src="https://unpkg.com/@antv/g2@3.4.10/build/g2.js" type="text/javascript"></script>
    <script src="js/pay-disparities.js" type="text/javascript"></script>
    
  • Inside the div containing the h4 tag add: <div id="payBarChart" class="chart"></div>

What that does is create a web page that loads datalib, g2 and your script and gives us a container (div) in which to put our visualization.

To keep track, you can see the changes I made to the file (along with some optional styling changes) on Github.

Making our graphic

In order to make the graphic, our JS file will need to do TK things:

  1. Load the raw data from the CSV
  2. Process the raw data to give us the data we want to visualize
  3. Configure the chart object
  4. Render the chart object with our data

**Load the raw data from CSV

This part is easy. We’re going to go into pay-disparities.js and begin writing the same kind of code we wrote before. However, this will run in the browser instead of node, so we can skip the “require” lines. (The script tags are handling this for us.)

So …

let rawData = dl.csv('/data/university-of-iowa-1998-and-2018.csv')
console.log(rawData)

But wait … If you open the page in your browser and open the console, you see error messages.

That’s because a browser won’t let JavaScript open files on your hard drive. (This is an important security feature.)

To solve this, we have to do something very clever: we create “local” web server on our computer.

Just go the VS Code’s terminal pane and run http-server. (If you’re on a Mac without node running you can run python -m SimpleHTTPServer instead.)

Now follow this link (this link if you used Python and navigate to your page.

Process the data to give us what we want

As you will recall, at the end of Part 1 we had tried various ways of filtering the salary data. For our purposes, we’re going to look at the pay for full professors.

The value for that for 1998 ‘PROFESSOR’ while for 2018 it’s ‘Professor’. Remember that we use the data’s “filter” method to grab a subset of the records.

let professorSalaries = rawData.filter(row => row.Class.toLowerCase() == 'professor')

(Did you notice how we used the toLowerCase() method on the job classification? All JavaScript strings have it, as well as the toUpperCase() method.)

Next we’re going to group them by year and gender with some useful statistics.

let professorsSummary = dl.groupby('Year', 'Gender')
  .summarize({'*': 'count', Salary2: ['mean','median','min','max','stdevp']})
  .execute(professorSalaries)

Configure the chart object

OK… we’re going to configure a chart. We’ll se the id of its container, its height and whether or not to force fit to the available width.

let config = {
  container: 'payBarChart',
  forceFit: true,
  height: 600
}

Next we’ll create the chart object with G2.

let salaryChart = new G2.Chart(config)

Next we’ll set the chart’s axis, color coding, etc.

salaryChart.axis('Year')
salaryChart.interval()
  .position('Year*mean_Salary2')
  .color('Gender')
  .adjust([{ type: 'dodge' }])
salaryChart.coord().transpose()

Now we load the data and render the chart, including putting in a format for the year.

salaryChart.source(professorsSummary, {
  'Year': { type: 'timeCat', formatter: d =>  String(d) }
})

Render the chart object with our data

the magic

 

salaryChart.render()

 

Assignment

Problem set #4 will challenge you take a similar dataset and do the same kind of analysis we did previously.

Section 2

This week we’ll learn to use Structured Query Language (SQL2) to analyze relational data in a database management system. You’ll really need to read Chapter 5 of The Data Journalist to follow what’s going on here. I expect we’ll do some significant review in class as well.

You’ll also flex your muscles in creating visualizations.

it’s all relative

Chapter 5 should have given you the basics of writing SQL, so I won’t duplicate them. The more critical question is why. Why when we have Excel, when we have JavaScript would we want to delve into this weird little world?

The reason is relational data. Excel and JavaScript packages like datalib assume that we get data in a specific rectangular format: one record per line with all the data we need (generally for a single (observation)[https://stats.oecd.org/glossary/detail.asp?ID=6132]) already associated with it.

That leaves some problems, though. We saw that with the salary data … what happens when we have more than one relevant entry for a single person. How do we sort it out? How are we even sure they’re the same person?

What if we have two databases, one of faculty salaries and one of average faculty ACE ratings? We would hope that the two databases shared a unique key for the faculty member and join them. If not, we might try to create one by combining full name with other identifying information.

That’s how in the 1980s Providence Journal reporter Elliott Jaspin found that numerous school bus drivers had been involved in drunken driving accidents and that who had convictions for drug dealing.

Setting up

we’re going to use a database format SQLite3 and we’re going to query it with a tool called DB Browser for SQLite.

Because of how it was designed, there are lots of tools that can be used to access SQLite files. In fact, we can query them from our JavaScript programs both in Node and in the browser. For example, web browsers generally store your bookmarks and history in a SQLite database hidden away on your computer.

DB Browser is easy to use. On your Mac, you can download it, open the .dmg file and drag the program to your Applications folder. On a lab Mac, you can drag it to your desktop and run it from there.

On Windows PCs, you should probably use the 64-bit Windows version. If there’s trouble with that, download the Portable Apps version to your desktop and run it from there.

Grab some data

There a lots of relational data sets out there. We’re going to take a look at the Fatality Analysis Reporting System that tracks fatal traffic accidents nationwide.

To do that, you’re going to go to the FARS website,follw the link titled ‘Download Raw Data from FTP Site’ then navigate to the 2017 directory, the National directory and the file ‘FARS2017NationalCSV.zip’. (Did you notice the many, many years of data available?)

Once you download and unzip that, rename the resulting directory fars2017 and put it inside the data subdirectory in your project folder. (If you’re not using one, now is definitely the time to start.)

First thing lets do is load the data in…

  • Run DB Browser for SQLite. (From here on, let’s just call it DB Browser.)
  • Create a New Database (file menu) inside the fars2017 directory named fars2017.db.
  • Go to the File menu, then Import then ‘Table from CSV file …’
  • Navigate to your fars data and select ALL the csv files.
  • On the resulting import wizard make sure the ‘Separate tables’ checkbox is checked.
  • Click OK.

Now wait a bit, you should end up with 26 tables.

Before doing anything else, spend 10 to 15 minutes just looking at the data in each file, trying to figure out what’s in there. While you’re doing it, you might want to look at some of the documentation:

Ask a question

Our question is:

What how many of those killed in OWI accidents (in Iowa) are not the intoxicated drivers, and what percent is that?

Seems simple, right?

So how do we break it down?

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

1. How many fatal OWI accidents are there?
SELECT COUNT(*)
FROM accident
WHERE DRUNK_DR = '1'

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

2. How many fatalities are there?
SELECT COUNT(*), SUM(FATALS)
FROM accident
WHERE DRUNK_DR = '1'

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

3. How many total people were involved?
SELECT COUNT(*), SUM(FATALS), SUM(PERSONS)
FROM accident
WHERE DRUNK_DR = '1'

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

4. What percent are fatalities?
SELECT COUNT(*), SUM(FATALS), SUM(PERSONS), SUM(FATALS)/SUM(PERSONS)
FROM accident
WHERE DRUNK_DR = '1'

Huh? By default, SQLITE is treating those sums as integers, so anything less than 1 gets rounded down to 0. Luckily there is the CAST function

SELECT COUNT(*), 
  SUM(FATALS), 
  SUM(PERSONS), 
  CAST(SUM(FATALS) AS REAL)/SUM(PERSONS)
FROM accident
WHERE DRUNK_DR = '1'

Why add the line spacing? Ease of reading. In SQL, whitespace is not significant.

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

4. What percent are the drivers?

shrug

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

⬇️

Now we understand the relation. Each accident can have many people involved. So lets join the tables.

SELECT COUNT(*), 
  SUM(FATALS), 
  SUM(PERSONS), 
  CAST(SUM(FATALS) AS REAL)/SUM(PERSONS)
FROM accident INNER JOIN ON accident.STATE = person.STATE AND accident.ST_CASE = person.ST_CASe
WHERE DRUNK_DR = '1'

Why was that different?

!(shrug)[https://media.giphy.com/media/xT9Igqz02TEWXq0wMw/source.gif]

Because we have one result row for each person but we’re adding data about the accident.

Now we need multiple queries

``` SELECT COUNT(*)

Section 3

This week we’ll put some intellectual rigor behind our use of visualizations. We’ll also dive a bit more into HTML/CSS. You’ll read Chapter 8 of The Data Journalist and the intro to The Grammar of Graphics.


  1. The full text of this $150 book is available, courtesy of UI Libraries. 

  2. Some people pronounce it ESS-queue-ell, others pronounce it sequel. For once, there’s no right or wrong answer. 

  3. This is sometimes also called sqlite or sqlite3.