Data Journalism Part I

Gentle intro to stats, JavaScript and data


Section 1

This week’s main readings are Chapter 1 of The Data Journalist and Chapter 1 of Precision Journalism by Phil Meyer. You’ll also read these two pieces: How an internet mapping glitch turned a random Kansas farm into a digital hell from Splinter and LAPD’s public database omits nearly 40% of this year’s crimes from the LA Times.

So before you jump into this, please read them.

read me

OK, so you’re ready to get started?

This week we’re going over: the “data journalism” tribe, the definition of data and programming as a second language.

Our tribe

To oversimplify: data journalists are journalists whose primary journalistic activities regularly involve data. But in truth, it’s more than that…

I like the term “tribe” to describe data journalism/ists. It’s not defined by a specific output (like photojournalism or graphics). It’s not defined by a specific style of writing (like narrative journalism). It’s not defined by a specific skillset (like copyediting or design). It’s not defined by a specific topic (like sportswriting). In some senses, it’s hard to define data journalism but we know it when we see it.

Let’s just take the issue of polling. Which of the following is generally accepted as data journalism?

a. Conducting a political poll
b. Conducting a survey to identify the causes of a riot
c. Doing a meta-analysis of polls to predict likely winners

Yes to b and c. But a? Not really. Part of that is because journalistic political polls are typically subcontracted, and as a result the data journalism community doesn’t contain many pollsters. On the other hand, many of us have developed surveys to conduct specific pieces of research.

I put data journalists on a “spectrum” best explained using a “radar chart”1 of their comparative skillsets.

all data journalists comparison radar chart

data reporter radar chart

graphic artist radar chart

news apps developer radar chart

news librarian developer radar chart

The main point here is that data journalists do a ton of types of work, not just one thing. We also have a lot of differing skill sets.

What is data?

To start with, the correct question is, “What are data?”2 because the word ‘data’ is a plural noun. The singular, datum, refers to a single piece of structured information.3 ‘Data’ then is the plural, meaning a group of (similar) pieces of structured information.

xkcd cartoon

Data, most importantly, is simply a bunch of information collected in a structured way by people for a particular purpose. It’s fashionable to talk about data as “biased,” but that term has a specific meaning that is not what its users mean.

It’s more correct to think about data as being “culturally constructed,” with the definitions, measurements and methods of collection all in reference to the people doing it. It’s also important to understand that most data is not collected to answer the questions we’d like to ask of it.

I like to think of data as a very honest person with a perfect memory and not much common sense. He knows what he has seen or been told, but not always what we need to know. He will answer our questions honestly to the best of his ability, even when the answers he is providing are … not … quite … right.

So why do we like data?

The thing that’s important about data is that they are honest (though not always accurate) and because they are structured, they are subject to analysis.

In fact, when we understand it, we can turn a dataset’s weakness – being collected for a purpose different than ours – into a strength. If you understand how it was used by the people collecting it, you can discover truths that its creators might not have wanted to be discovered.

For example, in a piece called “The Strong Arm of the Law,” colleagues and I wrote about how police use a “cover charge” such as resisting arrest or interfering with a public official against people they have hurt or abused. The existence of the charge makes it much harder to prove out a police brutality complaint.

It turns out that a disproportionate share of such cases were dismissed, often the next day. When there were other substantive charges (e.g. resisting arrest and assault), that was less likely. When the person arrested was black, the numbers were even worse.

Trust me when I say, the police department didn’t want to collect the data and wouldn’t share it if they had.

So we gathered data from the municipal court system. Those data were collected to track the outcomes of cases and identify the parties involved. That included the name and race of defendants (so you know which Robert Smith you’re talking about) as well as the disposition of the case.

We then could see the pattern – black folks were disproportionately being charged with those crimes without a substantive companion charge. Everyone – but especially black folks – charged that way was disproportionately likely to have the case dropped.

So we couldn’t say, “the data show police are racist and using these charges to cover up abuse.” What we could say was that police were making a lot of arrests that didn’t stick, and were therefore presumptively unjust. We also could say prosecutors knew it, since they were dropping so many of the cases.

That allowed us to dig into individual cases to prove the violence and bias involved – which we found.

What we’re gonna learn

So if data journalists do lots of different things, and different data journalists do different subsets of those things, what are we going to learn?

  • Data Visualization because it gives us a great way to tell stories and is a highly in-demand skill
  • Data Analysis because good journalism is a search for the truth
  • Programming because it is a toolset we can use for everything data journalists do

For programming we’ll primarily use JavaScript. Here’s why…

We won’t learn everything about JavaScript. We will learn just enough. If you end up wanting to go deeper, here’s a 134-part series you might enjoy.

For data analysis we’ll use tools including Excel, Google Sheets, datalib, jSql (pronounced ‘j squirrel’) and ml.js.

For data visualization, we’ll primarily use the g2 data visualization tool. There are other great tools like d3, but g2 maps really nicely to other important ideas while remaining easier to use. Plus, it has documentation in Chinese.

Programming as a second language

Now my favorite part. Programming for word people.

See, talking about “programming” or “coding” makes it seem pretty far from something most journalists are comfortable with. It sounds technical and hard. Same with programming.

But what about writing a program? A little less scary. After all, most journalists are writers. How about writing in a language computers understand? Another little bit less scary.

Which is why:4


Human vs. computer languages

Human languages have nouns, pronouns, verbs, adjectives, adverbs and prepositions. (There’s other stuff, but take a linguistics class for that.)

Those words are assembled using punctuation into sentences. Sentences have subjects (that act), objects (that are acted upon) and actions. Sometimes they use conjunctions, which combine multiple sentences into one.

Those sentences are assembled into paragraphs, which work together. Often those are combined into larger collections of language such as an article or book.

There are rules for all of this called grammar.

The equivalents in programming languages:

  • a verb refers to an action or activity; a function causes the computer to do something
  • a noun refers to a person, place or thing; a variable stores a value
  • adjectives and adverbs modify nouns and verbs; operators modify variables and functions
  • a sentence describes an event or circumstance; a statement executes a function or defines the state
  • conjunctions create compound sentences; control flow statements or operators create compound statements
  • a paragraph is a group of related sentences; a block is a group of related statements
  • punctuation assembles sentences and paragraphs; it’s used in code to assemble statements and blocks (although the marks tend to be called operators or symbols)
  • spaces, tabs, line breaks, et cetera are used to clarify meaning; programmers call it whitespace and use it the same way5
  • lists are an ordered set of nouns or phrases; arrays are an ordered set of variables and functions
  • grammar sets out the rules of a language; programmers call it syntax
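To make those equivalents concrete, here is a small annotated sketch. Everything in it (the newsroom name, the beats, the describeBeat function) is made up for illustration; it is just one way the parts of speech might line up in JavaScript.

```javascript
// A variable is a noun: it stores a value. (Hypothetical newsroom name.)
let newsroom = 'The Gazette'

// An array is a list: an ordered set of values.
let beats = ['courts', 'schools', 'city hall']

// A function is a verb: it makes the computer do something.
// The { } punctuation wraps a block, like a paragraph of statements.
function describeBeat (beat) {
  // The + operator works on the values by joining them together.
  return newsroom + ' covers ' + beat
}

// Control flow works like a conjunction: it combines statements.
// Here the loop repeats one statement for every item in the list.
for (let beat of beats) {
  // Each line in here is a statement, like a sentence.
  console.log(describeBeat(beat))
}
```

Read it top to bottom the way you would read a paragraph: nouns first, then a verb, then a sentence that uses both.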

Object-oriented programming

In regular language, some nouns refer to simple things like “red,” which describes a color. Others refer to a type of thing (like person) or a specific instance of that category. As a class, people have common properties (e.g. name, hair color, job title) and specific things they can typically do (e.g. talk, move, invent).

Object-oriented languages work the same way. JavaScript is object-oriented.

A specific person, Grace Hopper, one of the first and most important computer programmers, had a name. Her job title was “Rear Admiral,” she had grey hair for much of her life and she could definitely talk. She helped create the COBOL programming language, one of the first “high level” programming languages, designed to be easier for humans to read and write.

Variables are similar: they can be a scalar (containing a number, string, etc.), an array (containing a list) or an object. Typically the generic version of an object is called a prototype (also a class) and defines its standard properties (a type of variable) and methods (a type of function). An instance of a class fills in those properties and inherits the methods.6
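Here is a rough sketch of that idea in JavaScript. The Person class, its property names and its talk method are assumptions made up for this example; they simply mirror the description of Grace Hopper above.

```javascript
// The class is the generic version: it defines the standard properties and methods.
class Person {
  constructor (name, jobTitle, hairColor) {
    // Properties are variables that belong to the object.
    this.name = name
    this.jobTitle = jobTitle
    this.hairColor = hairColor
  }

  // Methods are functions that belong to the object.
  talk (words) {
    return this.name + ' says: ' + words
  }
}

// An instance fills in the properties and inherits the methods.
let grace = new Person('Grace Hopper', 'Rear Admiral', 'grey')

console.log(grace.jobTitle)
console.log(grace.talk('Hello journalism.'))
```

Notice that grace never defines talk herself. She gets it from Person, the same way any person can typically talk.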

Step 1: Hello journalism!

The first thing you’re going to do is open up the “JavaScript console” on your web browser. (Sorry, has to be a desktop web browser.)

  • In Chrome on a Mac: press Option-Command-J
  • In Chrome on a PC: press Shift-Control-J
  • In Safari on a Mac: press Option-Command-C after first enabling the developer menu
  • In Firefox on a Mac: press Option-Command-K
  • In Firefox on a PC: hit Shift-Control-K

That gives you something like this:

js console

(That’s Chrome for the Mac.)

The first thing we’re going to do is execute a function. (Remember, that’s like a verb.)

In your console type: console.log('Hello journalism.')

You should get …

hello journalism results

Did it work?

Tina Fey high fiving a million angels

You’re officially a programmer now. Feels good, right?

Step 2: Use a noun

OK, we’re going to start using simple variables. Remember, those are like nouns. They contain a value.

You’re going to create a variable called journalism and set it equal to rocks!

It’s pretty easy. Just type let journalism = 'rocks!' and hit enter.

This happens…

Kind of disappointing. Now type journalism and hit enter.

Right? You defined journalism as the string of letters ‘rocks!’ Then you recalled it.

Let’s do it some more. Create variables named tv, print and web and set them equal to 'TV news', 'Newspaper journalism' and 'Online reporting'.

Then let’s try printing some things out. To do that, we need to …

Step 3: Execute a function on variables

To use a function on variables we call it and pass them as parameters (think the direct object in a sentence).

Try executing the following:

  • console.log(tv, journalism)
  • console.log(print, journalism)
  • console.log(web, journalism)

Did you get this?

Step 4: Combining variables

OK, since typing two parameters in each time we called console.log was pretty exhausting, let’s learn how to combine them.

Try let tvJournalism = tv + ' ' + journalism and log it to the console. Do something similar for printJournalism and webJournalism. (Notice how we capitalize any words after the first one? That’s called Camel Case and it’s used for variable and function names.)
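If you lost track along the way, here is a short recap sketch of Steps 2 through 4 for the tv variable only. The last two statements use a template literal, which is not covered above; it is shown only as an alternative way to build the same string.

```javascript
// Step 2: simple variables hold values.
let journalism = 'rocks!'
let tv = 'TV news'

// Step 3: pass the variables to a function as parameters.
console.log(tv, journalism)

// Step 4: combine them into one string with the + operator.
let tvJournalism = tv + ' ' + journalism
console.log(tvJournalism)

// A template literal (note the backticks) builds the same string.
let tvJournalismAgain = `${tv} ${journalism}`
console.log(tvJournalism === tvJournalismAgain)
```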

Step 5: Build your own function

What if someone misguidedly doesn’t realize that journalism rocks?

Let’s create a function so they can have their own say.

You create a function by defining its name and the parameter(s) it will take. Let’s try commentOnMedia with a parameter of opinion. It would look like this:

function commentOnMedia (opinion) {
}


function tells us it’s a function, commentOnMedia is the function’s name and the list of parameters is inside the (). The { } punctuation goes around the block of code the function executes.

Try typing commentOnMedia(). What happens?

blank comment on the media function

Not a lot, right? Because the function was empty. Let’s fix that. Create a function containing all the statements to take their opinion, append it to each type of journalism and log all three to the console.

function commentOnMedia (opinion) {
  // your statements go here
}


Here’s what we want it to do…

output from the comment on the media function

Deep breath … What does that look like?

Here's a place to write and edit your function.

Which brings us to this week’s assignment.

Assignment 1

Week 1 problem set


Section 2

Last week you fired up the JavaScript console in your web browser and learned a bit about what’s good and bad when it comes to data. This week, we’ll set up a “development environment” and learn to use JavaScript to test for statistical significance.

This week’s main readings are Chapter 3 of Precision Journalism by Phil Meyer and the AP Stylebook chapter on polls and surveys.

I also want you to watch this video from a helpful Australian woman.

Now, let’s get started.

Typing into the browser console is great and all, but most of you already learned how … annoying … it can be when you start doing complex things like writing functions. Plus, why retype everything every time if we don’t have to?

why, why would you do that?

So what’s a “development environment?” Think of it as a little corner of your computer you set up just for doing what you do. The trick to a dev environment7 is that we set it up in a way that is portable and replicable. In other words, we can set it up anywhere we have access to the tools we need.

To do that, we need some tools.

Our tools

  • Github is a community for sharing and editing code. Set up an account if you don’t have one.
  • Visual Studio Code is a ‘code editor,’ basically just a way to edit files with some extra features for programmers. It’s on the lab machines already.
  • Node.js is a command line program (not in the browser) for running JavaScript programs. It’s on the lab machines already.
  • git is part of the command line tools on the Mac. You can download it for Windows.

  • On the Mac:
    • The terminal is on all Macs8
    • We’ll use the command line developer tools for the Mac. These are on the lab machines already.
  • On the PC

These need a bit of setup.


1. Open up the terminal

On a Mac the Terminal is in the Utilities folder inside the Applications folder (from here on out let’s call that /Applications/Utilities). Once you’re running it, right click on it in the Dock and choose Options -> Keep in Dock so it’s easy to find.

terminal setup

Once you’re in there you can type in commands, which is why we call it the command line.

If you’re on your own Mac, from the terminal window you can install the command line tools by typing xcode-select --install then hitting return and following the instructions. You don’t need to do this in the lab.

2. Configure git to use your github account

You’ll need to run some things on the command line to set up git so it plays nicely with your github account.

Type git config --global user.email "you@example.com" (but with the email address you used for github) and hit return.

Type git config --global user.name "Your Name" (but with your actual name) and hit return.

3. Fork our base project folder on Github (and set it private if you choose)

Go to the project’s page on Github, then hit the “Fork” button in the upper right hand part of the page. If you haven’t signed in, you’ll be prompted to. This is called “forking” (making your own copy of) a “repository” (a set of files stored in git).

fork button location

4. “Clone” that repository to your computer

OK, now we’re going to get these files on your computer.

Click the green “Clone or download” button from your copy of the project. Then click the pasteboard icon to copy its URL.

get the git repo's url

Go to the terminal. If you want your files inside your “Documents” folder type cd documents and hit return. Otherwise, if you want your files on your Desktop, type cd desktop instead and hit return. (Don’t worry, you can move this around later like a normal folder.)

Type git clone then one space then paste the URL you copied and hit return.

clone the repo

Once that’s done, you’re going to see something like this:

clone the repo results

Now type cd jmc-3640-project, hit return. Then type npm install and hit return. This will set up some JavaScript ‘packages’ (other people’s code) we can use. For now, you’ll have to trust me on this.

5. Set up VS Code

It’s on the lab machines. On your own machine download it from the Visual Studio Code website. On a Mac, drag it to your Applications folder. On a PC, run the installer. On a Chromebook, no dice.

Here are some videos to learn more. I, for example, changed the “color theme” using the Code->Preferences->Color Theme menu item.

Now use the Code->Preferences->Extensions menu item and in the ‘Extensions’ pane to the right, search for exec. You’ll install ‘Node Exec’ then choose reload.

Hint: You probably want to add Code to your Dock as well.

6. Open the repo in Code

Use File -> Open Folder... and navigate to the files you downloaded with git.

Next use the View->Explorer menu item to open up the file explorer in the right pane (if it’s not already active). Click on index.js to open the demo file.

Hit F8.

You’ll see something like the following:

node exec runner output

(If instead your computer starts playing iTunes, you need to hold down the fn key with F8.)

Had enough?

grey's anatomy exhaustion

(Probably time to go AFK and have a recreational beverage of your choice. Mine’s coffee, but YMMV.)

Time to do some journalism

OK, now we’re going to analyze some data for a story and make a pointy clicky thing.

Get some data

I want you to create a data folder inside your project and download the following data into it:

Baseball payrolls and World Series appearances 1998 - 2018

Data sources: I typed in the World Series info from Wikipedia and copied and pasted payroll data from this guy.9

If you double click it, it will open in Excel. You can see that it’s a table of data with a header and one row per team per year. Take a look.

But it’s not an Excel file. It’s a “csv” (comma-separated values) file, one of the most generic, most flexible data file structures around. It’s kind of the lingua franca of data. We love csv files.

Now write some code to read it

This is the first time we’re going to use a ‘package’. Basically it’s just some code someone already wrote. Think of it like using a citation to bolster your argument (although a better analogy would be sampling).

We’re going to use a ‘package’ called datalib to translate it from text to JavaScript variables. That’s called ‘parsing’ (thus the name).

In order to use datalib, we need to bring it into our program. We can do this with the require() function. It looks like this. (We already installed it earlier with that npm install command.)

let dl = require('datalib')

Now we’ll use datalib’s csv function to read our csv file.

let data = dl.csv('./data/bball-salaries-1.csv')

The ./ means the current folder (aka directory). The data/ means the data subfolder (subdirectory). Then bball-salaries-1.csv is the file.
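If you are wondering what “parsing” actually involves, here is a bare-bones sketch in plain JavaScript. It is not how datalib does it, and it ignores real-world wrinkles like commas inside quoted values. The team names, column names and numbers are made up for illustration.

```javascript
// A tiny, made-up CSV: a header line, then one line per record.
let text = 'Team,Year,Payroll\nAlphas,2017,100\nBetas,2018,250'

// Split the text into lines, then pull the column names off the top.
let lines = text.split('\n')
let header = lines[0].split(',')

// Turn each remaining line into an object keyed by column name.
let rows = lines.slice(1).map(line => {
  let cells = line.split(',')
  let row = {}
  header.forEach((column, i) => {
    // Store numbers as numbers and everything else as strings.
    row[column] = isNaN(cells[i]) ? cells[i] : Number(cells[i])
  })
  return row
})

console.log(rows)
```

The result is an array with one object per record, which is the same general shape you should see when you log the parsed data.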

Let’s see what that looks like. Add console.log(data) on its own line and hit F8.

scrolling results

What do we do with it?

OK, I hear you asking, what do we do with it? To start with, let’s see if payroll affects the outcome in baseball.

In order to run these tests we’ll be using datalib’s groupby and summarize functions to calculate some data.

That looks a bit like this:

let payrollByPennant = dl.groupby('Pennant')
  .summarize({'*': 'count', Rank: ['mean', 'median', 'stdevp']})
  .execute(data)

console.log('payroll by league pennant status')
console.log(payrollByPennant)

Then comment out those console.log calls and instead get and print out the relevant data.

To get the mean and standard deviation among pennant winners, you’d do this. In these data the pennant winners come second. Later on, we’ll learn to set this order ourselves.

let pennantMeanRank = payrollByPennant[1].mean_Rank
let pennantStdev =  payrollByPennant[1].stdevp_Rank

Now write a console log statement to output that info as:

The mean pennant winner’s payroll was ranked 12.34
The standard deviation of pennant winners’ payroll ranks was 5.67

(Obviously, you want the real numbers.)

Next, do the same thing for the pennant losers. Is the difference between them more than both their deviations multiplied by 1.96 and added together?
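Here is one way that comparison might look in code. The four numbers are made up for illustration only; they are not results from the payroll data, so swap in the values you pulled out of the summary.

```javascript
// Made-up numbers for illustration only, not results from the payroll data.
let winnersMean = 8
let winnersStdev = 2
let losersMean = 16
let losersStdev = 1.5

// How far apart are the two means?
let difference = Math.abs(winnersMean - losersMean)

// Multiply each deviation by 1.96, then add the results together.
let threshold = 1.96 * winnersStdev + 1.96 * losersStdev

// true means the gap is wider than the combined deviations.
console.log(difference > threshold)
```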

We can do the same kind of thing by changing the parameter to groupby. An array (e.g. ['Pennant', 'WS']) will create groups by each unique combination. An empty call will generate an overall summary.
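To demystify what grouping and summarizing mean, here is a plain-JavaScript sketch of the same idea. It is not datalib’s implementation. The records and the 'Y'/'N' values are made up for illustration, and summarizeGroup is a hypothetical helper.

```javascript
// Made-up records for illustration; the real rows come from the csv file.
let records = [
  { Pennant: 'N', Rank: 20 },
  { Pennant: 'N', Rank: 10 },
  { Pennant: 'Y', Rank: 4 },
  { Pennant: 'Y', Rank: 8 }
]

// Step 1: sort each record's Rank into a bucket named for its Pennant value.
let groups = {}
for (let record of records) {
  if (!groups[record.Pennant]) groups[record.Pennant] = []
  groups[record.Pennant].push(record.Rank)
}

// Step 2: summarize a bucket with a count, a mean and a population standard deviation.
function summarizeGroup (values) {
  let count = values.length
  let mean = values.reduce((sum, value) => sum + value, 0) / count
  let variance = values.reduce((sum, value) => sum + (value - mean) ** 2, 0) / count
  return { count: count, mean: mean, stdevp: Math.sqrt(variance) }
}

for (let key of Object.keys(groups)) {
  console.log(key, summarizeGroup(groups[key]))
}
```

Grouping by more than one column works the same way, except each bucket is named for a combination of values instead of a single one.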

Now it’s time for the Week 2 problem set.

Section 3

This week we will:

  • learn more about summarizing data
  • make our first visualization
  • identify our own stories in a dataset

Data: Filtering, Mapping and Binning

In Section 2, we learned to use the datalib package to create summaries for groups of records. Now we’re going to learn how to use built-in Array methods and additional methods of the dl object to ramp up our analytical prowess.

Part of how we’ll do that is by learning to read the API documentation for datalib. You can find it on the datalib’s Github repo.

What’s an API?
API stands for “application programming interface.” In short that means how you, as a programmer, access the various capabilities and contents of what you’re dealing with. While that often means a package, other things have APIs. The term is commonly used to describe how you access data and take actions programmatically on a web service. For example, you can use Twitter’s API to search for data, check your timeline or even make posts.

Step one, get the data. For the tutorial, we’ll use a spreadsheet of University of Iowa employee salaries from the 1998 and 2018 fiscal years. More info about the data can be found on the Iowa Legislature’s site, from which I downloaded it.

But before that, we’re going to start a new JS file for this week’s work. Let’s call it salaries.js.

We spent some time in class looking at the spreadsheets directly.

First let’s require datalib and import the csv.

let dl = require('datalib')

let salaries = dl.csv('data/university-of-iowa-1998-and-2018.csv')

Next let’s summarize it with the groupby and summarize methods.

let summary = dl.groupby('Year')
  .summarize({'Salary2': 'sum', '*': 'count'})
  .execute(salaries)
console.log(summary)


(Notice that we use the ⇥ (tab) key to indent the summarize and execute calls.)

Run it.


Notice how different those two total counts are. Let’s try finding how many unique names there are to see if 1998 has a lot of duplication.

let summary = dl.groupby('Year')
  .summarize({'Salary2': 'sum', '*': 'count', 'Name': 'distinct'})
  .execute(salaries)
console.log(summary)


Run it.


In both cases, there don’t seem to be a lot of duplicated names. So, mystery not solved. But at least we don’t have duplicated data.

That print out is kind of ugly isn’t it? Let’s fix that. There’s a relevant method in the datalib documentation.

dl.format.table() takes a dataset and turns it into a text-based table

let summaryTable = dl.format.table(summary)
console.log(summaryTable)

Let’s run it.


Did you notice that there is a “Gender” field in our table? I did. I’d like to know whether the UI has a gender disparity by pay, and if so whether it has gotten better or worse over the past two decades.

The first step is to group by gender. So we’ll update our groupby statement.

let summary = dl.groupby('Year', 'Gender')
  .summarize({'Salary2': 'sum', '*': 'count', 'Name': 'distinct'})
  .execute(salaries)
console.log(dl.format.table(summary))


There are a few records in here (2 in 2018 and 12 in 1998) that are missing gender. In 2018 they’re U (often used for unknown) and in 1998 they’re * (often used for bad or missing values).

Let’s take a look at them. We’re going to use the filter method of Arrays, like our list of salary data.

let missingGender = salaries.filter( row => (row.Gender != 'M' && row.Gender != 'F'))
let missingGenderTable = dl.format.table(missingGender)
console.log(missingGenderTable)

There’s a lot going on here. row => (row.Gender != 'M' && row.Gender != 'F') creates a function that is run on each row of the data. Then we check that the Gender is not equal (!=) to ‘M’ and (&&) that it’s not equal to ‘F’. Then we format the table and log it.
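If the arrow syntax is hard to read, this sketch writes the same test out the long way and then compares it with the short way. The rows are made up for illustration (Name and Gender mirror columns in the salary data), and isMissingGender is a hypothetical name.

```javascript
// Made-up rows for illustration only.
let people = [
  { Name: 'A', Gender: 'F' },
  { Name: 'B', Gender: 'M' },
  { Name: 'C', Gender: 'U' },
  { Name: 'D', Gender: '*' }
]

// The arrow function written out the long way: it takes one row and answers true or false.
function isMissingGender (row) {
  return row.Gender != 'M' && row.Gender != 'F'
}

// filter keeps only the rows for which the function answers true.
let missing = people.filter(isMissingGender)

// The arrow version does exactly the same thing in less space.
let missingAgain = people.filter(row => (row.Gender != 'M' && row.Gender != 'F'))

console.log(missing.map(row => row.Name))
console.log(missing.length === missingAgain.length)
```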


The 2018 data only has two temporary professors with missing gender. The 1998 data seems to have a couple significant earners we’d need to figure out. Especially since they appear to have multiple otherwise identical records under different Classifications.

Still, we’re going to filter them out in the short term. We’ll do that by commenting out the missing gender statements, then writing a similar filter for the salaries before we summarize it. We want values that equal (==) ‘F’ or (||) equal (==) ‘M’.

So before the let summary = ... line add the following: (line added 2/7/2019)

salaries = salaries.filter( row => (row.Gender == 'M' || row.Gender == 'F'))


Now this is pretty interesting. Not only are there a lot fewer salaries listed in 2018, but the gender balance is different. We’d want to figure that out as well.

Meanwhile, let’s look at the means, medians and standard deviations. To do that, we’re going to change the value of ‘Salary2’ from just ‘sum’ to an array of summaries.

let summary = dl.groupby('Year', 'Gender')
  .summarize({'Salary2': ['sum', 'median', 'mean', 'stdevp'], '*': 'count', 'Name': 'distinct'})
  .execute(salaries)
console.log(dl.format.table(summary))


That’s starting to be a bit hard to read, right? For one thing, the order is wonky, right?

Let’s sort. Just like we filtered the records down to those we want by using the .filter() method, we can sort with the .sort() method. Sorting is a bit more complex, but luckily datalib hides a lot of that from us.

We sort using the dl.comparator(sortby) function. We just pass a column name or array of column names in. By default the sorting order is small to large (0 thru 10, a thru z, A thru Z) – but we can reverse it for a given column with a -.



But what if we want to be sure we’re sorting by year, then by gender?
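To show the complexity that gets hidden from us, here is a hand-written comparator in plain JavaScript that sorts by year, then by gender. It is a sketch, not datalib’s code. With datalib you would presumably hand the column names to dl.comparator as an array instead. The rows and the byYearThenGender name are made up for illustration.

```javascript
// Made-up summary rows for illustration only.
let summaryRows = [
  { Year: 2018, Gender: 'M' },
  { Year: 1998, Gender: 'M' },
  { Year: 2018, Gender: 'F' },
  { Year: 1998, Gender: 'F' }
]

// A comparator answers: negative if a goes first, positive if b goes first, 0 for a tie.
function byYearThenGender (a, b) {
  // Compare years first, small to large.
  if (a.Year !== b.Year) return a.Year - b.Year
  // Only when the years tie do we compare gender, a thru z.
  return a.Gender.localeCompare(b.Gender)
}

// sort() rearranges the array in place using the comparator.
summaryRows.sort(byYearThenGender)

console.log(summaryRows)
```

The order of the checks inside the comparator is what makes year the first sort key and gender the tiebreaker.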



Now, remember how we check if two values are statistically significantly different? We can approximate that with the ‘z test’ to see if the means are at least 95% likely to have come from different universes (p < 0.05 that they’re the same).

To do that, we need to start by filtering for a specific year:

let salaries2018 = salaries.filter( row => (row.Year == 2018))

Now we’re going to use datalib’s dl.z.test() method. Doing that requires us to tell it which records belong in each sample. We’ll do that by splitting the data into two separate arrays of just the salaries.

let salariesFemale = salaries2018.filter( row => (row.Gender == 'F')).map(dl.$('Salary2'))
let salariesMale = salaries2018.filter( row => (row.Gender == 'M')).map(dl.$('Salary2'))

let pValue = dl.z.test(salariesFemale, salariesMale)
console.log(pValue)


So we can be more than 99.99 percent sure there is a reason for the difference.
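If you want a feel for what a z test measures, here is a rough plain-JavaScript sketch of one common form of the two-sample z statistic. It is not datalib’s implementation and it stops short of computing a p-value; it just checks whether the z score clears the familiar 1.96 cutoff. The salaries and the helper names (mean, variance, zScore) are made up for illustration.

```javascript
// Made-up salaries for illustration only, not the University of Iowa data.
let sampleA = [52, 48, 50, 51, 49, 50]
let sampleB = [60, 58, 62, 61, 59, 60]

function mean (values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length
}

// Sample variance: the summed squared distances from the mean, divided by n - 1.
function variance (values) {
  let m = mean(values)
  return values.reduce((sum, value) => sum + (value - m) ** 2, 0) / (values.length - 1)
}

// z measures how many standard errors separate the two means.
function zScore (a, b) {
  let standardError = Math.sqrt(variance(a) / a.length + variance(b) / b.length)
  return (mean(a) - mean(b)) / standardError
}

let z = zScore(sampleA, sampleB)

// |z| above 1.96 corresponds to p < 0.05 on a two-sided test.
console.log(Math.abs(z) > 1.96)
```

The bigger the gap between the means relative to the spread inside each group, the bigger the z score and the smaller the p-value.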

What could some of those be? (This will be answered on the quiz.)

Let’s take a look at the variety of job titles there are …

let classifications = dl.groupby('Class')
  .summarize({'*': 'count', 'Salary2': 'mean'})
  .execute(salaries2018)

let classificationsTable = dl.format.table(classifications)
console.log(classificationsTable)


Whoa Nellie! That is a long, messy list. Let’s edit that puppy to sort it and set some options on dl.format.table().

let classifications = dl.groupby('Class')
  .summarize({'*': 'count', 'Salary2': 'mean'})
  .execute(salaries2018)
  .sort(dl.comparator('-count')) // assumed: largest groups first

let classificationsTable = dl.format.table(classifications, {maxwidth: 40, limit: 10})
console.log(classificationsTable)


(code above corrected 2/7/2019)

One step we can take is to control by a classification that includes both job title and seniority. We should probably take one of the top 10 to start with. ‘Custodian I’ looks good.

let femaleCustodians = salaries2018.filter( row => (row.Class == 'Custodian I')).filter(row => (row.Gender == 'F')).map(dl.$('Salary2'))
let maleCustodians = salaries2018.filter( row => (row.Class == 'Custodian I')).filter(row => (row.Gender == 'M')).map(dl.$('Salary2'))
let pvalueCustodians = dl.z.test(femaleCustodians, maleCustodians)
console.log('Custodian I p-value = ' + pvalueCustodians)

(Code updated 2/7/2019)

Did you get that p equals roughly 0.314? So we cannot say with any certainty that there is a difference.

Let’s try another one?

let femaleAssistantProfessors = salaries2018.filter( row => (row.Class == 'Assistant Professor'))
  .filter(row => (row.Gender == 'F'))
  .map(dl.$('Salary2'))
let maleAssistantProfessors = salaries2018.filter( row => (row.Class == 'Assistant Professor'))
  .filter(row => (row.Gender == 'M'))
  .map(dl.$('Salary2'))
let pvalueAssistantProfessors = dl.z.test(femaleAssistantProfessors, maleAssistantProfessors)
console.log('Assistant Professors p-value = '+ pvalueAssistantProfessors)

(Code updated 2/7/2019)


Assignment 3

problem set

  1. I rarely like radar charts; they’re an example of something a bar chart usually does better. But in this case, they do a pretty good job.

  2. Think about it this way, you wouldn’t say, “What is airplanes?” 

  3. Fun fact, datum actually comes from Latin as a form of the word “to give” (do) because a datum is a “given.”

  4. Try decoding that at RapidTables 

  5. JavaScript, which we’re using, doesn’t normally use them 

  6. Prototypes and classes are subtly different in ways that don’t matter to us. But while most languages use only classes, JavaScript uses prototypes and translates its classes into prototypes behind the scenes. 

  7. See what we did there? dev = development (or developer) 

  8. Except that dusty iMac in your parents’ basement. Technically it’s only been part of MacOS since OS X launched in 2001. 

  9. What could possibly go wrong?