Live Training post!

Check out my latest shiny app!

https://arilamstein.shinyapps.io/final/


Tutorial Recap: Analyzing Census Data in R

A big thanks to Gabriela de Queiroz for organizing the San Francisco R-Ladies meetup, where I spent a few hours yesterday introducing people to my census-related R packages. A special thanks to Sharethrough as well, for letting us use their space for the event.

It was my first time running a tutorial like this, and I spent a while thinking about how to structure it. I decided to have everyone analyze the demographics of the state, county and ZIP Code where they are from, and share the results with their neighbor. I think that this format worked well – it kept people engaged and made the material relevant to them.

Several people wanted to participate but were unable to make it. A common question was whether the event was going to be live streamed. While it was not streamed, the slides are now available on the github repo I created for the talk (see the bottom of the README file). As always, if you have any questions about the packages you can ask on the choroplethr google group.

Personally, I’d like to see more live tutorial sessions in the R community. There are tons of interesting niches in the R ecosystem that I would like to know more about. If you have a tutorial that you’d like to run, I suggest contacting the organizer of your local R meetup. If you live in the SF Bay Area that is probably the Bay Area R User Group and R-Ladies.

The main challenge I found was getting everyone’s computer set up properly for the event. I created a script that I asked people to run before the event which installed all the necessary packages. That script helped a lot, but there were still some problems that required attention throughout the night.
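
For anyone planning a similar event, here is a minimal sketch of what such a setup script can look like. This is an illustration rather than the actual script I distributed, and the package list is an assumption:

# Hypothetical pre-event setup script: install any missing CRAN packages
pkgs    = c("choroplethr", "choroplethrMaps", "ggplot2", "devtools")
missing = pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) {
  install.packages(missing)
}

# choroplethrZip is not on CRAN, so install it from github
devtools::install_github("arilamstein/choroplethrZip")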

Need help with an R related project? Send me an email at arilamstein@gmail.com.

Upcoming Tutorial: Analyzing US Census Data in R

Today I am pleased to announce that on May 21 I will run a tutorial titled Analyzing US Census Data in R. While I have spoken at conferences before, this is my first time running a tutorial. My hope is that everyone who participates will learn something interesting about the demographics of the state, county and ZIP code that they are from. Along the way, I hope that people become comfortable doing exploratory data analysis with maps, learn a bit about geography and leave with a better understanding of how US Census data works. Here is the description:

In this tutorial Ari Lamstein will explain how to use R to explore the demographics of US States, Counties and ZIP Codes.  Each person will analyze their home state, county and ZIP code. There will be an emphasis on sharing results with each other. We will use boxplots and choropleth maps to visualize the data.

Time permitting, we will also explore historic demographic data, learn more about the data itself, and learn how to use the Census Bureau’s API.
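
For a flavor of this kind of analysis (an illustrative snippet, not the tutorial’s actual code), choroplethr can map its bundled state population estimates in a few lines:

# Illustrative only: map the state population estimates that ship with choroplethr
library(choroplethr)
data(df_pop_state)
state_choropleth(df_pop_state, title = "2012 State Population Estimates")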

Attendance is free. If you are interested and can make it to the event, then I hope to see you there!

Note that the current draft of my slides is available on github. After the talk I plan to publish my slides on slideshare, where I have placed slides from my previous talks.

choroplethr v3.1.0: Better Summary Demographic Data

Today I am happy to announce that choroplethr v3.1.0 is now on CRAN. You can get it by typing the following from an R console:

install.packages("choroplethr")

This version adds better support for summary demographic data for each state and county in the US. The new functionality comes in the form of two data.frames and two functions. The data.frames are listed below, with a short example of loading them after the list:

  • ?df_state_demographics: eight values for each state.
  • ?df_county_demographics: eight values for each county.
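
For example, here is one way to load and inspect the state-level data.frame:

# Load the bundled state-level summary data and look at its structure
library(choroplethr)
data(df_state_demographics)
colnames(df_state_demographics)
head(df_state_demographics)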

These statistics come from the US Census Bureau’s 2013 5-year American Community Survey (ACS). If you would like the same summary statistics from another ACS, you can use these two functions (a sketch of their use follows the list):

  • ?get_state_demographics
  • ?get_county_demographics
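
Here is a sketch of how they can be used. Note that the argument names below are my assumption; see the help pages for the exact signatures.

# Sketch: pull the same eight statistics from a different ACS
# (the endyear/span arguments are an assumption -- see ?get_state_demographics)
library(choroplethr)
df_state_2011 = get_state_demographics(endyear = 2011, span = 5)
head(df_state_2011)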

For more information on the ACS and choroplethr’s support for it, please see this page.

Relation to Previous Work

In many ways this update is a continuation of work that began with my April 7 guest blog post on the Revolution Analytics blog. In that piece (Exploring San Francisco with choroplethrZip) I explored the demographics of San Francisco ZIP Codes. Because of the interest in that piece, I subsequently released the data as part of the choroplethrZip package. This update simply brings that functionality to the main choroplethr package.

Note that caveats apply to this data. ACS data represent samples, not full counts. I simplify the Census Bureau’s complex framework for race and ethnicity by including only White not Hispanic, Asian not Hispanic, Black or African American not Hispanic and Hispanic all Races. I chose simplicity over completeness because my goal is to demonstrate technology.

Explore the Data Online

You can explore this data with a web application that I created here. The source code for the app is available here. This app demonstrates some of my favorite ways of exploring demographic data (a rough stand-alone sketch of two of them follows the list):

  • Using a boxplot to explore the distribution of the data
  • Exploring the data at both the state and county level
  • Using choropleth maps to explore geographic patterns of the data
  • Allowing the user to change the number of colors used:
    • 1 color uses a continuous scale, which makes outliers easy to see
    • Using 2 through 9 colors puts an equal number of regions in each color. For example, using 2 colors shows which values are above and below the median
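
Here is a rough stand-alone sketch of two of these views. This is not the app’s source code, just the same ideas expressed directly in choroplethr:

# Sketch: a boxplot of the distribution, then a continuous-scale map of the same statistic
library(choroplethr)
data(df_county_demographics)

# 1. distribution of per capita income across counties
boxplot(df_county_demographics$per_capita_income,
        main = "Per Capita Income by County")

# 2. continuous-scale choropleth (num_colors = 1) makes outliers easy to see
df_county_demographics$value = df_county_demographics$per_capita_income
county_choropleth(df_county_demographics,
                  title      = "2013 County Per Capita Income",
                  num_colors = 1)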

In my opinion, datasets like this really lend themselves to web applications because there are so many ways to visualize the data, and no single way is authoritative.

Selected Images

One of my biggest surprises when exploring this dataset was to discover its strong regional patterns. For example, the regions with the highest percentage of White not Hispanic residents tend to be in the north central and north east. The regions with the highest percentage of Black or African American not Hispanic residents are in the south east. And the regions with the highest concentration of Hispanic all Races are in the south west:

[Map: Percent White not Hispanic by state]

[Map: Percent Black not Hispanic by state]

[Map: Percent Hispanic by state]

Switching to counties shows us the variation within each state. And switching to a continuous scale highlights the outliers.

[Map: Percent Black not Hispanic by county, continuous scale]

[Map: Percent Hispanic by county, continuous scale]
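
If you would like to explore these patterns yourself, here is a sketch that reproduces one of the state-level maps from the bundled data (column names as documented in ?df_state_demographics):

# Sketch: map percent White not Hispanic by state using the bundled data
library(choroplethr)
data(df_state_demographics)
df_state_demographics$value = df_state_demographics$percent_white
state_choropleth(df_state_demographics,
                 title  = "Percent White not Hispanic by State",
                 legend = "Percent White")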

choroplethrZip v1.3.0: easier demographics, national maps

Introduction

choroplethrZip v1.3.0 is now available on github. You can get it by typing

# install.packages("devtools")
library(devtools)
install_github('arilamstein/choroplethrZip@v1.3.0')

Version 1.3.0 has two new features:

  1. Data frame df_zip_demographics contains eight demographic statistics about each ZIP Code Tabulated Area (ZCTA) in the US. Data comes from the 2013 5-year American Community Survey (ACS).
  2. Function ?get_zip_demographics will return a data.frame with those same statistics from an arbitrary ACS.
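
Here is a sketch of the second feature. The argument names are an assumption on my part; see ?get_zip_demographics for the actual signature.

# Sketch: fetch the same ZCTA statistics from a different ACS
# (the endyear/span arguments are an assumption -- see ?get_zip_demographics)
library(choroplethrZip)
df_zip_2011 = get_zip_demographics(endyear = 2011, span = 5)
head(df_zip_2011)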

Data

Here is how to access the data:

library(choroplethrZip)
data(df_zip_demographics)

?df_zip_demographics
colnames(df_zip_demographics)
 [1] "region" "total_population" "percent_white" 
 [4] "percent_black" "percent_asian" "percent_hispanic" 
 [7] "per_capita_income" "median_rent" "median_age"

summary(df_zip_demographics[, "total_population"])
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
 0    721     2802   9517 13000   114700

Mapping the Data

Here is a program which will create national maps of the data:

# for each column in the data.frame
for (i in 2:ncol(df_zip_demographics))
{
  # set the value and title
  df_zip_demographics$value = df_zip_demographics[, i]
  title = paste0("2013 ZCTA Demographics:\n",
                 colnames(df_zip_demographics)[i])

  # print the map
  choro = zip_choropleth(df_zip_demographics, title = title)
  print(choro)
}

Note that national zip maps can take a few minutes to render. Here is the output.

[Eight national ZCTA maps: total_population, percent_white, percent_black, percent_asian, percent_hispanic, per_capita_income, median_rent, median_age]

New Vignette

Additionally, I have created a new vignette about these features.

Advice to JETs on Learning Japanese

I recently spoke to an outbound JET and was reminded of my own experience on the program (Mie-ken, 2000-2002). I gave him some tips on learning the language and then realized that my advice might be useful to other JETs as well. I hope that by publishing this advice I can help participants on the JET program make progress in their study of the language.

My main advice is:

  1. Hire a professional Japanese teacher
  2. Use a flashcard program

Additionally I recommend:

  1. Passing a proficiency exam
  2. Trying Remembering the Kanji

The rest of this article explains how these points have helped me in my own studies of the language, both while in Japan and after.

Hire a Professional Japanese Teacher

When I arrived in Japan I did not know Japanese. I also did not know how to learn Japanese. My fellow Assistant Language Teachers (ALTs) recommended that I self-study with a textbook between the classes that I was teaching. They also placed a premium on learning kanji, and used the number of kanji someone knew as a yardstick for their overall proficiency with the language.

I followed this advice for about a year and then signed up for an intensive three-week class at a Japanese language school. Before classes began I took a placement exam and was placed in the beginner class. This placement bothered me, though, because I had been self-studying for a year. Shouldn’t I at least be in the intermediate class? When I mentioned this to my teacher she invited me to discuss it with her – in Japanese.

This shocked me for two reasons. The first was that I had never had a conversation like that in Japanese before. As an ALT most of my discussions with Japanese people were in English. Speaking in Japanese at work could even be considered a breach of etiquette, since my job was to speak in English. This meant that, by and large, my time speaking Japanese was limited to things like asking for directions and ordering food in a restaurant. Until that point, no one had asked me to defend an opinion, or persuade them to take a course of action, in Japanese.

The second surprise was that I couldn’t do it. All the kanji I had learned and all the times I had asked for directions did not prepare me for the moment when someone would patiently wait for me to add more detail to an answer.

That moment was 15 years ago, but I remember it like it was yesterday. I learned a lot during those three weeks. But what I remember most is that moment, when I realized how much a professional Japanese teacher had to teach me.

Luckily, you no longer need to visit an intensive language school to get access to a professional Japanese teacher. You can simply take lessons online. I have taken lessons at both iTalki and the Japanese Online Institute (JOI) and can recommend them both based on my personal experience. What really surprises me is how cheap they are. At the time of this writing trial lessons on iTalki are as low as $5 for 30 minutes. And JOI offers a trial of three 50-minute lessons for $9.

I recommend finding a professional teacher that works for you and taking weekly lessons.

Use a Flashcard Program

Like most JETs, when I came back home I had fewer opportunities to use Japanese. Eventually my Japanese atrophied.

I only began studying it again last year. I was originally hesitant, and thought that it might not be possible to learn Japanese outside of Japan. It turns out that I was wrong. In fact, due to the development of flashcard apps I am now finding it easier to increase my Japanese vocabulary than when I lived in Japan.

The program I use is Anki. My favorite feature of Anki is its use of spaced repetition. In short, each day Anki decides which flashcards I should review that day. It does this by keeping track of my performance on each card. The better my performance on a card, the less often I need to review it. This allows me to minimize the number of reviews I do each day. Note that in addition to using Anki for vocabulary, you can also use it for grammar. For example, you can have a card where one side is a sentence in English, and the other side has the Japanese translation of that sentence.

An additional benefit of Anki is its mobile app. Because of this I can review flashcards while commuting to work or waiting on line in a store.

Lastly, many people like Anki because of its shared deck feature. For example, it’s possible that someone has already created a deck with the vocabulary from a textbook that you are using. If so, you can easily import that deck into your copy of Anki.

Last year I attended the annual Japan.R conference in Tokyo and gave a 30 minute presentation in Japanese on work I had done on statistical software. I did a lot to prepare for the talk, but the foundation of my preparation was weekly lessons with a professional teacher and daily usage of Anki.

Pass a Proficiency Test

Studying a foreign language is a long process and it helps to have some milestones along the way. To help with this I recommend taking the Japanese Language Proficiency Test (JLPT). The test is offered twice a year in Japan and has five levels: N5 is the easiest and N1 is the hardest. Oftentimes N2 and N1 are required for working in a Japanese language environment. Many JETs arrive in Japan speaking very little Japanese and do not use Japanese after leaving the program in a year or two. Because of this, I recommend focusing on the easier levels of the exam. I personally found signing up for the exam to be motivating.

When I was in Japan I was aware of the JLPT, but did not have a good idea of how the levels corresponded to real world skills. Here are some links to help with that:

  1. Official JLPT Can-do List. Results of a survey of people who passed each level of the JLPT. They were asked questions such as “Can you write a simple self-introduction in Japanese?” and “Can you understand definitions in Japanese-Japanese dictionaries?”
  2. Summary of Linguistic Competence. This is how the test organizers describe what it means to pass each level of the exam.

The JLPT is a pass/fail exam, and each level has a very low pass rate.  You can see the pass rates on past exams by clicking the “details” button on this page. I am not sure why the pass rates are so low, but I wish that they were higher. I suspect that if more students worked with professional teachers then the pass rates would indeed be higher. A teacher can both help you select a passable level and help you to prepare for it. For many professional Japanese teachers, preparing students to pass the JLPT is a large portion of their business.

In my case I took my first JLPT exam last December (literally the day after I gave my presentation). I took the N4 and passed. Many people told me that N4 would be too easy for me and encouraged me to take the N3 instead. They were wrong. In particular, I found the grammar section to be extremely challenging.

I left the test realizing that if I wanted to improve my Japanese I should focus more on grammar. In that sense, the feedback from the exam was invaluable.

Try Remembering the Kanji

I mentioned earlier that, in retrospect, I spent too much time learning kanji in my first year. I still believe that. However, I do wish that I had discovered James Heisig’s book Remembering the Kanji earlier in my studies. The introduction to the book, which is available online, is a great read in and of itself.

The core of his system is to assign a keyword to each kanji, and exploit the fact that many kanji are built as combinations of other kanji. You can then (at least sometimes) easily make stories for the keyword of each kanji based on the keywords of its primitives. Once you have a story based on the primitives, it becomes very easy to write the character from memory. He does this for each of the 2,136 kanji in the joyo kanji list.

While I generally like the system, I do wish that he had written a companion version that did this for just the kanji relevant to the various JLPT levels. As it turns out, people have done this online at the Japanese language forum koohii; they call it “RtK Lite”.

Conclusion

The JET program presents a unique situation for learning Japanese. Many people are placed in rural locations in Japan for 1-3 years with no prior knowledge of Japanese and no access to professional Japanese teachers. Their jobs require them to speak English and, even outside of work, many people will encourage them to speak English. Additionally, many JETs will not use Japanese after leaving the program.

As a result, many JETs leave Japan having attained a lower level of proficiency with the language than they would have liked. I hope that the above advice helps JETs attain whatever level of proficiency they aspire to.

choroplethr v3.0.0 is now on CRAN

Today I am happy to announce that version 3.0.0 of my R mapping package choroplethr is now available from CRAN. To get it, you can simply type:

install.packages("choroplethr")

from an R console. If you don’t know what any of this means then I recommend taking Coursera’s excellent R Programming class 🙂

The most notable change in version 3.0.0 is that all functionality for handling zip codes has been deprecated and moved to a new package, choroplethrZip. I will have a separate post about that package later; here I just want to highlight changes to the main choroplethr package.

This change required me to break backwards compatibility with previous versions of the package. This does not happen very often, so I took the opportunity to make other significant changes which had been building up.

A common request has always been to zoom further in. To support this for the county_choropleth function I broke the zoom parameter into two separate parameters: state_zoom and county_zoom. As an example, let’s make two choropleth maps:

  1. The population of all counties in New York State
  2. The population of all counties in New York City

library(choroplethr)
data(df_pop_county)

county_choropleth(df_pop_county,
                  title      = "2012 NY State County Population Estimates",
                  legend     = "Population",
                  num_colors = 1,
                  state_zoom = "new york")

The resulting map is informative: while there are a few high-population counties in northern New York, the real population centers are the southern counties around New York City. However, they are so small that it is hard to tell which of the five counties in New York City is the most populous. We can zoom in on those counties by setting county_zoom to a vector of numeric FIPS county codes.

# the county FIPS codes of the five counties that make up New York City
nyc_county_fips = c(36005, 36047, 36061, 36081, 36085)
county_choropleth(df_pop_county,
                  title       = "2012 NY City County Population Estimates",
                  legend      = "Population",
                  num_colors  = 1,
                  county_zoom = nyc_county_fips)

[Map: 2012 New York City county population estimates]

And now it is clear that Brooklyn and Queens are the most populated counties in New York City.

Users of past versions of choroplethr will note another change. The parameter that used to be called buckets is now called num_colors. I hope that this change will make it easier for users to understand exactly what the parameter does. By setting it to 1 we see a continuous scale. By setting it to any value in [2, 9] we see that many colors:

county_choropleth(df_pop_county,
                  title      = "2012 NY State County Population Estimates",
                  legend     = "Population",
                  num_colors = 9,
                  state_zoom = "new york")

The default value is still 7.

The choroplethr package used to support visualizing data from the US Census Bureau’s American Community Survey with the function choroplethr_acs. This function has now been broken up into two functions: state_choropleth_acs and county_choropleth_acs. Here are some examples of creating maps of US Per-capita Income. To learn more about the code B19301 (and how to find the codes of other interesting tables), see the article Mapping US Census Data.

# Per capita income of New York, New Jersey and Connecticut. 
states = c("new york", "new jersey", "connecticut")
state_choropleth_acs("B19301", 
                     num_colors = 1, 
                     zoom       = states)

To see more detail, we can look at the distribution of income in those states by county:

county_choropleth_acs("B19301", 
                      num_colors = 1, 
                      state_zoom = states)

[Map: Per capita income (B19301) by county in New York, New Jersey and Connecticut]

As before, it is difficult to see the differences between the five counties that make up New York City. We can zoom in on those by setting the county_zoom parameter:

# see above for the definition of nyc_county_fips
county_choropleth_acs("B19301", 
                      num_colors  = 1, 
                      county_zoom = nyc_county_fips)

[Map: Per capita income (B19301) by county in New York City]

As always, if you create any interesting maps using choroplethr please consider sharing them with me either via twitter or the choroplethr google group.