Political Analysis Using R


STATA users, probably. I am also making all the data and code easily available for you all via GitHub.


And of course, check out my R package that provides some tools for using data science to analyze politics, called politicaldata, which you can install via GitHub. Please see this page for my guide on how to analyze and visualize political data in R. It is not quite long enough to be a book, but will in the end be fairly comprehensive.

By default, R excludes any observation from either dataset that does not have a linked observation in the other dataset.

So if you use the defaults and the new dataset includes the same number of rows as the two old datasets, then all observations were linked and included. For instance, we could apply the dim command to each of the two input data frames and to the merged data frame. This would quickly tell us whether we have the same number of observations in both of the inputs, as well as the output dataset, showing we did not lose any observations. Other options within merge include all, all.x, and all.y, which force unmatched observations to be kept. In this case, R would encode NA values for observations that did not have a linked case in the other dataset.
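As a minimal sketch of the merging behavior described above, the following uses two small invented data frames (the names rights and un, and all values, are hypothetical and not the book's data). It compares the default merge with one that sets all=TRUE and checks the dimensions with dim.

```r
# Two small hypothetical data frames sharing a country code (invented values).
rights <- data.frame(COW = c(2, 20, 70), terror = c(1, 1, 3))
un <- data.frame(COW = c(2, 20, 200), un.votes = c(0.2, 0.7, 0.6))

# Default merge: only country codes present in both inputs are kept.
inner <- merge(x = rights, y = un, by = "COW")

# all = TRUE keeps unmatched rows from both inputs and fills gaps with NA.
outer <- merge(x = rights, y = un, by = "COW", all = TRUE)

# Compare the number of rows and columns in the inputs and the outputs.
dim(rights)
dim(un)
dim(inner)
dim(outer)
```

If the row count of the default merge is smaller than that of the inputs, some observations were not linked, which is the situation the dim check is meant to catch.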

As a final point of data management, sometimes we need to reshape our data. Our merged data set is stored in wide format. Wide format means that each row in our data defines an individual of study (a country) while our repeated observations are stored in separate variables. In most models of panel data, we need our data to be in long format, or stacked format. Long format means that we need two index variables to identify each row: one for the individual (e.g., country) and one for the time of observation (e.g., year).

Meanwhile, each variable occupies a single column. R allows us to reshape our data from wide to long, or from long to wide. Hence, whatever the format of our data, we can reshape it to our needs. To reshape our political terror data from wide format to long, we use the reshape command.


Within the command, the first argument is the name of the data frame we wish to reshape. The varying term lists all of the variables that represent repeated observations over time. Tip: Be sure that repeated observations of the same variable have the same prefix name. The timevar term allows us to specify the name of our new time index, which we call year.

The idvar term lists the variable that uniquely identifies individuals (countries, in our case). With direction we specify that we want to convert our data into long format. Lastly, the sep argument offers R a cue of what character separates our prefixes and suffixes in the repeated observation variables: in our case, a period.

A preview of the result can be seen by applying the head command to the reshaped data frame. We now have a new variable named year, so between COW and year, each row uniquely identifies each country-year. Since the data are naturally sorted, the top of our data only shows observations from a single year.
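The wide-to-long conversion described above can be sketched as follows. The data frame wide, its variable names, and its values are invented for illustration and are not the book's data; only the reshape arguments follow the description in the text.

```r
# Hypothetical wide data: one row per country, one column per year (invented values).
wide <- data.frame(COW = c(2, 20, 70),
                   terror.1993 = c(1, 1, 3),
                   terror.1994 = c(1, 2, 3),
                   terror.1995 = c(2, 1, 4))

# Stack the repeated observations into long format.
long <- reshape(wide,
                varying = c("terror.1993", "terror.1994", "terror.1995"),
                timevar = "year",      # name of the new time index
                idvar = "COW",         # variable identifying each country
                direction = "long",
                sep = ".")             # character between prefix and suffix

# Each row is now a country-year; the top rows all come from the first year.
head(long)
```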


As a final illustration, suppose we had started with a data set that was in long format and wanted one in wide format. To try this, we will reshape our long-format data back to wide format with another call to reshape. A few options have now changed: We now use the v.names argument to indicate the variables that include repeated observations. The timevar parameter now needs to be a variable within the dataset, just as idvar is, in order to separate individuals from repeated time points. Our direction term is now wide because we want to convert these data into wide format.

Lastly, the sep argument specifies the character that R will use to separate prefixes from suffixes in the final form. Applying the head command to the result lets us confirm that the data are back in wide format.

This chapter has covered the variety of means of importing and exporting data in R. It also has discussed data management issues such as missing values, subsetting, recoding data, merging data, and reshaping data. With the capacity to clean and manage data, we now are ready to start analyzing our data.
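For the long-to-wide conversion discussed above, a minimal sketch under invented data (the data frame long and its values are hypothetical) might look like this:

```r
# Hypothetical long data: one row per country-year (invented values).
long <- data.frame(COW = rep(c(2, 20, 70), times = 2),
                   year = rep(c(1993, 1994), each = 3),
                   terror = c(1, 1, 3, 1, 2, 3))

# Spread the repeated observations into one column per year.
wide <- reshape(long,
                v.names = "terror",   # variable holding the repeated observations
                timevar = "year",     # existing time index in the data
                idvar = "COW",        # existing country identifier
                direction = "wide",
                sep = ".")            # separator used to build the new names

head(wide)
```

The new column names are built from the prefix, the separator, and each value of the time index.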

We next proceed to data visualization.

As a practice dataset, we will download and open a subset of the American National Election Study used by Hanmer and Kalkan. These data are in Stata format, so be sure to load the correct library and use the correct command when opening. Hint: When using the proper command, pay attention to its convert options. The variables in this dataset all relate to the U.S. The variable exptrnout2 can be ignored.

Once you have loaded the data, do the following to check your work: (a) If you ask R to return the variable names, what does the list say? Is it correct? Use the summary command on the whole data set. What can you learn immediately? How many missing observations do you have?

Try subsetting the data in a few ways: (a) Create a copy of the dataset that removes all missing observations with listwise deletion. How many observations remain in this version?

Create two new indicator variables. The first should be coded 1 if the person identifies as Democrat in any way (including independents who lean Democratic), and 0 otherwise. The second new variable should be coded 1 if the person identifies as Republican in any way (including independents who lean Republican), and 0 otherwise. For each of these two new variables, what does the summary command return? What does the summary command return for this new variable? Use the table command to see the frequency of each category. Did you code the new version correctly?

Chapter 3 Visualizing Data

Visually presenting data and the results of models has become a centerpiece of modern political analysis. Many of political science's top journals, including the American Journal of Political Science, now ask for figures in lieu of tables whenever both can convey the same information. In fact, Kastellec and Leoni make the case that figures convey empirical results better than tables. Cleveland and Tufte wrote two of the leading volumes that describe the elements of good quantitative visualization, and Yau has produced a more recent take on graphing. Essentially these works serve as style manuals for graphics.

Do two variables substantively appear to correlate? What is the proper functional relationship between variables? How does a variable change over space or time? Answering these questions for oneself as an analyst and for the reader generally can raise the quality of analysis presented to the discipline. On the edge of this graphical movement in quantitative analysis, R offers state-of-the-art data and model visualization. Many of the commercial statistical programs have tried for years to catch up to R's graphical capacities.

This chapter showcases these capabilities, turning first to the plot function that is automatically available as part of the base package. Second, we discuss some of the other graphing commands offered in the base library. Finally, we turn to the lattice library, which allows the user to create Trellis Graphics, a framework for visualization. A more comprehensive history is presented by Beniger and Robyn.

Although space does not permit it here, users are also encouraged to look up the ggplot2 package, which offers additional graphing options. Chang, in particular, offers several examples of graphing with ggplot2. In this chapter, we work with two example datasets. The first is on health lobbying in the 50 American states, with a specific focus on the proportion of firms from the health finance industry that are registered to lobby (Lowery et al.).

    A key predictor variable is the total number of health finance firms open for business, which includes organizations that provide health plans, business services, employer health coalitions, and insurance. The dataset also includes the lobby participation rate by state, or number of lobbyists as a proportion of the number of firms, not only in health finance but for all health-related firms and in six other subareas.

These are cross-sectional data from a single year. The variable list begins with stno, a numeric index that orders the states alphabetically.

Second, we analyze Peake and Eshbaugh-Soha's data on the number of television news stories related to energy policy in a given month. In this data frame, the variables are:

Date: Character vector of the month and year observed.
Energy: Number of energy-related stories broadcast on nightly television news by month.
Unemploy: The unemployment rate by month.
Approval: Presidential approval by month.
Presidential speeches: Additional indicators are coded as 1 during the month a president delivered a major address on energy policy, and 0 otherwise. The names of these indicators begin with rmn, grf, or jec.

As a first look at our data, displaying a single variable graphically can convey a sense of the distribution of the data, including its mode, dispersion, skew, and kurtosis.

    The lattice library actually offers a few more commands for univariate visualization than base does, but we start with the major built-in univariate commands. Most graphing commands in the base package call the plot function, but hist and boxplot are noteworthy exceptions. The hist command is useful to simply gain an idea of the relative frequency of several common values.

We start by loading our data on energy policy television news coverage. Then we create a histogram of this time series of monthly story counts with the hist command. The file is available from the Dataverse or the chapter content link. You may need to use setwd to point R to the folder where you have saved the data. In the code, we begin by reading Peake and Eshbaugh-Soha's data. The data file itself is a comma-separated values file with a header row of variable names, so the defaults of read.csv suit our purposes.

Once the data are loaded, we plot a histogram of our variable of interest using the hist command, applied to the Energy variable. We use the xlab option, which allows us to define the label R prints on the horizontal axis. Since this axis shows us the values of the variable, we simply wish to see the phrase Television Stories, describing in brief what these numbers mean.

The main option defines a title printed over the top of the figure. In this case, the only way to impose a blank title is to include quotes with no content between them. The abline command is a flexible and useful tool. The name a-b line refers to the linear formula y = a + bx.

Hence, this command can draw lines with a slope and intercept, or it can draw a horizontal or vertical line. Here, a horizontal line is added to clarify where the base of the bars in the figure is. Finally, the box command encloses the whole figure in a box, often useful in printed articles for clarifying where graphing space ends and other white space begins.
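The histogram steps described above can be sketched with simulated counts. The vector stories is a hypothetical stand-in for the real series (the original data file is not reproduced here), and the gray shade is an arbitrary choice.

```r
set.seed(1)
# Simulated monthly story counts: a hypothetical stand-in for the real series.
stories <- rnbinom(180, size = 0.8, mu = 30)

# Histogram with a labeled horizontal axis and a blank title.
h <- hist(stories, xlab = "Television Stories", main = "")

# Horizontal line marking the base of the bars.
abline(h = 0, col = "gray60")

# Enclose the plotting region in a box.
box()
```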

    As the histogram shows, there is a strong concentration of observations at and just above 0, and a clear positive skew to the distribution. In fact, these data are reanalyzed in Fogarty and Monogan precisely to address some of these data features and discuss useful means of analyzing time-dependent media counts.


Another univariate graph is a box-and-whisker plot. R allows us to obtain this solely for the single variable, or for a subset of the variable based on some other available measure. We first draw this for a single variable by passing the story counts to the boxplot command.

In this case, the values of the monthly counts are on the vertical axis; hence, we use the ylab option to label the vertical axis (or y-axis label) appropriately. In the figure, the bottom of the box represents the first quartile value (25th percentile), the large solid line inside the box represents the median value (second quartile, 50th percentile), and the top of the box represents the third quartile value (75th percentile).

The whiskers, by default, extend to the lowest and highest values of the variable that are no more than 1.5 times the interquartile range beyond the box. The purpose of the whiskers is to convey the range over which the bulk of the data fall. Data falling outside of this range are portrayed as dots at their respective values. This boxplot fits our conclusion from the histogram: small values including 0 are common, and the data have a positive skew.
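To make the whisker rule concrete, here is a small sketch with an invented vector of counts (the values are hypothetical). The boxplot command returns the numbers it draws, so we can inspect the box, the whiskers, and the points plotted individually. Note that R computes the box edges as hinges, which are close to, but not always identical to, the quartiles.

```r
# Invented counts with one extreme value.
stories <- c(0, 0, 1, 2, 2, 3, 4, 5, 7, 9, 12, 40)

# Draw the box-and-whisker plot and keep the statistics it is based on.
b <- boxplot(stories, ylab = "Television Stories")

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
b$stats

# Values beyond the whiskers, drawn as individual points.
b$out
```

Here the upper whisker stops at the largest value within 1.5 times the box height above the box, and the one value beyond that limit is drawn as a separate point.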

Box-and-whisker plots also can serve to offer a sense of the conditional distribution of a variable. For our time series of energy policy coverage, the first major event we observe is Nixon's November speech on the subject. Hence, we might create a simple indicator where the first 58 months of the series (through October) are coded 0, and the remaining months of the series (November onward) are coded 1. Once we do this, the boxplot command allows us to condition on a variable.

The code first defines our pre- versus post-speech indicator. Notice here that we again define a vector with c. Within c, we use the rep command (for repeat). So rep(0,58) produces 58 zeroes, and a second call to rep produces the ones for the remaining months. The code then draws our boxplots, but we add two important caveats relative to our last call to boxplot: First, we list the variable of interest and the conditioning variable together in formula form. Second, we turn off the default axes.

This gives us more control over how the horizontal and vertical axes are presented. In the subsequent command, we add axis 1 (the bottom horizontal axis), adding text labels at the tick marks of 1 and 2 to describe the values of the conditioning variable. Afterward, we add axis 2 (the left vertical axis), and a box around the whole figure.
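A sketch of the conditional boxplot described above, using simulated counts. The series length of 120 months, the means, and the axis labels are invented for illustration; only the 58 pre-speech months come from the text.

```r
set.seed(2)
# Simulated counts: 58 low-coverage months, then 62 higher-coverage months.
energy <- c(rpois(58, 5), rpois(62, 40))

# Indicator: 0 for the first 58 months, 1 afterward.
post <- c(rep(0, 58), rep(1, 62))

# Conditional boxplots, with the default axes turned off.
boxplot(energy ~ post, axes = FALSE, ylab = "Television Stories")

# Bottom axis with text labels at the two box positions.
axis(1, at = c(1, 2), labels = c("Before speech", "After speech"))

# Left axis and a box around the figure.
axis(2)
box()
```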

The conditional boxplots show a clear contrast. Much smaller values persist before Nixon's speech, while there is a larger mean and a greater spread in values afterward. Of course, this is only a first look and the effect of Nixon's speech is confounded with a variety of factors (such as the price of oil, presidential approval, and the unemployment rate) that contribute to this difference. Bar graphs can be useful whenever we wish to illustrate the value some statistic takes for a variety of groups as well as for visualizing the relative proportions of nominal or ordinally measured data.

For an example of barplots, we turn now to the other example data set from this chapter, on health lobbying in the 50 American states. Lowery et al. present a figure of the mean lobby participation rate for eight groupings of health-related firms. We can recreate that figure in R by taking the means of these eight variables and then applying the barplot function to the set of means. First we must load the data. To do this, download Lowery et al.'s data file. Again, you may need to use setwd to point R to the folder where you have saved the data. Since these data are in Stata format, we must use the foreign library and then the read.dta command. To create the actual figure itself, we can create a subset of our data that only includes the eight predictors of interest and then use the apply function to obtain the mean of each variable.

In this case, the subset contains only the eight variables of interest. The apply command allows us to take a matrix or data frame and apply a function of interest (mean) to either its rows or its columns. The 2 that is the second component of this command therefore tells apply that we want to apply mean to the columns of our data. By contrast, an argument of 1 would apply to the rows. Row-based computations would be handy if we needed to compute some new quantity for each of the 50 states. If we simply type the name of the resulting object into the R console, it will print the eight means.
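The difference between the two margin arguments of apply can be seen in a tiny sketch. The data frame rates and its values are invented for illustration.

```r
# Invented participation rates for three states and two groupings.
rates <- data.frame(finance = c(10, 20, 30),
                    advocacy = c(2, 4, 6))

# Margin 2: apply mean to each column (one mean per variable).
col.means <- apply(rates, 2, mean)

# Margin 1: apply mean to each row (one mean per state).
row.means <- apply(rates, 1, mean)

col.means
row.means
```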

To set up our figure in advance, we can attach an English-language name to each quantity that will be reported in our figure's margin. We do this with the names command, and then assign a vector with a name for each quantity.

The plotting code first calls the par command, which allows the user to change a wide array of defaults in the graphing space. In our case, we use it to adjust the margins. In general, the margins are listed as bottom, left, top, then right. Anything adjusted with par is reset to the defaults after the plotting window (or device, if writing directly to a file) is closed.

Next, we actually use the barplot command. The main argument is the vector of means. The default for barplot is to draw a graph with vertical bars. We also use options from the cex family to size the text. Finally, we use the text command to print the mean for each lobby registration rate at the end of the bar. The text command is useful any time we wish to add text to a graph, be these numeric values or text labels. This command takes x coordinates for its position along the horizontal axis, y coordinates for its position along the vertical axis, and labels values for the text to print at each spot.

We turn now to plot, the workhorse graphical function in the base package. The plot command lends itself naturally to bivariate plots. To see the full set of arguments that one can call using plot, type args(plot.default). Obviously there is a lot going on underneath the generic plot function.
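The bar graph steps described above (naming the quantities, widening a margin with par, drawing the bars, and printing each mean with text) can be sketched as follows. The three means, the margin sizes, and the choice of horizontal bars are assumptions made for illustration; they are not taken from the original figure.

```r
# Invented means with an English-language name for each quantity.
lobby.means <- c(12.5, 22.1, 0.8)
names(lobby.means) <- c("Health Finance", "Health Advocacy", "Direct Patient Care")

# Margins are listed as bottom, left, top, right; widen the left one for names.
old.par <- par(mar = c(5.1, 10, 4.1, 2.1))

# Horizontal bars; barplot returns the midpoint of each bar.
bar.mid <- barplot(lobby.means, horiz = TRUE, las = 1, cex.names = 0.8,
                   xlim = c(0, 30), xlab = "Percent Lobby Registration")

# Print each mean just past the end of its bar.
text(x = lobby.means, y = bar.mid, labels = round(lobby.means, 1), pos = 4)
box()

# Restore the previous graphing parameters.
par(old.par)
```

Using the midpoints returned by barplot for the y coordinates avoids having to guess where each bar sits.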

For the purpose of getting started with figure creation in R we want to ask what is essential. The answer is straightforward: one variable x must be specified. Everything else has either a default value or is not essential. To start experimenting with plot, we continue to use the state health lobbying data loaded earlier. With plot, we can plot the variables separately with the command plot(varname), though this is definitively less informative than the kinds of graphs just presented. That said, if we simply wanted to see all of the observed values of the lobby participation rate by state of health finance firms (partratebusness), we could pass that variable alone to plot.

Note that the resulting figure plots the lobby participation rate against the row number in the data frame: With cross-sectional data this index is essentially meaningless. By contrast, if we were studying time series data, and the data were sorted on time, then we could observe how the series evolves over time. Note that we use the ylab option because otherwise the default will label our vertical axis with the tacky-looking name of the R object. Try it, and ask yourself what a journal editor would think of how the output looks.

Of course, we are more often interested in bivariate relationships. We can explore these by supplying plot with both a vertical-axis and a horizontal-axis variable, here the lobby participation rate and the number of health finance firms. The resulting graph shows what appears to be a decrease in the participation rate as the number of firms rises, perhaps in a curvilinear relationship. One useful tool is to plot the functional form of a bivariate model onto the scatterplot of the two variables. In the case of our scatterplot, we may want to compare how a linear function and a quadratic function fit. To do this, we can fit two linear regression models, one that includes a linear function of number of firms, and the other that includes a quadratic function.

Additional details on regression models are discussed in a later chapter. The lm (linear model) command fits our models, and the summary command summarizes our results. With the model that is a linear function of number of firms, we can simply feed the name of our fitted model into the abline command to add the fitted regression line to the plot.

As mentioned before, the abline command is particularly flexible. A user can specify a as the intercept of a line and b as the slope. A user can specify h as the vertical-axis value where a horizontal line is drawn, or v as the horizontal-axis value where a vertical line is drawn. Or, in this case, a regression model with one predictor can be inserted to draw the best-fitting regression line. Alternatively, we could redraw this plot with the quadratic relationship sketched on it. Unfortunately, despite abline's flexibility, it cannot draw a quadratic relationship. The easiest way to plot a complex functional form is to save the predicted values from the model, reorder the data based on the predictor of interest, and then use the lines function to add a connected line of all of the predictions.

Be sure the data are properly ordered on the predictor, otherwise the line will appear as a jumbled mess. Within the model formula, I means "as is," so it allows us to compute a mathematical formula on the fly. After redrawing our original scatterplot, we estimate our quadratic model and save the fitted values to our data frame as a new variable. We then reorder our data frame on the predictor. This is done by using the order command, which lists vector indices in order of increasing value.

Finally, the lines command takes our predicted values as the vertical coordinates (y) and our values of the number of firms as the horizontal coordinates (x). This adds the line to the plot showing our quadratic functional form. So far, our analyses have relied on the plot default of drawing a scatterplot. In time series analysis, though, a line plot over time is often useful for observing the properties of the series and how it changes over time.
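The linear and quadratic fits described above can be sketched on simulated data. The data frame dat, its variable names, and the data-generating values are hypothetical; the steps (fit with lm, add the line with abline, save fitted values, reorder with order, draw with lines) follow the text.

```r
set.seed(3)
# Simulated state-level data (hypothetical values).
firms <- round(runif(50, 5, 300))
rate <- 40 - 0.25 * firms + 0.0005 * firms^2 + rnorm(50, sd = 3)
dat <- data.frame(firms = firms, rate = rate)

# Scatterplot of the two variables.
plot(y = dat$rate, x = dat$firms,
     ylab = "Lobby Participation Rate", xlab = "Number of Firms")

# Linear model: abline accepts a fitted model with one predictor.
lin <- lm(rate ~ firms, data = dat)
abline(lin)

# Quadratic model: I() computes the squared term "as is" inside the formula.
quad <- lm(rate ~ firms + I(firms^2), data = dat)

# Save the fitted values, then sort the data on the predictor.
dat$quad.fit <- fitted(quad)
dat <- dat[order(dat$firms), ]

# Connect the ordered predictions to trace the quadratic form.
lines(y = dat$quad.fit, x = dat$firms, lty = 2)
```

Skipping the reordering step would connect the predictions in row order, producing the jumbled line the text warns about.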

Further information on this is available in a later chapter. Returning to the data on television news coverage of energy policy first raised earlier in this chapter, we can draw a line plot of the monthly story counts. In this case, we have turned off the axes because the default tick marks for month are not particularly meaningful. Instead, we use the axis command to insert a label for the first month of the year every 3 years, offering a better sense of real time.

Notice that in our first call to axis, we use the cex.axis option to shrink the text of the labels. This allows all five labels to fit in the graph. By trial and error, you will see that R drops axis labels that will not fit rather than overprint text. Finally, we use abline to show the zero point on the vertical axis, since this is a meaningful number that reflects the complete absence of energy policy coverage in television news.
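A sketch of the line plot described above, with simulated counts over 180 hypothetical months. The labels "Year 1", "Year 4", and so on are placeholders for real dates, and the text size is an arbitrary choice.

```r
set.seed(4)
# Simulated monthly counts, sorted in time order (hypothetical values).
stories <- c(rpois(48, 4), rpois(132, 35))

# Line plot with the default axes turned off.
plot(stories, type = "l", axes = FALSE,
     xlab = "Month", ylab = "Television Stories")

# One label every 36 months, shrunk so that all five labels fit.
tick.at <- seq(1, 180, by = 36)
axis(1, at = tick.at, labels = paste("Year", seq(1, 13, by = 3)), cex.axis = 0.7)

# Left axis, a reference line at zero, and a box.
axis(2)
abline(h = 0, col = "gray60")
box()
```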

    As our earlier figures demonstrated, we see much more variability and a higher mean after the first 4 years. Again, the data are sorted, so only one variable is necessary. Having tried our hand with plots from the base package, we will now itemize in detail the basic functions and options that bring considerable flexibility to creating figures in R.

Bear in mind that R actually offers the useful option of beginning with a blank slate and adding items to the graph bit-by-bit.

The Coordinate System: In our earlier figures, the data effectively established the coordinate system for us. But often, you will want to establish the dimensions of the figure before plotting anything, especially if you are building up from the blank canvas. The most important point here is that your x and y must be of the same length. This is perhaps obvious, but missing data can create difficulties that will lead R to balk.

Plot Types: We now want to plot these series, but the plot function allows for different types of plots through its type option. One of these draws histogram-like vertical lines, also called a spike plot.

Axes: It is possible to turn off the axes, to adjust the coordinate space by using the xlim and ylim options, and to create your own labels for the axes.

Style: There are a number of options to adjust the style in the figure, including changes in the line type, line weight, color, point style, and more. Similarly, the cex family of options adjusts the size of text and symbols.

Graphing Parameters: The par function brings added functionality to plotting in R by giving the user control over the graphing parameters. One noteworthy feature of par is that it allows you to plot multiple calls to plot in a single graphic. Be careful, though. Any time you use this strategy, include the xlim and ylim commands in each call to make sure the graphing space stays the same. Also be careful that graph margins are not changing from one call to the next.

There are also a number of add-on functions that one can use once the basic coordinate system has been created using plot. These include:

arrows(x1, y1, x2, y2): Create arrows within the plot (useful for labeling particular data points, series, etc.).

For functions that take a side argument, set the side to 1 for bottom, 2 for left, 3 for top, and 4 for right. This lets you add an axis label to one of the sides with more control over how the label is presented.
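To illustrate building a figure from a blank canvas, here is a minimal sketch with invented series. It assumes type="n" for the empty plot, mtext for the margin labels, and par(new=TRUE) for overlaying a second call to plot; these are standard base R tools chosen for illustration, not necessarily the exact calls the original text used.

```r
# Invented series of equal length.
x <- 1:10
y1 <- c(2, 4, 3, 6, 8, 7, 9, 12, 11, 14)
y2 <- c(1, 1, 2, 2, 3, 5, 4, 6, 8, 7)

# Blank canvas: fix the coordinate system but draw nothing yet.
plot(x, y1, type = "n", xlim = c(0, 11), ylim = c(0, 15),
     axes = FALSE, xlab = "", ylab = "")

# Add items bit by bit: a line, points, an arrow, and a text label.
lines(x, y1, lty = 1, lwd = 2)
points(x, y2, pch = 19, cex = 0.8)
arrows(x0 = 3, y0 = 12, x1 = 7.8, y1 = 12, length = 0.1)
text(x = 3, y = 12, labels = "Peak", pos = 2)

# Axes and margin labels: side 1 is the bottom, side 2 is the left.
axis(1)
axis(2, las = 1)
mtext("Time", side = 1, line = 2.5)
mtext("Value", side = 2, line = 2.5)
box()

# Overlay a second call to plot, repeating xlim and ylim so the
# graphing space stays the same; type = "h" draws a spike plot.
par(new = TRUE)
plot(x, y2, type = "h", xlim = c(0, 11), ylim = c(0, 15),
     axes = FALSE, xlab = "", ylab = "")
```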



As an alternative to the base graphics package, you may want to consider the lattice add-on package. This package produces trellis graphics from the S language, which tend to make better displays of grouped data and numerous observations. To start, the first time we use the lattice library, we must install it. Then, on every reuse of the package, we must call it with the library command. To draw a scatterplot with lattice, we use the xyplot command. Both variables are listed together in a single argument, using a formula in which the vertical-axis variable comes first. By default lattice colors results cyan in order to allow readers to easily separate data information from other aspects of the display, such as axes and labels (Becker et al.).

Also, by default, xyplot prints tick marks on the third and fourth axes to provide additional reference points for the viewer. The lattice package also contains functions that draw graphs that are similar to a scatterplot, but instead use a rank-ordering of the vertical axis variable. This is how the stripplot and dotplot commands work, and they offer another view of a relationship and its robustness. The dotplot command may be somewhat more desirable as it also displays a line for each rank-ordered value, offering a sense that the scale is different.
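A sketch of the lattice commands discussed above, on simulated data. The data frame health, its variable names, and its values are hypothetical. The plots are stored as objects and printed explicitly, which is needed when lattice code runs inside a script or function.

```r
library(lattice)

set.seed(5)
# Simulated state-level data (hypothetical values).
health <- data.frame(firms = round(runif(50, 5, 300)))
health$rate <- 40 - 0.1 * health$firms + rnorm(50, sd = 3)

# Scatterplot: vertical-axis variable first, then the horizontal-axis variable.
scatter <- xyplot(rate ~ firms, data = health, col = "black",
                  ylab = "Lobby Participation Rate", xlab = "Number of Firms")
print(scatter)

# Dot plot: the vertical-axis variable is shown in rank order.
dots <- dotplot(rate ~ firms, data = health, col = "black",
                ylab = "Lobby Participation Rate (Rank Order)",
                xlab = "Number of Firms")
print(dots)
```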

The stripplot function uses similar syntax. Lastly, the lattice library again gives us an option to look at the distribution of a single variable by plotting either a histogram or a density plot. Returning to the presidential time series data we first loaded earlier in this chapter, we can draw a density plot of the monthly story counts. This output shows points scattered along the base, each representing the value of an observation. The smoothed line across the graph represents the estimated relative density of the variable's values.
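The single-variable lattice displays can be sketched as follows, again with simulated counts standing in for the real series (the data frame tv and its values are hypothetical). Both commands take a one-sided formula because there is only one variable.

```r
library(lattice)

set.seed(6)
# Simulated monthly story counts (hypothetical values).
tv <- data.frame(stories = c(rpois(58, 5), rpois(62, 40)))

# Density plot: smoothed density with the observations marked along the base.
dens <- densityplot(~ stories, data = tv, col = "black",
                    xlab = "Television Stories")
print(dens)

# Histogram of the same variable, overriding the default bar color.
hst <- histogram(~ stories, data = tv, col = "gray60",
                 xlab = "Television Stories")
print(hst)
```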

We can also draw a histogram with lattice's histogram command. The default again is for cyan-colored bars. A final interesting feature of histogram is left to the reader: The function will draw conditional histogram distributions. If you still have the post-speech indicator defined earlier, you might try conditioning on it.

A final essential point is a word on how users can export their R graphs into a desired word processor or desktop publisher. The first option is to save the screen output of a figure.

On Mac machines, the user may select the figure output window and then use the dropdown menu File > Save As. On Windows machines, a user can simply right-click on the figure output window itself and then choose to save the figure as either a metafile (which can be used in programs such as Word) or as a postscript file (for use in LaTeX). Also by right-clicking in Windows, users may copy the image and paste it into Word, PowerPoint, or a graphics program.

A second option allows users more precision over the final product. Specifically, the user can write the graph to a graphics device, of which there are several options. For example, in writing this book, I exported one of the earlier figures with the postscript command, which created a postscript file. Among the key options in this command are width and height, each of which I set to three inches. The pointsize option shrank the text and symbols to neatly fit into the space I allocated. The horizontal option changes the orientation of the graphic from landscape to portrait orientation on the page.

Change it to TRUE to have the graphic adopt a landscape orientation. Once postscript was called, all graphing commands wrote to the file and not to the graphing window. Hence, it is typically a good idea to perfect a graph before writing it to a graphics device. Thus, the plot and abline commands served to write all of the output to the file. Once I was finished writing to the file, the dev.off command closed the file. Of course postscript graphics are most frequently used by writers who use the desktop publishing language of LaTeX.
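The export steps described above can be sketched as follows. The file name, the pointsize value, and the plotted data are hypothetical; the file is written to a temporary folder so the sketch does not clutter the working directory. The onefile and paper settings are assumptions that are commonly used when the file holds a single figure for inclusion in another document.

```r
# Hypothetical output file in a temporary folder.
out.file <- file.path(tempdir(), "example.eps")

# Open the postscript device: 3 x 3 inches, portrait orientation.
postscript(out.file, horizontal = FALSE, width = 3, height = 3,
           onefile = FALSE, paper = "special", pointsize = 7)

# All graphing commands now write to the file, not to the screen.
plot(x = 1:10, y = (1:10)^2, xlab = "x", ylab = "x squared")
abline(h = 0, col = "gray60")

# Close the device to finish writing the file.
dev.off()

file.exists(out.file)
```

Other devices such as pdf, png, and jpeg follow the same pattern of opening the device, drawing, and closing with dev.off.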

Writers who use more traditional word processors such as Word or Pages will want to use other graphics devices. Be sure to consult the help files for the device you choose. As a special circumstance, graphs drawn from the lattice package use a different graphics device, called trellis.device. It is technically possible to use the other graphics devices to write to a file, but unadvisable because the device options may not carry over to the graph.
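A minimal sketch of writing a lattice graph to a file with trellis.device. The file name, the font sizes in the theme, and the plotted data are hypothetical choices for illustration; the file and size arguments are passed along to the underlying pdf device.

```r
library(lattice)

set.seed(7)
# Simulated counts (hypothetical values).
stories <- rpois(120, 20)

# Hypothetical output file in a temporary folder.
out.file <- file.path(tempdir(), "hist-example.pdf")

# Open a lattice device: device type, file, theme (font sizes), and size.
trellis.device("pdf", file = out.file,
               theme = list(fontsize = list(text = 9, points = 8)),
               width = 4, height = 4)

# Lattice plots must be printed to be drawn on the open device.
print(histogram(~ stories, xlab = "Television Stories"))

# Close the device to finish writing the file.
dev.off()

file.exists(out.file)
```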

The first argument of the trellis.device command names the type of file to create. Besides postscript, the author can use jpeg, pdf, or png. The second argument lists the file to write to. Font and character size must be set through the theme option, and the remaining arguments declare the other preferences about the output.

This chapter has covered univariate and bivariate graphing functions in R. Several commands from both the base and lattice packages have been addressed. This is far from an exhaustive list of R's graphing capabilities, and users are encouraged to learn more about the available options. This primer should, however, serve to introduce users to various means by which data can be visualized in R. With a good sense of how to get a feel for our data's attributes visually, the next chapter turns to numerical summaries of our data gathered through descriptive statistics.

In addition to their analysis of energy policy coverage introduced in this chapter, Peake and Eshbaugh-Soha also study drug policy coverage. These data similarly count the number of nightly television news stories in a month focusing on drugs. Their data are saved in comma-separated format in the drugCoverage file. Download their data from the Dataverse or the chapter content link. The variables in this data set are: a character-based time index showing month and year (Year), news coverage of drugs (drugsmedia), an indicator for a speech on drugs that Ronald Reagan gave in September (rwr86), an indicator for a speech George H. W. Bush gave in September (ghwb89), the president's approval rating (approval), and the unemployment rate (unemploy).

Draw a histogram of the monthly count of drug-related stories. You may use either of the histogram commands described in the chapter.

Draw two boxplots: One of drug-related stories and another of presidential approval. How do these figures differ and what does that tell you about the contrast between the variables?

Draw two scatterplots: (a) In the first, represent the number of drug-related stories on the vertical axis, and place the unemployment rate on the horizontal axis. What do they tell you about the data?

Draw two line graphs: (a) In the first, draw the number of drug-related stories by month over time.

Load the lattice library and draw a density plot of the number of drug-related stories by month.

Bonus: Draw a bar graph of the frequency of observed unemployment rates. Hint: Try using the table command to create the object you will graph. Can you go one step further and draw a bar graph of the percentage of time each value is observed?

Chapter 4 Descriptive Statistics

Before developing any models with or attempting to draw any inferences from a data set, the user should first get a sense of the features of the data. This can be accomplished through the data visualization methods described in Chap. 3, as well as through descriptive statistics. Ideally, the user will perform both tasks, regardless of whether the results become part of the final published product. A traditional recommendation to analysts who estimate functions such as regression models is that the first table of the article ought to describe the descriptive statistics of all input variables and the outcome variable.

    While some journals have now turned away from using scarce print space on tables of descriptive statistics, a good data analyst will always create this table for him or herself.


Frequently this information can at least be reported in online appendices, if not in the printed version of the article. As we work through descriptive statistics, the working example in this chapter will be policy-focused data from LaLonde's analysis of the National Supported Work Demonstration, a program that helped long-term unemployed individuals find private sector jobs and covered the labor costs of their employment for a year.

The variables in this data frame include treated, an indicator variable for whether the participant received the treatment.
