The analysis of geospatial information is currently a big trend in medicine and public health. Even though some may want to convince you that this can only be achieved with the latest and most expensive software, I am not convinced. First, the analysis of spatial data dates back to at least 1854, when John Snow investigated a cholera outbreak in London. Second, as I try to demonstrate today, some very interesting analyses and data can be retrieved essentially for free.

While a previous post already showed how to plot freely available geospatial data in R, this post will show you how to use Python to access the Google Maps database and gather, e.g., travel times and distances to/from various locations with known zip codes.

Please note that this is my first Python script, so it will certainly not meet the high standards you might have developed based on previous posts. On the upside, you will get baby-step instructions.

Update 2011/07/03: A much more user-friendly version of the script, which adds GUIs to select a proper CSV file containing start and end addresses and to store the results, can be found here. If you are afraid of Python, you can use the stand-alone Mac app “batchtimer” from here, which basically contains all the necessary files.

A. Installing Python

  1. Download and install Python and the Python setuptools package so that you can use easy_install.

  2. Install the google.directions package: just type easy_install google.directions

B. Run the script

The complete script as well as an example file with zip codes can be downloaded here.

Here is a more thorough description of what it does. Parts you may want to adapt are pointed out along the way. Basically, the script consists of four parts.

1. Load the necessary packages and set-up (you need a google directions key).

import csv
from google.directions import GoogleDirections

gd = GoogleDirections("your-google-directions-key")  # insert your own key here

2. Read zip-codes from a file

Reading the zip codes from the example file looks like this:

zip_codes = csv.reader(open('/zips.csv', "rb"), delimiter=' ', quotechar='|')  # adapt the path to your file
zips = list(zip_codes)
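The example file is just a plain-text list of zip codes, one per line. As a rough illustration (the codes below are made up for this sketch, not taken from the download), here is how csv.reader turns such a file into a list of rows; this sketch is Python 3 and fakes the file with io.StringIO, while the original script opens a real file the Python 2 way with "rb":

```python
import csv
import io

# Hypothetical stand-in for the downloadable zips.csv: one zip code per line.
sample = "80331\n10115\n20095\n"

# Same reader settings as in the script; io.StringIO plays the role of the open file.
zip_codes = csv.reader(io.StringIO(sample), delimiter=' ', quotechar='|')
zips = list(zip_codes)
print(zips)  # [['80331'], ['10115'], ['20095']]
```

Note that each row comes back as a one-element list, which is why the loop below indexes into it.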

3. Loop through the list of zips

times = []
miles = []
for i in range(len(zips)):
    start = (zips[i][0] + ", Germany")  # zips[i] is a one-element row; adapt the country
    end = ("BERLIN, Germany")           # adapt the destination
    res = gd.query(start, end)
    temp = res.result["Directions"]["Duration"]["seconds"]
    times.append(temp)
    miles.append(res.distance)
    print i

Please check if the distance is given in miles or km!
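If the API hands you miles but you want kilometres, the fix is a one-liner (1 mile is defined as exactly 1.609344 km):

```python
# Convert miles to kilometres; 1 mile = 1.609344 km exactly.
def miles_to_km(miles):
    return miles * 1.609344

print(miles_to_km(100))  # 160.9344
```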

4. Write the output

out = csv.writer(open('/results.csv', 'wb'), delimiter=';', quotechar='X', quoting=csv.QUOTE_MINIMAL)  # adapt the path
for i in range(len(times)):
    out.writerow([zips[i][0], str(times[i]), str(miles[i])])  # pass a list, so writerow does not split a string into single characters

Sometimes when working with small paired data sets it is nice to see/show all the data in a structured form. For example, when looking at pre-post comparisons, connected dots are a natural way to visualize which data points belong together. In R this can easily be combined with boxplots expressing the overall distribution of the data. This also has the advantage of being more true to non-normal data that is not correctly represented by means +/- 95% CI. I have not come across a good tutorial on how to do such a plot (although the right-hand plot borrows heavily from this post on the excellent R mailing list), so in this post you will find the code to generate such a graph in R.

Here comes the code (Update 05.12.2011 without umlauts):

# generating some data
pre <- 55 + rnorm(20)
post <- pre + 0.7 + rnorm(20)

# setting up two screens
par(mfrow=c(1,2))

# first graph
s <- seq(length(pre))
par(bty="l")
boxplot(pre, post, main="Raw data", xlab="Time", ylab="Measure", names=c("pre","post"), col=c("lightblue","lightgreen"))
stripchart(list(pre,post), vertical=T, pch=16, method="jitter", cex=0.5, add=T)
segments(rep(0.95,length(pre))[s], pre[s], rep(2,length(pre))[s], post[s], col=1, lwd=0.5)

# second graph
# confidence intervals: either parametric (t.test) or non-parametric (wilcox.test)
# res <- t.test(post, pre, paired=T, conf.int=T)
res <- wilcox.test(post, pre, paired=T, conf.int=T)

stripchart(post-pre, vertical=T, pch=16, method="jitter", main="Difference", ylab="Difference: Post - Pre", xlab="Median +/- 95% CI")
points(1, res$estimate, col="red", pch=16, cex=2)
arrows(1, res$conf.int[1], 1, res$conf.int[2], col="red", code=3, lwd=3, angle=90)
abline(h=0, lty=2)  # zero-effect line

RStudio is a graphical user interface for R. Or, as the developers put it:

RStudio™ is a new integrated development environment (IDE) for R. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R. [![](http://www.surefoss.org/wp-content/uploads/2011/05/rstudio-300x170.jpg)](http://www.surefoss.org/wp-content/uploads/2011/05/rstudio.jpg)

While there have been a few similar projects (e.g. R Commander, RKWard, JaguaR), RStudio is the first I will probably integrate into my workflow. The Mac GUI I work with is already great and has some essential features like syntax highlighting out of the box, but I will recommend RStudio to anyone considering to start working with R, and to anyone else asking me about statistics.

I just want to highlight two features which can ease the learning curve for R.

  1. Getting data into R. RStudio has a nice import-dataset feature that can be used to read text files, something that can be really frustrating in the beginning.

  2. Navigating the data. You can view a data set by just clicking it in the workspace. I really hope that this will stay a read-only feature, because everything else is simply not the way to go.

Cons:

  • Umlauts are not yet supported, but that seems like only a matter of time with these guys.

[en]Every now and then we want to get others, who are less inclined to work with the console, to use some bash files. As most of the scripts we write take only a few options, adding a nice GUI to the bash file makes it more likely that others will use it. As always, this can be done very quickly and for free in Linux.

All you need is to install the program zenity and then write a script that calls zenity to set the value of a parameter.

sudo apt-get install zenity

Then your next “program” is only a couple of lines away.

My first “project” (I will probably laugh about this one in a couple of weeks) was a little GUI that replaces the text string “9999999” in a text file with a string entered via the GUI and saves the result under a new filename. You only need to save the following code in a bash file, e.g. “replace.sh”, and make it executable.

#!/bin/bash
cd /home/vikp/Desktop/HLT
var=$(zenity --entry --text "Enter the replacement text" --entry-text "")
sed s/9999999/$var/g *.txt > newname_$var.dat

The reason I needed this was that I wanted a wrapper for PsyToolkit - the brilliant Linux program for reaction-time experiments - that asks for a participant code and adds this code to the result file. The complete wrapper for PsyToolkit now looks like this:

#!/bin/bash
cd /home/vikp/Desktop/HLT
VP_name=$(zenity --entry --text "Participant-Code" --entry-text "")
./experiment
sed s/9999999/$VP_name/g hlt_5.psy.*.data > HLT_VP_$VP_name.dat
[/en]


[en]Many different tools are available for online process data collection.

Flashlight is one example: an open-source, web-based software package that can be used to collect continuous and non-obtrusive measures of users’ information-acquisition behavior. Flashlight offers a cost-effective and rapid way to collect data on how long and how often a participant reviews information in different areas of a visual stimulus. It provides the functionality of other open-source process-tracing tools, like MouselabWeb, and adds the capability to present any static visual stimulus.

Flashlight is based on the idea that the original stimulus is overlaid with a blurred version. Around the mouse cursor, a ‘focus area’ reverses this blurring and shows the original, un-blurred version of the stimulus. Using the mouse, a participant can easily explore the stimulus. Length and frequency of fixations are calculated from the collected data and can be downloaded for further analysis via the available scripts.[/en]


[en] I recently came across an excellent paper, “On the Practice of Dichotomization of Quantitative Variables” by MacCallum and colleagues (2002). As I use ANOVAs a lot in my research, it really got me thinking about the whole issue. Even though I have no great idea for an innovative simulation study, you might have one. If you read through this post, you will notice that it’s really simple - at least the technical part.

I will only explain the two-variable scenario, but the setup is basically the same for more complex setups. Let’s start with their small numerical example before turning to the simulation study.

Setting up the packages

As some of you might have noticed, I try to be consistent in how I structure my R scripts. The first step is always to load packages and set up a working directory. As we do not read in any data, the latter is omitted. However, I want to set a specific seed for the random number generation so that the results are reproducible.

require(MASS)
require(corpcor)
set.seed(2901)     #to have reproducible results

Setting up the parameters and generating data

We will use the mvrnorm function from the MASS package to simulate the data. It takes three arguments: the sample size n, the means of the variables mu, and a covariance matrix Sigma. In case you do not know how to translate a set of correlations between variables into a positive definite covariance matrix, you will also need the rebuild.cov function from the corpcor package. With these you can generate a sample of 50 participants with five lines of code.

m_x
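The R lines after m_x were apparently swallowed by the HTML. As a rough sketch of the same idea in plain Python (the population correlation of .40 is my own choice for illustration, not necessarily the value used in the paper):

```python
import math
import random

random.seed(2901)  # mirrors set.seed(2901) for reproducibility

def simulate_pair(n, rho):
    """Draw n observations from a standard bivariate normal with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        # y is constructed so that corr(x, y) = rho in the population
        y = rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return xs, ys

x, y = simulate_pair(50, rho=0.4)  # a sample of 50 "participants"
print(len(x), len(y))  # 50 50
```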

Inspect the results

I always like names better than subscripts, so here comes a completely unnecessary step.

names(temp)

temp$X1_d

The statistics can be compared with the following commands.

cor.test(X1, Y1)

t.test(Y1~temp$X1_d)
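For readers without R at hand, here is what the comparison boils down to, sketched in plain Python: correlate the continuous variables, then median-split one of them and correlate again. The toy numbers are invented for this sketch; with a median split, the correlation is typically attenuated:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

def median_split(xs):
    """Dichotomize at the median: 1 for values >= the median, else 0."""
    m = sorted(xs)[len(xs) // 2]
    return [1.0 if a >= m else 0.0 for a in xs]

x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
print(pearson(x, y))                # close to 1
print(pearson(median_split(x), y))  # clearly smaller
```

This is exactly the loss of information the paper warns about: throwing away the within-group variation shrinks the observed correlation.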

If you look at the p-values, you will note that both kinds of tests give you a significant effect. If you find a seed that does not, please write me about it.

Running their small scale study

First, we have to define a function that counts how often the dichotomized analysis gives us larger estimates for the correlation than the original analysis, given a specific correlation in the population and a sample size. I called it overshoot because this is most likely due to sampling bias, as argued in the original paper.

I generated the following mainly by using the “extract function” feature in RStudio.

overshoot
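The body of the function was also lost in the HTML, so here is a hedged reconstruction of the idea in plain Python (function and parameter names are mine): for a given population correlation and sample size, count in how many simulated samples the median-split correlation comes out larger in absolute value than the continuous one.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

def overshoot(rho, n, reps=200, seed=2901):
    """Count samples whose median-split correlation exceeds the continuous one."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        m = sorted(xs)[n // 2]
        dich = [1.0 if x >= m else 0.0 for x in xs]
        if abs(pearson(dich, ys)) > abs(pearson(xs, ys)):
            count += 1
    return count

# overshooting gets rarer as the population correlation grows
print(overshoot(rho=0.1, n=40), overshoot(rho=0.7, n=40))
```

Looping this over a grid of correlations and sample sizes, with 10,000 repetitions instead of 200, gives a table of the kind shown below.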

Inspect the results

And here are the results.

[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 4481 4129 3790 3550 3347 3042 2869
[2,] 3359 2402 1642 1087  809  592  416
[3,] 2201 1011  380  141   51   24    5
[4,]  971  154   16    3    1   NA   NA
[5,]   77    1   NA   NA   NA   NA   NA

Thanks to the large number of samples (10,000) drawn from the population, the results are very similar to the published data. Hope you liked it.

Reference

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40.

[/en]



[en]One of the problems when putting a website on the web is that you never know how your site is received.

To find out, there is a simple tool by Google, which works well. The disadvantage, however, is that you have to throw all your data into the kraken before you get even a tiny bit of information back.

An alternative is Piwik. Using this open-source tool, you can collect, store, and analyze visitor data yourself, i.e. on your own server.

The installation is as easy as that of WordPress:

  1. Download the ZIP file and unpack it

  2. Copy the unpacked folder to your favorite server (with MySQL and PHP)

  3. Start the installation script and set up Piwik on the server

  4. Paste a short code snippet into every page that should be tracked. As an alternative, there already is a ready-made plugin for WordPress.[/en]


[en]

ZIS is an electronic manual containing information about a great number of questionnaires used to measure attitudes and behaviours in the social sciences. The program supports MS Windows 3.x, Windows 95, 98, 2000, NT, XP, and 7 (32 and 64 bit).

[/en]


[en]

Using f4, you can easily create a transcript of your audio or video recordings. f4 can be controlled via the keyboard (even while working in MS Word), can slow down playback speed, can automatically jump back a few seconds, and can (automatically) insert time stamps/text blocks. f4 is available for MS Windows and Mac OS X.

[/en]


[en]

Biplot visualizes the distribution of data within a two-dimensional coordinate system. In contrast to the usual visualization, Biplot shows how many data points are located at the same spot. Additionally, a filter variable can be defined in order to color certain values. Biplot runs on any operating system that supports Java.

[/en]