Census-based cluster sample in India
Anurag Ajay, CIMMYT
Introduction
In this chapter we show how one might approach census-based cluster sampling. The actual implementation will depend on the data that are available (and accessible) in a country. In this example, we develop a cluster sample for India.
This approach was developed by CSISA, in collaboration with the Krishi Vigyan Kendra (KVK), the district-level extension wing of the Indian Council of Agricultural Research (ICAR). The samples are used for a survey of production practices in wheat and rice.
To capture as much as possible of the variation between farmers, the sample (of respondents) should be representative of the entire study area. The district (there are about 640 districts in India) is used as the main survey unit. In the first stage, villages are selected within each district, using a random sampling approach in which the probability of being selected is proportional to the number of households in a village. In the second stage, a number of households are selected at random within each selected village. More households are sampled than are necessary, because it may not be possible to find all households, or they may not want to respond. Also, some households have no farming activities.
With that in mind, the target was to select 7 households in each of 30 villages in each district.
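To illustrate what selection with probability proportional to size means, here is a minimal sketch with made-up data. The data.frame `toy`, its village names and its household counts are hypothetical; only `sample()` and its `prob` argument are taken from the approach used in this chapter.

```r
# hypothetical villages with their number of households
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  No_HH = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# probability of being picked in a single draw
p <- toy$No_HH / sum(toy$No_HH)

# draw a single village many times to see the selection frequencies
set.seed(1)
draws <- sample(toy$name, 10000, replace = TRUE, prob = toy$No_HH)
freq <- as.vector(table(factor(draws, levels = toy$name))) / 10000

# the observed frequencies should be close to p
round(cbind(p, freq), 2)
```

Note that when `sample()` draws without replacement, the weights are applied to the remaining items after each draw, so the inclusion probabilities are only approximately proportional to size.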
Get data
Lists of villages by district can be downloaded from the website of the 2011 [Population Census of India](http://censusindia.gov.in/2011-Common/CensusData2011.html). Here we use a file that has the villages in Pashchim Champaran district in Bihar State.
x <- read.csv("PC_2011.csv", stringsAsFactors = FALSE)
head(x)
## State District Subdistt Town.Village Level Name TRU
## 1 10 203 0 0 DISTRICT Pashchim Champaran Total
## 2 10 203 0 0 DISTRICT Pashchim Champaran Rural
## 3 10 203 0 0 DISTRICT Pashchim Champaran Urban
## 4 10 203 1013 0 SUB-DISTRICT Sidhaw Total
## 5 10 203 1013 0 SUB-DISTRICT Sidhaw Rural
## 6 10 203 1013 0 SUB-DISTRICT Sidhaw Urban
## No_HH
## 1 710461
## 2 637354
## 3 73107
## 4 55436
## 5 55436
## 6 0
dim(x)
## [1] 1570 8
table(x$Level)
##
## DISTRICT SUB-DISTRICT TOWN VILLAGE WARD
## 3 54 8 1365 140
The table shows that there are three records for the District level (Rural, Urban and Total population of Pashchim Champaran), 54 for the Sub-district level (thus there are 54/3=18 subdistricts), 1365 for the village level, etc.
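The number of sub-districts can also be counted directly instead of dividing by three. This is a minimal sketch on a made-up miniature of the census table; `toy` is hypothetical and only mimics the column names of `x`.

```r
# hypothetical miniature of the census table: 2 sub-districts, 3 records each
toy <- data.frame(
  Subdistt = c(1013, 1013, 1013, 1014, 1014, 1014),
  Level = "SUB-DISTRICT",
  Name = rep(c("Sidhaw", "Ramnagar"), each = 3),
  TRU = rep(c("Total", "Rural", "Urban"), 2),
  stringsAsFactors = FALSE
)

sub <- toy[toy$Level == "SUB-DISTRICT", ]

# number of distinct sub-district codes
n_sub <- length(unique(sub$Subdistt))
n_sub

# cross-check: three records (Total, Rural, Urban) per sub-district
nrow(sub) / 3
```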
I - Population-weighted sample of villages
We are interested in the rural villages.
villages <- x[x$Level == "VILLAGE" & x$TRU == "Rural", ]
dim(villages)
## [1] 1365 8
That is the same number as we saw for villages above. So all villages are rural.
Now select 40 villages, with a selection probability that is proportional to the number of households.
set.seed(352020)
samplesize <- 40
i <- sample(nrow(villages), samplesize, prob=villages$No_HH)
sort(i)
## [1] 9 56 223 230 235 375 491 561 614 629 655 662 715 819 824
## [16] 829 849 897 905 910 947 948 970 1019 1022 1066 1075 1096 1101 1103
## [31] 1129 1189 1192 1224 1260 1269 1281 1294 1305 1324
sel_villages <- villages[i, ]
sort(sel_villages$Name)
## [1] "Anjua" "Balua" "Bankatwa" "Banu Chhapra"
## [5] "Bhana Chak" "Bhawanipur" "Bheriharwa" "Chanainbandh"
## [9] "Chuhari" "Dharhwa" "Dudhaura" "Dumaria"
## [13] "Ekderwa" "Jaitiya" "Kataha" "Khora Parsa"
## [17] "Lagunaha" "Machharganwa" "Majhariya" "Marhia"
## [21] "Mathiya" "Matkota" "Meghwal" "Mehura"
## [25] "Nautanwa" "Pachrukhia" "Pipra Pakri" "Puraina Gosain"
## [29] "Rampurwa" "Sabiya khurd" "Semri" "Sirinagar"
## [33] "Siswania" "Soharia" "Taulaha" "Thakraha"
## [37] "Thakurtola" "Tola Parbatia" "Tola Utimpanre" "Turhapatti"
Replace the sub-district codes with their names. First get a data.frame with the unique sub-district codes and names.
subdist <- unique(x[x$Level == "SUB-DISTRICT", c("Subdistt", "Name")])
colnames(subdist)[2] <- "Subdistrict"
head(subdist)
## Subdistt Subdistrict
## 4 1013 Sidhaw
## 166 1014 Ramnagar
## 318 1015 Gaunaha
## 477 1016 Mainatanr
## 573 1017 Narkatiaganj
## 750 1018 Lauriya
Merge this data.frame with the selected villages.
sel_villages <- merge(sel_villages, subdist, by="Subdistt")
head(sel_villages)
## Subdistt State District Town.Village Level Name TRU No_HH
## 1 1013 10 203 216047 VILLAGE Dudhaura Rural 312
## 2 1013 10 203 215998 VILLAGE Soharia Rural 588
## 3 1014 10 203 216252 VILLAGE Mathiya Rural 1016
## 4 1014 10 203 216238 VILLAGE Bankatwa Rural 509
## 5 1014 10 203 216246 VILLAGE Meghwal Rural 777
## 6 1015 10 203 216414 VILLAGE Pachrukhia Rural 109
## Subdistrict
## 1 Sidhaw
## 2 Sidhaw
## 3 Ramnagar
## 4 Ramnagar
## 5 Ramnagar
## 6 Gaunaha
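Two properties of `merge()` are visible in the output above: the key column is moved to the front, and the rows are sorted by the key. The following sketch shows this on made-up data; the data.frames `vil` and `sub` are hypothetical and the values are only illustrative.

```r
# hypothetical selected villages (not sorted by sub-district code)
vil <- data.frame(
  Name = c("Mathiya", "Dudhaura", "Meghwal"),
  Subdistt = c(1014, 1013, 1014),
  No_HH = c(1016, 312, 777),
  stringsAsFactors = FALSE
)

# hypothetical lookup table with sub-district codes and names
sub <- data.frame(
  Subdistt = c(1013, 1014, 1015),
  Subdistrict = c("Sidhaw", "Ramnagar", "Gaunaha"),
  stringsAsFactors = FALSE
)

# by default, only rows with a matching key in both tables are kept
m <- merge(vil, sub, by = "Subdistt")
m
```

By default, a village whose code is missing from the lookup table would be dropped silently. Using `all.x = TRUE` keeps such rows, with `NA` for the name, which makes it easier to spot a mismatch.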
Let’s keep the variables we want, in the order we want them.
sel_villages <- sel_villages[, c("State", "District", "Subdistt", "Town.Village", "Subdistrict", "Name", "No_HH")]
sel_villages <- sel_villages[order(sel_villages$Subdistrict, sel_villages$Name), ]
II - Household selection
Now that we have the villages, we want to select households. The website of the Bihar State Election Commission provides voter lists of villages. These lists form a good basis for constructing a sampling frame.
A complication is that the voter lists are available as pdf files. For our district, the names are in two files. We can use the pdftools package to read pdf files. To illustrate that:
#library(pdftools)
voterfile <- "Bariyarpur-1.pdf"
# read the file
s <- pdftools::pdf_text(voterfile)
class(s)
## [1] "character"
length(s)
## [1] 20
s is a character vector of length 20. Each element corresponds to a page in the pdf file. The trick is now to extract the information we need. The code below searches for patterns in the text (the house number and the family name).
housepattern = "गतह सपखयच : "
namepattern = "ननरचरचक कच नचम : "
# split pages into lines (line endings may be "\r\n" or "\n", depending on the platform)
ss <- trimws(unlist(strsplit(s, "\r?\n")))
i <- grep(paste0("^", namepattern), ss)
si <- trimws(unlist(strsplit(ss[i], namepattern)))
j <- grep(paste0("^", housepattern), ss)
sj <- trimws(unlist(strsplit(ss[j], housepattern)))
# this should be TRUE
(length(si) == length(sj))
## [1] TRUE
x <- cbind(sj, si)
x <- x[x[,1] != "", ]
colnames(x) <- c("household", "name")
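The same idea can be tried out on plain text without a pdf file. In this minimal sketch the page text and the two English patterns are made up and stand in for the patterns above. It also uses `sub()` to strip the pattern, which is a substitute for the `strsplit()` step used above, not the method of this chapter.

```r
# hypothetical page text; "House no : " and "Name : " are stand-in patterns
page <- paste(
  "Name : Asha", "House no : 12", "Age : 40",
  "Name : Ravi", "House no : 12", "Age : 43",
  "Name : Sita", "House no : 15",
  sep = "\n"
)
housepattern <- "House no : "
namepattern <- "Name : "

# split the page into lines
lines <- trimws(unlist(strsplit(page, "\r?\n")))

# keep the lines that start with a pattern, then strip the pattern
nm <- lines[grep(paste0("^", namepattern), lines)]
nm <- sub(paste0("^", namepattern), "", nm)
hh <- lines[grep(paste0("^", housepattern), lines)]
hh <- sub(paste0("^", housepattern), "", hh)

# every name should have a house number
stopifnot(length(nm) == length(hh))

toy <- cbind(household = hh, name = nm)
toy
```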
We can use these patterns to extract the data we need from the two files, and combine the results.
hn <- unique(x)
head(hn)
## household name
Randomly select 15 house numbers (and, if possible, link the name of at least one household member to each selected house number).
uhh <- unique(hn[, "household"])
head(uhh)
## character(0)
#hns <- sample(uhh, 15)
#hns
Get the selected households and the names of their members.
#x <- hn[hn[, "household"] %in% hns, ]
#y <- tapply(x[,2], x[,1], function(i) paste(i, collapse=", "))
#z <- cbind(house=names(y), names=as.vector(y))
Some of the results
#knitr::kable(z[1:5,])
Write a .csv file with two columns: the house number and the names of the household members.
#write.csv(z, "selection.csv", row.names = FALSE)
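Because the selection steps above are commented out, the following sketch runs them on a small made-up matrix. The household numbers and names are hypothetical, and only 2 households are drawn instead of 15; the object names follow the code above.

```r
# hypothetical household/name matrix, shaped like hn
hn <- cbind(
  household = c("1", "1", "2", "3", "3", "3", "4"),
  name = c("Asha", "Ravi", "Sita", "Mohan", "Gita", "Anil", "Rekha")
)

uhh <- unique(hn[, "household"])

# do not ask for more households than there are
set.seed(1)
n <- min(2, length(uhh))
hns <- sample(uhh, n)

# all members of the selected households
sel <- hn[hn[, "household"] %in% hns, , drop = FALSE]

# one row per household, with the member names collapsed into one string
y <- tapply(sel[, "name"], sel[, "household"],
            function(i) paste(i, collapse = ", "))
z <- cbind(house = names(y), names = as.vector(y))
z
```

Capping the sample size with `min()` avoids the error that `sample()` raises when more items are requested than are available, which can happen in a small village.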