Census-based cluster sample in India

Anurag Ajay, CIMMYT

Introduction

In this chapter we show how one might approach census-based cluster sampling. The actual implementation will depend on which data are available and accessible in a country. In this example, we develop a cluster sample for India.

This approach was developed by CSISA, in collaboration with the Krishi Vigyan Kendra (KVK), the district-level extension wing of the Indian Council of Agricultural Research (ICAR). The samples are used for a survey of production practices in wheat and rice.

To capture as much as possible of the variation between farmers, the sample of respondents should be representative of the entire study area. The district (there are about 640 districts in India) is used as the main survey unit. In the first stage, villages are selected within each district by random sampling, with the probability of selection proportional to the number of households in the village. In the second stage, households are selected at random within each selected village. More households are sampled than are needed, because some households cannot be found, some may not want to respond, and some have no farming activities.

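The target was a sample of 7 households in each of 30 villages per district. To allow for the losses described above, the code below draws more than that: 40 villages, and 15 house numbers per village.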
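
As a quick check of the numbers involved, the sketch below compares the target sample with the larger sample drawn in the code later in this chapter (40 villages, 15 house numbers per village). The variable names are illustrative only.

```r
# Target sample per district (from the text above)
target_villages <- 30
target_hh <- 7
# Sample sizes used in the code later in this chapter
drawn_villages <- 40
drawn_hh <- 15

target_n <- target_villages * target_hh  # intended respondents per district
drawn_n <- drawn_villages * drawn_hh     # house numbers on the oversampled list
c(target = target_n, drawn = drawn_n)
```

The difference between the two numbers is the reserve available for households that cannot be found, do not respond, or do not farm.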

Get data

Lists of villages by district can be downloaded from the website of the 2011 [Population Census of India](http://censusindia.gov.in/2011-Common/CensusData2011.html). Here we use a file with the villages in Pashchim Champaran district in the state of Bihar.

x <- read.csv("PC_2011.csv", stringsAsFactors = FALSE)
head(x)
##   State District Subdistt Town.Village        Level               Name   TRU
## 1    10      203        0            0     DISTRICT Pashchim Champaran Total
## 2    10      203        0            0     DISTRICT Pashchim Champaran Rural
## 3    10      203        0            0     DISTRICT Pashchim Champaran Urban
## 4    10      203     1013            0 SUB-DISTRICT             Sidhaw Total
## 5    10      203     1013            0 SUB-DISTRICT             Sidhaw Rural
## 6    10      203     1013            0 SUB-DISTRICT             Sidhaw Urban
##    No_HH
## 1 710461
## 2 637354
## 3  73107
## 4  55436
## 5  55436
## 6      0
dim(x)
## [1] 1570    8
table(x$Level)
##
##     DISTRICT SUB-DISTRICT         TOWN      VILLAGE         WARD
##            3           54            8         1365          140

The table shows that there are three records at the district level (rural, urban and total population of Pashchim Champaran), 54 at the sub-district level (thus there are 54/3 = 18 sub-districts), 1365 at the village level, and so on.

I - Population weighted sample of villages

We are interested in the rural villages.

villages <- x[x$Level == "VILLAGE" & x$TRU == "Rural", ]
dim(villages)
## [1] 1365    8

That is the same number as we saw for villages above. So all villages are rural.
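
A cross-tabulation of Level against TRU makes this kind of check explicit. The sketch below uses a small made-up data frame (`toy`) with the same column names as the census file, because the real file is not reproduced here. The counts are invented.

```r
# Made-up records with the same columns as the census file (illustration only)
toy <- data.frame(
  Level = c("DISTRICT", "DISTRICT", "DISTRICT", "VILLAGE", "VILLAGE", "TOWN", "WARD"),
  TRU   = c("Total", "Rural", "Urban", "Rural", "Rural", "Urban", "Urban"),
  No_HH = c(700, 500, 200, 300, 200, 200, 200),
  stringsAsFactors = FALSE
)

# Rows are levels, columns are Rural / Total / Urban
tab <- table(toy$Level, toy$TRU)
tab

# TRUE if every village record is classified as rural
all_rural <- all(toy$TRU[toy$Level == "VILLAGE"] == "Rural")
all_rural
```

With the real data, the same two calls on `x` would show at a glance whether any village record is classified as urban.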

Now select 40 villages, with the probability of selection proportional to the number of households (No_HH)

set.seed(352020)
samplesize <- 40
i <- sample(nrow(villages), samplesize, prob=villages$No_HH)
sort(i)
##  [1]    9   56  223  230  235  375  491  561  614  629  655  662  715  819  824
## [16]  829  849  897  905  910  947  948  970 1019 1022 1066 1075 1096 1101 1103
## [31] 1129 1189 1192 1224 1260 1269 1281 1294 1305 1324
sel_villages <- villages[i, ]
sort(sel_villages$Name)
##  [1] "Anjua"          "Balua"          "Bankatwa"       "Banu Chhapra"
##  [5] "Bhana Chak"     "Bhawanipur"     "Bheriharwa"     "Chanainbandh"
##  [9] "Chuhari"        "Dharhwa"        "Dudhaura"       "Dumaria"
## [13] "Ekderwa"        "Jaitiya"        "Kataha"         "Khora Parsa"
## [17] "Lagunaha"       "Machharganwa"   "Majhariya"      "Marhia"
## [21] "Mathiya"        "Matkota"        "Meghwal"        "Mehura"
## [25] "Nautanwa"       "Pachrukhia"     "Pipra Pakri"    "Puraina Gosain"
## [29] "Rampurwa"       "Sabiya khurd"   "Semri"          "Sirinagar"
## [33] "Siswania"       "Soharia"        "Taulaha"        "Thakraha"
## [37] "Thakurtola"     "Tola Parbatia"  "Tola Utimpanre" "Turhapatti"
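
To see what the `prob` argument does, the sketch below repeats a weighted draw many times on a made-up set of five villages and compares how often each one is selected. The data and the names `toy` and `draw_once` are hypothetical. Note that when sampling without replacement, `sample()` applies the weights sequentially to the items that remain, so the inclusion probabilities are only approximately proportional to size. If exact probability-proportional-to-size selection matters, a dedicated survey sampling package may be a better choice.

```r
# Made-up villages with very different household counts (illustration only)
toy <- data.frame(Name = c("A", "B", "C", "D", "E"),
                  No_HH = c(50, 100, 200, 400, 800),
                  stringsAsFactors = FALSE)

set.seed(1)
# One weighted draw of 2 villages, returned as row indices
draw_once <- function() sample(nrow(toy), 2, prob = toy$No_HH)

# Repeat the draw and compute the share of draws that include each village
nrep <- 2000
draws <- replicate(nrep, draw_once())
freq <- tabulate(as.vector(draws), nbins = nrow(toy)) / nrep
data.frame(toy, share_selected = round(freq, 2))
```

Villages with more households should show up in a larger share of the draws, which is the intended effect of weighting by `No_HH`.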

Replace the sub-district codes with their names. First get a data.frame with the unique sub-district codes and names

subdist <- unique(x[x$Level == "SUB-DISTRICT", c("Subdistt", "Name")])
colnames(subdist)[2] <- "Subdistrict"
head(subdist)
##     Subdistt  Subdistrict
## 4       1013       Sidhaw
## 166     1014     Ramnagar
## 318     1015      Gaunaha
## 477     1016    Mainatanr
## 573     1017 Narkatiaganj
## 750     1018      Lauriya

Merge this data.frame with the selected villages

sel_villages <- merge(sel_villages, subdist, by="Subdistt")
head(sel_villages)
##   Subdistt State District Town.Village   Level       Name   TRU No_HH
## 1     1013    10      203       216047 VILLAGE   Dudhaura Rural   312
## 2     1013    10      203       215998 VILLAGE    Soharia Rural   588
## 3     1014    10      203       216252 VILLAGE    Mathiya Rural  1016
## 4     1014    10      203       216238 VILLAGE   Bankatwa Rural   509
## 5     1014    10      203       216246 VILLAGE    Meghwal Rural   777
## 6     1015    10      203       216414 VILLAGE Pachrukhia Rural   109
##   Subdistrict
## 1      Sidhaw
## 2      Sidhaw
## 3    Ramnagar
## 4    Ramnagar
## 5    Ramnagar
## 6     Gaunaha
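
By default `merge()` sorts the result by the key column, so the rows no longer follow the order of the sample. If keeping the original order matters, a lookup with `match()` is one alternative. The sketch below uses three of the sub-district codes shown above and made-up village names; `subdist_toy` and `sel_toy` are hypothetical objects.

```r
# Small lookup table and a made-up selection (illustration only)
subdist_toy <- data.frame(Subdistt = c(1013, 1014, 1015),
                          Subdistrict = c("Sidhaw", "Ramnagar", "Gaunaha"),
                          stringsAsFactors = FALSE)
sel_toy <- data.frame(Subdistt = c(1015, 1013, 1015, 1014),
                      Name = c("v1", "v2", "v3", "v4"),
                      stringsAsFactors = FALSE)

# Look up the sub-district name of each village; the row order is unchanged
k <- match(sel_toy$Subdistt, subdist_toy$Subdistt)
sel_toy$Subdistrict <- subdist_toy$Subdistrict[k]

# A code without a match would give NA, so check for that
stopifnot(!anyNA(sel_toy$Subdistrict))
sel_toy
```

Either approach works here, because the villages are sorted again in the next step.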

Let’s keep the variables we want, in the order we want them.

sel_villages <- sel_villages[, c("State", "District", "Subdistt", "Town.Village", "Subdistrict", "Name", "No_HH")]
sel_villages <- sel_villages[order(sel_villages$Subdistrict, sel_villages$Name), ]

II - Household selection

Now that we have the villages, we want to select households. The website of the Bihar State Election Commission provides voter lists for villages, which form a good basis for constructing a sampling frame.

A complication is that the voter lists are available as pdf files. For our district, the names are in two files. We can use the pdftools package to read pdf files. To illustrate that:

#library(pdftools)
voterfile <- "Bariyarpur-1.pdf"
# read the file
s <- pdftools::pdf_text(voterfile)
class(s)
## [1] "character"
length(s)
## [1] 20

s is a character vector of length 20. Each element holds the text of one page of the pdf file. The trick is now to extract the information we need. The code below searches the text for two patterns: the label that precedes the house number and the label that precedes the name.

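    # label patterns as used to search the extracted text; they precede
    # the house number and the name
    housepattern <- "गतह सपखयच : "
    namepattern <- "ननरचरचक कच नचम : "
    # split the pages into lines; "\r?\n" matches Windows and Unix line endings
    ss <- trimws(unlist(strsplit(s, "\r?\n")))
    # lines that start with the name label, with the label split off
    i <- grep(paste0("^", namepattern), ss)
    si <- trimws(unlist(strsplit(ss[i], namepattern)))
    # lines that start with the house label, with the label split off
    j <- grep(paste0("^", housepattern), ss)
    sj <- trimws(unlist(strsplit(ss[j], housepattern)))
    # this should be TRUE
    (length(si) == length(sj))
## [1] TRUE
    # combine, and drop the empty strings that strsplit leaves before each label
    x <- cbind(sj, si)
    x <- x[x[,1] != "", , drop = FALSE]
    colnames(x) <- c("household", "name")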

We can use these patterns to extract the data we need from the two files, and combine the results.
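
One way to reuse the extraction for more than one file is to wrap it in a function. The sketch below is an assumption about how that could look, not the code used for the survey: `parse_voters` is a hypothetical helper, the labels `"House No : "` and `"Name : "` stand in for the real patterns, and the page text is made up. Like the code above, it assumes that every name line has a matching house line in the same order.

```r
# Hypothetical helper: extract house numbers and names from page text.
# 'pages' is a character vector such as the one returned by pdftools::pdf_text()
parse_voters <- function(pages, housepattern, namepattern) {
  hp <- trimws(housepattern)
  np <- trimws(namepattern)
  # split the pages into lines; "\r?\n" handles Windows and Unix line endings
  lines <- trimws(unlist(strsplit(pages, "\r?\n")))
  hs <- lines[startsWith(lines, hp)]
  nm <- lines[startsWith(lines, np)]
  # each name line is assumed to have a matching house line
  stopifnot(length(hs) == length(nm))
  # remove the label from the start of each line
  out <- cbind(household = trimws(substring(hs, nchar(hp) + 1)),
               name = trimws(substring(nm, nchar(np) + 1)))
  # drop records without a house number
  out[out[, "household"] != "", , drop = FALSE]
}

# Made-up text for two pages (illustration only)
pages_toy <- c("Name : Asha Devi\nHouse No : 12\nName : Ravi Kumar\nHouse No : 12\n",
               "Name : Sita Kumari\r\nHouse No : 7\r\n")
voters <- parse_voters(pages_toy, "House No : ", "Name : ")
voters

# With real files, the results could be combined along these lines
# (the second file name is a placeholder):
# files <- c("Bariyarpur-1.pdf", "second-file.pdf")
# x <- do.call(rbind, lapply(files, function(f)
#   parse_voters(pdftools::pdf_text(f), housepattern, namepattern)))
```

Returning a matrix with the columns `household` and `name` keeps the result compatible with the steps that follow.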

hn <- unique(x)
head(hn)
##      household name

Randomly select 15 house numbers (if possible, link one member name to each selected house number)

uhh <- unique(hn[, "household"])
head(uhh)
## character(0)
#hns <- sample(uhh, 15)
#hns

Get the selected households and the names of their members

#x <- hn[hn[, "household"] %in% hns, ]
#y <- tapply(x[,2], x[,1], function(i) paste(i, collapse=", "))
#z <- cbind(house=names(y), names=as.vector(y))
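
The selection steps can be tried out on a small made-up matrix with the same shape as `hn`. This is a sketch under that assumption; the objects ending in `_toy` and the names in them are invented, and only 2 houses are drawn instead of 15.

```r
# Made-up household/name matrix in the same shape as 'hn' (illustration only)
hn_toy <- cbind(household = c("1", "1", "2", "3", "3", "3"),
                name = c("Asha", "Ravi", "Sita", "Mohan", "Gita", "Anil"))

set.seed(42)
uhh_toy <- unique(hn_toy[, "household"])
# sample() fails when asked for more items than exist, so cap the sample size
n <- min(2, length(uhh_toy))
hns_toy <- sample(uhh_toy, n)

# Keep the rows of the selected houses and collapse the names per house
sel <- hn_toy[hn_toy[, "household"] %in% hns_toy, , drop = FALSE]
y <- tapply(sel[, "name"], sel[, "household"],
            function(i) paste(i, collapse = ", "))
z_toy <- cbind(house = names(y), names = as.vector(y))
z_toy
```

Capping the sample size with `min()` avoids an error in villages where the voter list has fewer house numbers than the number requested.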

Some of the results

#knitr::kable(z[1:5,])

Write a .csv file with two columns: house number and member names

#write.csv(z, "selection.csv", row.names = FALSE)