Data exploration and transformations

Handling procedures for agronomic trials data are fully described in the document entitled “Data quality control procedures for data pertaining to the diagnostic trials”. 

That document covers data quality aspects and sources of error, and the data quality control process: checking for outliers and data-entry errors, data conversion, basic calculation of stover and grain yield, and checks for consistency and referential integrity. It is available for download here.

All the data are stored in a database first developed in FileMaker® and later migrated to a web-based database using MySQL®. A full description of the database is also available as a PDF here. The figure below shows the schema of this database.



Figure 2. Schema of the database used in AfSIS Objective 4

In the subsequent steps, we highlight the analysis methods with occasional reference to R scripts. For the specific code and details, see “Yield Analysis 11-2012.r”, available here. The scripts are commented throughout to indicate what each part does. They also include extensive data-formatting steps, so that data exported from the database is formatted directly for the specific analysis of interest. Most of the analysis is done on the difference in yield between a treatment and the control. Where needed, the scripts also calculate the average control and NPK treatment yield for each field, since these treatments were replicated two or three times per field.
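The per-field calculation described above can be sketched as follows. This is a minimal Python illustration, not an excerpt from the R script: the record layout (field, treatment, yield in t/ha), the treatment label "Control", and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical plot-level records: (field, treatment, grain yield in t/ha).
# Control and NPK can appear more than once per field because they were replicated.
records = [
    ("F1", "Control", 1.0),
    ("F1", "Control", 1.4),
    ("F1", "NPK", 3.0),
    ("F1", "NPK", 3.4),
    ("F1", "PK", 1.1),
    ("F2", "Control", 2.0),
    ("F2", "NPK", 3.5),
    ("F2", "PK", 2.6),
]


def field_treatment_means(rows):
    """Average replicated plots so each field has one yield per treatment."""
    grouped = defaultdict(list)
    for field, treatment, yield_t_ha in rows:
        grouped[(field, treatment)].append(yield_t_ha)
    return {key: mean(values) for key, values in grouped.items()}


def yield_differences(rows, control="Control"):
    """Return treatment-minus-control yield for every field and treatment."""
    means = field_treatment_means(rows)
    diffs = {}
    for (field, treatment), value in means.items():
        if treatment == control:
            continue
        control_mean = means.get((field, control))
        if control_mean is None:
            continue  # no control plot in this field, so no difference is defined
        diffs[(field, treatment)] = value - control_mean
    return diffs


differences = yield_differences(records)
for (field, treatment), diff in sorted(differences.items()):
    print(f"{field} {treatment}: {diff:+.2f} t/ha")
```

Averaging the replicates first gives each field a single control value, so every treatment in that field is compared against the same baseline.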

Boxplots are constructed for each site/season combination to view the spread of the data, gain a first impression of treatment effects, and identify data points that need further examination (see Figure 3). Outliers are retained in the analysis if there is evidence that the observations are real.
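As a rough companion to the visual check, the points a boxplot would draw beyond its whiskers can be listed numerically. The sketch below assumes the common 1.5 × IQR whisker rule and uses made-up yields; quartile conventions differ slightly between tools, so the flagged set may not match a given plotting function exactly. Flagged values are candidates for examination, not automatic removals.

```python
from statistics import quantiles


def flag_boxplot_outliers(values, whisker=1.5):
    """Return values lying beyond the boxplot whiskers (1.5 * IQR rule by default)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical grain yields (t/ha) for one treatment in one site/season.
yields = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.3, 9.5]
print(flag_boxplot_outliers(yields))  # values to check against field records
```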

For a given sentinel site (and for all sites combined), the difference between each treatment and the control is plotted against the control yield to check individual treatment responses at different control yields (Figure 4). As expected, in many of the sites the differences between treatment and control yields were relatively small in fields with very low or very high control yields.

Figure 3. The spread of grain yield in Nkhata Bay, Malawi, during the 2010-2011 cropping season.

Figure 4. Yield gain over the control treatment at different control yields in one sentinel site (Nkhata Bay, Malawi) during the 2010-2011 cropping season.

Further, plots of response against cumulative frequency show the variability in the yield difference from the control across all the fields within a site. They also show the proportion of fields in a sentinel site where a given treatment yields less or more than the corresponding control, or than any other benchmark of interest. Such a benchmark could be an economic minimum or an expected benefit-cost ratio. In Figure 5, for example, the PK treatment yields less than the control in 35% of the cases. For more complete site-by-site results and interpretation, please see “AfSIS Obj4 Analysis.pdf”.

Figure 5. Cumulative frequency plots of maize grain yield for Mbinga, Tanzania during the 2010-2011 cropping seasons.
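The reading taken from a cumulative frequency plot can also be computed directly. The following is a minimal sketch with invented differences for ten fields; the function names and the 0.5 t/ha benchmark are illustrative assumptions, not values from the trials.

```python
def cumulative_frequency(differences):
    """Pair each sorted yield difference with the percentage of fields at or below it."""
    ordered = sorted(differences)
    n = len(ordered)
    return [(value, 100.0 * rank / n) for rank, value in enumerate(ordered, start=1)]


def share_below(differences, benchmark=0.0):
    """Percentage of fields whose yield difference falls below the benchmark."""
    below = sum(1 for value in differences if value < benchmark)
    return 100.0 * below / len(differences)


# Hypothetical treatment-minus-control differences (t/ha) for ten fields.
pk_minus_control = [-0.6, -0.3, -0.1, 0.0, 0.2, 0.4, 0.5, 0.9, 1.2, 1.8]

for value, percent in cumulative_frequency(pk_minus_control):
    print(f"{value:+.1f} t/ha: {percent:.0f}% of fields at or below")

print(share_below(pk_minus_control))       # share of fields yielding less than the control
print(share_below(pk_minus_control, 0.5))  # share of fields below a 0.5 t/ha benchmark
```

With a benchmark of zero, the function returns the share of fields where the treatment did worse than the control; any other threshold, such as an economic minimum, can be passed in the same way.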

Data were checked for normality by constructing histograms (Figure 6). Where the data were not normally distributed, the log of the treatment:control yield ratio was used in the formal analysis.



Figure 6. Histograms of yield difference and log-transformed treatment:control ratio data for Pampaida, Nigeria.
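The transformation and the histogram check can be sketched as below. This is an illustration under stated assumptions: the paired yields are invented, the natural logarithm is assumed (the text does not say which base was used), and all yields are assumed to be positive, since the log ratio is undefined for a zero yield.

```python
import math


def log_ratios(treatment_yields, control_yields):
    """Natural log of the treatment:control ratio for paired, positive field yields."""
    if len(treatment_yields) != len(control_yields):
        raise ValueError("treatment and control yields must be paired by field")
    return [math.log(t / c) for t, c in zip(treatment_yields, control_yields)]


def histogram_counts(values, bins=5):
    """Count values in equal-width bins spanning min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a bin of its own.
        index = min(int((v - low) / width), bins - 1) if width > 0 else 0
        counts[index] += 1
    return counts


# Hypothetical paired yields (t/ha) for the same fields.
treatment = [1.5, 2.4, 3.0, 6.0, 1.2]
control = [1.0, 2.0, 2.0, 1.5, 1.5]

ratios = log_ratios(treatment, control)
print([round(r, 3) for r in ratios])
print(histogram_counts(ratios, bins=4))
```

On the log scale a doubling and a halving of yield sit at equal distances from zero, which is one reason the ratio often looks more symmetric than the raw difference.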



Author: Dr Job Kihara