Data Science Replication Study

Team A

Data Science for Business

Team A:

Mohamad Abdulla

Anvith Amin

Manjunath Mallikarjun Kendhuli

Papers Reviewed

  • Paper 1: Predicting Employee Attrition (IBM)
  • Paper 2: Data Analytics for Optimizing and Predicting Employee Performance
  • Paper 3: Migration and Innovation: Learning from Patent and Inventor Data
  • Challenges faced during the project

Selected Paper

“The Political Economy of Green Industrial Policy”

Juhász et al., 2022

  • Used Global Trade Alert (GTA) database
  • Three key figures showing green policy trends in G20 countries

Problems Faced

  • Unclear objectives at the beginning
  • Extremely large and complex datasets
  • GitHub deployment issues

Replication of Figure 1

  • Title: Green Industrial Policy Activity in G20 Countries (2010–2022)
  • What it shows:
    • Annual green policy activity for Middle-income vs. High-income countries
    • Indexed to 2010–2012 average = 100
    • High-income line is scaled (divided by 5) for visual comparison
  • Axes:
    • Left Y-axis: Middle-income index
    • Right Y-axis: High-income index (scaled)

Replication Figure

Fig 1: Green Industrial Policy Activity in G20 Countries, 2010–2022

Code Logic Summary

  • Step 1: Load and clean raw data
    • Import original IP_G20.dta file
    • Filter valid rows and deduplicate by MeasureID–Year–Country
  • Step 2: Identify green policies
    • Use keywords like climate, emission, renewable to flag green measures
  • Step 3: Add income group classification
    • Load World Bank Excel data
    • Reshape to long format and convert fiscal to calendar years
    • Merge with green policy data by country and year
  • Step 4: Standardize income group labels
    • Map H to “High-income”, LM/UM to “Middle-income”
    • Remove unmatched or missing classifications
  • Step 5: Count policies per year
    • Group by year and income group
    • Count number of green policies announced
  • Step 6: Compute 2010–2012 baseline
    • Calculate average policy count in 2010–2012 for each group
  • Step 7: Index calculation
    • Create index: (policy_count / baseline_avg) * 100
    • Expresses annual activity relative to baseline (baseline = 100)
  • Step 8: Visualization
    • Plot both income groups on one chart
    • Scale high-income index by /5 on secondary Y-axis for comparison

R Code for Replication

# -------------------------------
# STEP 1: Load Required Libraries
# -------------------------------
if (!require(pacman)) install.packages("pacman")
pacman::p_unload(all)
pacman::p_load(tidyverse, haven, janitor)

# -------------------------------
# STEP 2: Load the Raw Dataset
# -------------------------------

# Define the correct raw URL of the .dta file
url <- "https://raw.githubusercontent.com/anvitho07/Data-science-presentation-final/main/IP_G20.dta"

# Create a temporary local file to download the dataset
temp_file <- tempfile(fileext = ".dta")

# Download the dataset from GitHub
download.file(url, temp_file, mode = "wb")  # use "wb" for binary files

# Read the dataset
raw_data <- read_dta(temp_file)

# -------------------------------
# STEP 3: Filter Valid Policies and Deduplicate
#         (one row per policy-year-country combination)
# -------------------------------
green_data <- raw_data %>%
  filter(!is.na(MeasureDescriptionTokens)) %>%
  filter(!is.na(AnnouncedYear)) %>%
  distinct(MeasureID, AnnouncedYear, CountryStd, .keep_all = TRUE)

# -------------------------------
# STEP 4: Classify Countries as High-income vs Middle-income
#         (same static grouping as in Stata script)
# -------------------------------
high_income <- c("Australia", "Canada", "France", "Germany", "Italy", "Japan",
                 "Republic of Korea", "United Kingdom", "United States", "European Union")

green_data <- green_data %>%
  mutate(IncomeGroup = ifelse(CountryStd %in% high_income, "High-income", "Middle-income"))

# -------------------------------
# STEP 5: Count Green Policies Per Year & Group
# -------------------------------
counts <- green_data %>%
  group_by(AnnouncedYear, IncomeGroup) %>%
  summarise(policy_count = n(), .groups = "drop")

# -------------------------------
# STEP 6: Compute the 2010–2012 Average (Baseline Index)
# -------------------------------
baseline <- counts %>%
  filter(AnnouncedYear %in% 2010:2012) %>%
  group_by(IncomeGroup) %>%
  summarise(baseline_avg = mean(policy_count), .groups = "drop")

# -------------------------------
# STEP 7: Calculate Index as Percent of 2010–2012 Baseline
#         (Multiplied by 1000 to match paper's visual scaling)
# -------------------------------
final <- counts %>%
  left_join(baseline, by = "IncomeGroup") %>%
  mutate(index = (policy_count / baseline_avg) * 1000)

# -------------------------------
# STEP 8: Separate Data for Visualization
# -------------------------------
mid_df <- final %>% filter(IncomeGroup == "Middle-income")
high_df <- final %>% filter(IncomeGroup == "High-income")

# -------------------------------
# STEP 9: Plot Final Figure
# -------------------------------
ggplot() +
  geom_line(data = mid_df, aes(x = AnnouncedYear, y = index, color = "Middle-income"), size = 1.2) +
  geom_line(data = high_df, aes(x = AnnouncedYear, y = index, color = "High-income"), size = 1.2) +
  scale_y_continuous(
    name = "Total middle-income green industrial policy activity\n(percent of 2010–2012)",
    sec.axis = sec_axis(~ . * (14000 / max(mid_df$index)), name = "Total high-income green industrial policy activity\n(percent of 2010–2012)")
  ) +
  scale_color_manual(values = c("Middle-income" = "purple", "High-income" = "skyblue"), name = NULL) +
  scale_x_continuous(breaks = c(2010, 2013, 2016, 2019, 2022)) +
  labs(
    title = "Green Industrial Policy Activity in G20 Countries, 2010–2022",
    subtitle = "Annual Count of Policies Relative to 2010–2012 Average",
    x = "Year"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.title.y.left = element_text(margin = margin(r = 10)),
    axis.title.y.right = element_text(margin = margin(l = 10)),
    legend.position = "top"
  )

Output of Figure 1 Replication

Replication of Figure 2

  • Title: Top Five Green Industrial Policy Instruments across G20 Economies by Income Group (2010–2022)
  • What it shows:
    • Distribution of green industrial policies by instrument type (e.g. financial grant, state loan) -Comparison between High-income and Middle-income G20 countries -Focuses only on the top five most frequent instruments within each group -Measures are shown as shares of total green policy activity, normalized within each group
  • Axes:
    • Left Y-axis: Income group (High-income / Middle-income)
    • Right Y-axis: Share of green policies by instrument type

Replication Figure

Figure 2: Top five industrial policy intruments

Code Logic Summary – Figure 2

  • Step 1: Load and prepare data

    • Import IP_G20.dta (policy dataset) and wb.xlsx (income classification)
    • Standardize column names using clean_names()
  • Step 2: Filter relevant policies

    • Keep policies from 2010–2022 with non-missing descriptions
    • These represent green or environmentally relevant measures
  • Step 3: Assign income group

    • Use a fixed list to classify countries as High-income or Middle-income
    • Add this classification to each policy record
  • Step 4: Identify top 5 policy instruments

    • Count frequency of each policy tool (measure_type)
    • Select the top 5 most common types separately for each income group
  • Step 5: Compute usage shares

    • Within each group, calculate how much each of the top 5 instruments was used
    • Expressed as a share of total green policies in that group (0.0 to 1.0)
  • Step 6: Visualize with stacked bar chart

    • Plot horizontal bars showing instrument composition by income group
    • Use coord_flip() to flip axes and number_format() to show decimals
  • Step 7: Display output

    • Render the plot with minimal styling and a grouped color legend

R Code for Replication

#  0) Load packages
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(
  haven,      # for read_dta()
  readxl,     # for read_xlsx()
  dplyr,      
  tidyr,
  ggplot2,
  scales,     # for number_format()
  janitor     # for clean_names()
)
# -------------------------------
#  1) Ingest and clean-up 
# -------------------------------

ip <- read_dta("E:/github/data science -presentation final/Data science presentation/Data-science-presentation-final/IP_G20.dta") %>%
  clean_names()

# Read and clean the income group Excel file
wb <- read_xlsx("E:/github/data science -presentation final/Data science presentation/Data-science-presentation-final/wb.xlsx") %>%
  clean_names()

# -------------------------------
#  2) Filter green policies (2010–2022) and assign income group 
# -------------------------------
high_income_countries <- c(
  "Australia", "Canada", "France", "Germany", "Italy",
  "Japan", "Republic of Korea", "United Kingdom", "United States", "European Union"
)

green <- ip %>%
  filter(
    !is.na(measure_description_tokens),
    announced_year >= 2010,
    announced_year <= 2022
  ) %>%
  mutate(
    income_group = if_else(
      country_std %in% high_income_countries,
      "High-income",
      "Middle-income"
    )
  )
# -------------------------------
# 3) Identify top 5 instrument types per income group 
# -------------------------------
top5 <- green %>%
  count(income_group, measure_type, name = "n") %>%
  group_by(income_group) %>%
  slice_max(n, n = 5) %>%
  pull(measure_type) %>%
  unique()
# -------------------------------
#  4) Compute shares of top 5 instruments 
# -------------------------------
plot_df <- green %>%
  filter(measure_type %in% top5) %>%
  count(income_group, measure_type, name = "n") %>%
  group_by(income_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
# -------------------------------
#  5) Build the horizontal stacked bar plot
# -------------------------------
p <- ggplot(plot_df, aes(
    x = income_group,
    y = share,
    fill = measure_type
  )) +
  geom_col(position = "fill", width = 0.6) +
  coord_flip() +
  scale_y_continuous(
    breaks = seq(0, 1, by = 0.2),
    labels = number_format(accuracy = 0.1),
    expand = expansion(mult = c(0, 0))
  ) +
  labs(
    title = "Top Five Green Industrial Policy Instruments Across G20 Economies\nby Income Group, 2010–2022",
    x = NULL,
    y = "Share of green industrial policy by instrument type",
    fill = "Instrument Type"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title      = element_text(face = "bold", size = 16, hjust = 0),
    axis.text.y     = element_text(size = 12),
    panel.grid.minor = element_blank(),
    legend.position = "bottom",
    legend.key.size = unit(0.7, "lines")
  )
# -------------------------------
# ️ 6) Display the plot
# -------------------------------
print(p)

Output of Figure 2 Replication

Challenges faces

  • Uploaded local data files to GitHub and linked using raw URLs so others could run the code.

  • Replaced percent_format() with number_format() to show axis labels as decimals.

  • Renamed output to index.html so GitHub Pages would display the updated version.

Future Work with this Replication

  • Clean and verify all country names
    • Match them correctly with World Bank data to fix missing values
  • Improve keyword filtering
    • Refine how we identify green policies using better or more complete keywords
  • Match Stata version exactly
    • Compare our R code outputs with the original Stata graphs for full accuracy
  • Check missing data
    • Investigate why some years or countries have fewer policies than expected
  • Automate income group assignment
    • Instead of manual grouping, use official classification files from the World Bank