Paper 2: Data Analytics for Optimizing and Predicting Employee Performance
Paper 3: Migration and Innovation: Learning from Patent and Inventor Data
Challenges faced during the project
Selected Paper
“The Political Economy of Green Industrial Policy”
Juhász et al., 2022
Used Global Trade Alert (GTA) database
Three key figures showing green policy trends in G20 countries
Problems Faced
Unclear objectives at the beginning
Extremely large and complex datasets
GitHub deployment issues
Replication of Figure 1
Title: Green Industrial Policy Activity in G20 Countries (2010–2022)
What it shows:
Annual green policy activity for Middle-income vs. High-income countries
Indexed to 2010–2012 average = 100
High-income line is scaled (divided by 5) for visual comparison
Axes:
Left Y-axis: Middle-income index
Right Y-axis: High-income index (scaled)
Replication Figure
Fig 1: Green Industrial Policy Activity in G20 Countries, 2010–2022
Code Logic Summary
Step 1: Load and clean raw data
Import original IP_G20.dta file
Filter valid rows and deduplicate by MeasureID–Year–Country
Step 2: Identify green policies
Use keywords like climate, emission, renewable to flag green measures
Step 3: Add income group classification
Load World Bank Excel data
Reshape to long format and convert fiscal to calendar years
Merge with green policy data by country and year
Step 4: Standardize income group labels
Map H to “High-income”, LM/UM to “Middle-income”
Remove unmatched or missing classifications
Step 5: Count policies per year
Group by year and income group
Count number of green policies announced
Step 6: Compute 2010–2012 baseline
Calculate average policy count in 2010–2012 for each group
Step 7: Index calculation
Create index: (policy_count / baseline_avg) * 100
Expresses annual activity relative to baseline (baseline = 100)
Step 8: Visualization
Plot both income groups on one chart
Scale high-income index by /5 on secondary Y-axis for comparison
R Code for Replication
# -------------------------------# STEP 1: Load Required Libraries# -------------------------------if (!require(pacman)) install.packages("pacman")pacman::p_unload(all)pacman::p_load(tidyverse, haven, janitor)# -------------------------------# STEP 2: Load the Raw Dataset# -------------------------------# Define the correct raw URL of the .dta fileurl <-"https://raw.githubusercontent.com/anvitho07/Data-science-presentation-final/main/IP_G20.dta"# Create a temporary local file to download the datasettemp_file <-tempfile(fileext =".dta")# Download the dataset from GitHubdownload.file(url, temp_file, mode ="wb") # use "wb" for binary files# Read the datasetraw_data <-read_dta(temp_file)# -------------------------------# STEP 3: Filter Valid Policies and Deduplicate# (one row per policy-year-country combination)# -------------------------------green_data <- raw_data %>%filter(!is.na(MeasureDescriptionTokens)) %>%filter(!is.na(AnnouncedYear)) %>%distinct(MeasureID, AnnouncedYear, CountryStd, .keep_all =TRUE)# -------------------------------# STEP 4: Classify Countries as High-income vs Middle-income# (same static grouping as in Stata script)# -------------------------------high_income <-c("Australia", "Canada", "France", "Germany", "Italy", "Japan","Republic of Korea", "United Kingdom", "United States", "European Union")green_data <- green_data %>%mutate(IncomeGroup =ifelse(CountryStd %in% high_income, "High-income", "Middle-income"))# -------------------------------# STEP 5: Count Green Policies Per Year & Group# -------------------------------counts <- green_data %>%group_by(AnnouncedYear, IncomeGroup) %>%summarise(policy_count =n(), .groups ="drop")# -------------------------------# STEP 6: Compute the 2010–2012 Average (Baseline Index)# -------------------------------baseline <- counts %>%filter(AnnouncedYear %in%2010:2012) %>%group_by(IncomeGroup) %>%summarise(baseline_avg =mean(policy_count), .groups ="drop")# -------------------------------# STEP 7: Calculate Index as Percent of 2010–2012 Baseline# (Multiplied by 1000 to match paper's visual scaling)# -------------------------------final <- counts %>%left_join(baseline, by ="IncomeGroup") %>%mutate(index = (policy_count / baseline_avg) *1000)# -------------------------------# STEP 8: Separate Data for Visualization# -------------------------------mid_df <- final %>%filter(IncomeGroup =="Middle-income")high_df <- final %>%filter(IncomeGroup =="High-income")# -------------------------------# STEP 9: Plot Final Figure# -------------------------------ggplot() +geom_line(data = mid_df, aes(x = AnnouncedYear, y = index, color ="Middle-income"), size =1.2) +geom_line(data = high_df, aes(x = AnnouncedYear, y = index, color ="High-income"), size =1.2) +scale_y_continuous(name ="Total middle-income green industrial policy activity\n(percent of 2010–2012)",sec.axis =sec_axis(~ . * (14000/max(mid_df$index)), name ="Total high-income green industrial policy activity\n(percent of 2010–2012)") ) +scale_color_manual(values =c("Middle-income"="purple", "High-income"="skyblue"), name =NULL) +scale_x_continuous(breaks =c(2010, 2013, 2016, 2019, 2022)) +labs(title ="Green Industrial Policy Activity in G20 Countries, 2010–2022",subtitle ="Annual Count of Policies Relative to 2010–2012 Average",x ="Year" ) +theme_minimal(base_size =13) +theme(axis.title.y.left =element_text(margin =margin(r =10)),axis.title.y.right =element_text(margin =margin(l =10)),legend.position ="top" )
Output of Figure 1 Replication
Replication of Figure 2
Title: Top Five Green Industrial Policy Instruments across G20 Economies by Income Group (2010–2022)
What it shows:
Distribution of green industrial policies by instrument type (e.g. financial grant, state loan) -Comparison between High-income and Middle-income G20 countries -Focuses only on the top five most frequent instruments within each group -Measures are shown as shares of total green policy activity, normalized within each group
Axes:
Left Y-axis: Income group (High-income / Middle-income)
Right Y-axis: Share of green policies by instrument type
Replication Figure
Figure 2: Top five industrial policy intruments
Code Logic Summary – Figure 2
Step 1: Load and prepare data
Import IP_G20.dta (policy dataset) and wb.xlsx (income classification)
Standardize column names using clean_names()
Step 2: Filter relevant policies
Keep policies from 2010–2022 with non-missing descriptions
These represent green or environmentally relevant measures
Step 3: Assign income group
Use a fixed list to classify countries as High-income or Middle-income
Add this classification to each policy record
Step 4: Identify top 5 policy instruments
Count frequency of each policy tool (measure_type)
Select the top 5 most common types separately for each income group
Step 5: Compute usage shares
Within each group, calculate how much each of the top 5 instruments was used
Expressed as a share of total green policies in that group (0.0 to 1.0)
Step 6: Visualize with stacked bar chart
Plot horizontal bars showing instrument composition by income group
Use coord_flip() to flip axes and number_format() to show decimals
Step 7: Display output
Render the plot with minimal styling and a grouped color legend
R Code for Replication
# 0) Load packagesif (!requireNamespace("pacman", quietly =TRUE)) install.packages("pacman")pacman::p_load( haven, # for read_dta() readxl, # for read_xlsx() dplyr, tidyr, ggplot2, scales, # for number_format() janitor # for clean_names())# -------------------------------# 1) Ingest and clean-up # -------------------------------ip <-read_dta("E:/github/data science -presentation final/Data science presentation/Data-science-presentation-final/IP_G20.dta") %>%clean_names()# Read and clean the income group Excel filewb <-read_xlsx("E:/github/data science -presentation final/Data science presentation/Data-science-presentation-final/wb.xlsx") %>%clean_names()# -------------------------------# 2) Filter green policies (2010–2022) and assign income group # -------------------------------high_income_countries <-c("Australia", "Canada", "France", "Germany", "Italy","Japan", "Republic of Korea", "United Kingdom", "United States", "European Union")green <- ip %>%filter(!is.na(measure_description_tokens), announced_year >=2010, announced_year <=2022 ) %>%mutate(income_group =if_else( country_std %in% high_income_countries,"High-income","Middle-income" ) )# -------------------------------# 3) Identify top 5 instrument types per income group # -------------------------------top5 <- green %>%count(income_group, measure_type, name ="n") %>%group_by(income_group) %>%slice_max(n, n =5) %>%pull(measure_type) %>%unique()# -------------------------------# 4) Compute shares of top 5 instruments # -------------------------------plot_df <- green %>%filter(measure_type %in% top5) %>%count(income_group, measure_type, name ="n") %>%group_by(income_group) %>%mutate(share = n /sum(n)) %>%ungroup()# -------------------------------# 5) Build the horizontal stacked bar plot# -------------------------------p <-ggplot(plot_df, aes(x = income_group,y = share,fill = measure_type )) +geom_col(position ="fill", width =0.6) +coord_flip() +scale_y_continuous(breaks =seq(0, 1, by =0.2),labels =number_format(accuracy =0.1),expand =expansion(mult =c(0, 0)) ) +labs(title ="Top Five Green Industrial Policy Instruments Across G20 Economies\nby Income Group, 2010–2022",x =NULL,y ="Share of green industrial policy by instrument type",fill ="Instrument Type" ) +theme_minimal(base_size =14) +theme(plot.title =element_text(face ="bold", size =16, hjust =0),axis.text.y =element_text(size =12),panel.grid.minor =element_blank(),legend.position ="bottom",legend.key.size =unit(0.7, "lines") )# -------------------------------# ️ 6) Display the plot# -------------------------------print(p)
Output of Figure 2 Replication
Challenges faces
Uploaded local data files to GitHub and linked using raw URLs so others could run the code.
Replaced percent_format() with number_format() to show axis labels as decimals.
Renamed output to index.html so GitHub Pages would display the updated version.
Future Work with this Replication
Clean and verify all country names
Match them correctly with World Bank data to fix missing values
Improve keyword filtering
Refine how we identify green policies using better or more complete keywords
Match Stata version exactly
Compare our R code outputs with the original Stata graphs for full accuracy
Check missing data
Investigate why some years or countries have fewer policies than expected
Automate income group assignment
Instead of manual grouping, use official classification files from the World Bank