Senior Year. The final stretch of college. Now is the time to get your cap and gown, take all sorts of pictures and do your last bit of partying. Senioritis is in full swing and it's time to relax, right?
I had the privilege of working at Anheuser-Busch InBev in their Tech Explore department out of the Newark Brewery. I grew up outside of Newark, and it was surreal going to work at a place I'd always seen looming in the distance in the south Newark skies.
Recurring issues include driver error, PPE being misused or altogether neglected, and issues of spacing where the driver is operating. I gave these results to the director of Tech Explore, and he assured me that these are the target areas where AB InBev will look to apply new technologies. I also recommended being wary of operators' actions during the months of June and July and at the end of the week.
My final senior semester included distance races, conferences in the White House, independent data analysis projects (which we'll take a look at later), independent studying of three languages, a full course load and an internship.
I had the usual intern duties: getting lunch, managing partners working with the Tech Explore team, and general office upkeep. My real purpose there, however, was to work on a dataset that I would present as my capstone project.
I met Scott Pemberton, the brewmaster and plant manager at the Labatt Brewing Company, and my supervisor presented me with a problem. AB InBev is a company that prides itself on safety, and in the United States the company overall has a good standing. Abroad, the company has some issues with forklift collisions. The company had already invested money in RFID technology and wanted to understand where to apply it; this is where I came in.
I remembered a technique we discussed in my Unstructured Data Analysis class for comparing documents using TFIDF analysis. For those unfamiliar with the concept, TFIDF (term frequency-inverse document frequency) weighs how often a term appears in one document against how many documents in the collection contain it. The more often a term appears within a document, the stronger that term is; however, if the term appears frequently in all documents, its weight is smaller.
My first step was acquiring a dataset. I got in touch with the woman who runs the Credit 360 reporting system for AB InBev. She walked me through their incident reporting database, and I was able to acquire a CSV of incidents from the last four years. The fields included were:
- Reference Number: Unique ID
- Location: Brewery or distribution site name
- Date
- Incident reported by
- Description
- Is this incident Severe or High Priority
All these fields were strings when imported into Python. Speaking of Python, let's go ahead and import the packages I'll need.
Preprocessing starts with renaming some fields for ease of access.
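A minimal sketch of the renaming step in plain Python, treating each CSV row as a dictionary. The long names are the fields listed above; the short names and the sample row are hypothetical, chosen only for illustration.

```python
# Long field names from the incident export mapped to shorter ones.
# The short names are assumptions, not the ones used in the original analysis.
RENAMES = {
    "Reference Number": "ref",
    "Location": "location",
    "Date": "date",
    "Incident reported by": "reporter",
    "Description": "description",
    "Is this incident Severe or High Priority": "severe",
}


def rename_fields(row):
    """Return a copy of one row with the shorter field names applied."""
    return {RENAMES.get(key, key): value for key, value in row.items()}


# Made-up sample row, not taken from the real dataset.
sample = {
    "Reference Number": "A-001",
    "Location": "Brewery A",
    "Description": "Forklift clipped a pallet",
}
renamed = rename_fields(sample)
```

Keys that are not in the mapping pass through unchanged, so extra columns in the export would not break the step.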

Another task in this project is to calculate the probability that an incident will occur given the day of the week or the location. For this purpose we need to convert the date column, which contains numerical data, into a datetime object from which Python can derive the weekday.
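A small sketch of the date conversion using only the standard library. The `%Y%m%d` format is an assumption about how the numeric dates are stored; the real export may use a different layout.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_of(date_text, fmt="%Y%m%d"):
    """Parse a date string and return (datetime, weekday name).

    The default format is a guess; pass the real one as `fmt`.
    """
    parsed = datetime.strptime(date_text, fmt)
    return parsed, WEEKDAYS[parsed.weekday()]


# Made-up date for illustration.
parsed, weekday = weekday_of("20160617")
month = parsed.month
```

The weekday and the month are the two time features used later when looking at when incidents are most likely.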

The next step is to split the strings in the description column so that each word can be taken into account for its frequency. To avoid counting the same word twice under different capitalizations, we normalize the words to lower case.
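One way to sketch the split-and-lowercase step. Stripping punctuation with a regular expression is an addition of mine, so that "forklift." and "forklift" count as the same word; the sample sentence is made up.

```python
import re


def tokenize(description):
    """Lowercase a description and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", description.lower())


tokens = tokenize("Forklift driver reversed; FORKLIFT hit the racking.")
```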

Taking a look at the dataset in its raw form, you notice the recurrence of the word "forklift." Since all these incidents pertain to forklifts, it's expected that the word would appear in just about every incident report. So I used a word frequency distribution to find the words that are "noisy": they contribute nothing to the analysis and throw off the TFIDF stat.
We also need to take out the stopwords, which we have set to the variable "stop." Stopwords are words like "the," "is," and "a."
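A rough sketch of the filtering, under a few assumptions: `stop` here is a tiny hand-written set standing in for a full stopword list, the reports are made up, and the 90% cutoff for calling a word "noisy" is a hypothetical threshold rather than the one used in the project.

```python
from collections import Counter

# Tiny stand-in for a real stopword list.
stop = {"the", "is", "a", "and", "of", "to", "in"}

# Made-up tokenized incident reports.
reports = [
    ["the", "forklift", "driver", "reversed", "into", "a", "pallet"],
    ["forklift", "operator", "is", "missing", "ppe"],
    ["the", "forklift", "hit", "the", "racking"],
]

# In how many reports does each word appear at least once?
doc_freq = Counter(word for report in reports for word in set(report))

# Words present in nearly every report carry no signal (hypothetical cutoff).
NOISE_SHARE = 0.9
noisy = {word for word, n in doc_freq.items() if n / len(reports) >= NOISE_SHARE}

# Drop stopwords and noisy words, keeping the original word order.
cleaned = [
    [word for word in report if word not in stop and word not in noisy]
    for report in reports
]
```

With this sample, "forklift" is the only word flagged as noisy, which matches the observation above.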

A data structure is needed to represent the presence of a word in the incidents reported at each location. Several locations may have the same words in their incident reports, so we need to attach each word to its location. Note: we don't want a list of distinct words. We aren't looking only for the presence of a word at a location; frequency is taken into account as well, so a word like "driver" can appear more than once in an incident.
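One simple structure that fits this description is a flat list of (location, word) pairs, with one pair per occurrence. This is a sketch with made-up locations and words, not necessarily the structure used in the original code.

```python
# Made-up cleaned reports: (location, list of words in one incident).
incidents = [
    ("Brewery A", ["driver", "reversed", "driver", "distracted"]),
    ("Brewery B", ["ppe", "missing"]),
    ("Brewery A", ["pallet", "fell"]),
]

# One (location, word) pair per occurrence; repeats are kept on purpose
# so that frequency, not just presence, survives.
pairs = [(location, word) for location, words in incidents for word in words]
```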

We'll count how often each word occurs, grouped by location.
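A sketch of the grouped count using a `Counter` per location. The pairs are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up (location, word) pairs, one per occurrence.
pairs = [
    ("Brewery A", "driver"),
    ("Brewery A", "driver"),
    ("Brewery A", "pallet"),
    ("Brewery B", "driver"),
    ("Brewery B", "ppe"),
]

# location -> Counter of word -> number of occurrences at that location
word_counts = defaultdict(Counter)
for location, word in pairs:
    word_counts[location][word] += 1
```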

Then we count the total number of words used across all incidents at each location.
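The per-location total is just the sum of that location's word counts. A sketch with made-up numbers:

```python
# Made-up word counts per location.
word_counts = {
    "Brewery A": {"driver": 2, "pallet": 1},
    "Brewery B": {"ppe": 1, "missing": 1, "driver": 1},
}

# Total number of (non-distinct) words reported at each location.
total_words = {
    location: sum(counts.values())
    for location, counts in word_counts.items()
}
```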

There are two steps to calculating TFIDF. First we calculate the term frequency: the number of times a term appears in a location's incident reports, relative to the total number of words reported there.
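A sketch of term frequency, treating all reports from one location as a single "document." Dividing by the location's total word count is the common normalization and is assumed here; the counts are made up.

```python
def term_frequency(word_counts):
    """Return location -> word -> share of that location's words."""
    tf = {}
    for location, counts in word_counts.items():
        total = sum(counts.values())
        tf[location] = {word: n / total for word, n in counts.items()}
    return tf


# Made-up counts for illustration.
tf = term_frequency({
    "Brewery A": {"driver": 3, "pallet": 1},
    "Brewery B": {"ppe": 2},
})
```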

Inverse document frequency measures how rare a term is across documents: the fewer documents it appears in, the higher the value. This shows how distinctive a word is to a document, or in this case, to the incidents reported at a brewery.
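A sketch of inverse document frequency using the textbook form, log(N / df), where N is the number of locations and df is the number of locations whose reports contain the word. Libraries often use smoothed variants, so treat the exact formula as an assumption; the counts are made up.

```python
import math
from collections import Counter


def inverse_document_frequency(word_counts):
    """Return word -> log(number of locations / locations containing the word)."""
    n_locations = len(word_counts)
    doc_freq = Counter(word for counts in word_counts.values() for word in counts)
    return {word: math.log(n_locations / n) for word, n in doc_freq.items()}


# Made-up counts: "driver" shows up everywhere, "pallet" at one location only.
idf = inverse_document_frequency({
    "Brewery A": {"driver": 3, "pallet": 1},
    "Brewery B": {"driver": 1, "ppe": 3},
})
```

A word found at every location gets an IDF of zero, which is why a term like "driver" can drop out if it is reported everywhere.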

The TFIDF stat is the product of the term frequency and the inverse document frequency, which together show how important a term is to a given brewery relative to how often it occurs across all AB InBev breweries.
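Putting the two pieces together, here is a compact sketch of the full statistic with each location treated as one document. The counts are made up, and the unsmoothed log(N / df) form of IDF is an assumption.

```python
import math
from collections import Counter

# Made-up word counts per location.
word_counts = {
    "Brewery A": Counter({"driver": 3, "pallet": 1}),
    "Brewery B": Counter({"driver": 1, "ppe": 3}),
}

n_locations = len(word_counts)
# Number of locations whose reports contain each word.
doc_freq = Counter(word for counts in word_counts.values() for word in counts)

tfidf = {}
for location, counts in word_counts.items():
    total = sum(counts.values())
    tfidf[location] = {
        # term frequency * inverse document frequency
        word: (n / total) * math.log(n_locations / doc_freq[word])
        for word, n in counts.items()
    }

# Highest-scoring term per location.
top_terms = {
    location: max(scores, key=scores.get)
    for location, scores in tfidf.items()
}
```

In this toy data "driver" scores zero at both locations because it appears everywhere, so the top terms are the ones specific to each site.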
For analysis of the locations causing the most problems, I built a probabilistic model in Microsoft Azure to look at which conditions lead to a greater probability of an incident at each location. Simply put, I took the day of the week, the month of the year and the location as inputs, used Safety Hazard as my target variable, and ran the data through a logistic regression algorithm. The scored probabilities are as follows:
I have visualized the TFIDF for the locations with the highest probabilities: the Lusaka breweries in Zambia and the Ibhayi brewery in South Africa. For the purposes of my tasking I have also included the Crestron brewery.
What is happening at these breweries seems to be gaps in discipline. People are excited that it's the weekend and that it is summertime. They are calling their partners and planning their activities; they are hot and relax their PPE to feel more comfortable. They are hasty in their work and think about how they can complete it faster instead of how to complete it more safely. What ABI Africa needs to do is reiterate safety practices and have their employees know that no matter how long a job takes, getting through it safely will ensure everyone has fun when they aren't working.



