Team Members: Ben A., Fred F., Shafie G., Daniel J.
Program(s): SAS, R
Model(s): Logistic Regression, PCA, Clustering
Over the past few years, growth in the eSports industry has been stunning. According to an article from ESPN, 205 million people watched
or played eSports in 2014. Our group set out to assess the viability of an Electronic Sports (commonly referred to as eSports) themed bar
with the goal of designing a physical location that caters to people who watch or participate in eSports. This industry is growing rapidly
and is not adequately served by the traditional sports bar, which focuses on mainstream sports such as football and basketball. Because this
is a burgeoning industry, there are few competitors in the Charlotte, NC area. We identified two potential rivals, both of which offer a retro-themed gaming experience complete
with playable arcade cabinets and vintage games. The first thing we noticed about these potential competitors is that neither focuses on modern
game consoles (Xbox One, PlayStation 4, etc.). Additionally,
neither location provides options for customers who are interested in eSports, whether viewing or playing online with other people. The product
we are proposing is a bar that targets people interested in eSports, specifically viewing eSports tournaments or playing multiplayer
online games. The eSports bar will provide space for people to socialize while enjoying beverages in an environment tailored to viewing eSports.
Our marketing objectives were to assess the viability of an eSports bar near the university, hone the business model, and identify advertising and
partnership channels. There are four phases to solving our marketing problem: the first three are exploratory, and the last seeks to
define the marketing opportunity.
Data Collection
Our method of data collection was survey responses. We designed a 21-question survey aimed at local students and the surrounding
university community. The survey was built with the help of SurveyMonkey, an online survey development service. After several rounds of
refining the questions and layout, we posted the survey to several university social media pages, including Facebook and Twitter. With a
low response rate, we took to the Student Union and conducted the survey in person, flagging down potential respondents and having them sit at one of the
provided computers to complete the survey in return for some candy. While offering the electronic version of the survey, we would often
have a team member set up in another location within the Student Union and offer a paper-based copy. Analyzing the distribution of our
survey respondents shows a moderate bias toward the university community. We have no precise demographic information to compare with non-student
respondents, who make up less than 15% of our total.
Principal Component Analysis
We performed principal component analysis on the survey questions to get a sense of their orthogonality and to provide
another means of identifying the underlying characteristics of our data set that account for its variance. R was used for the analysis, with
the prcomp function processing the input data frame to produce principal components. The inputs were scaled to prevent dimensions with higher magnitudes
from dominating the results. The analysis produced a set of vectors in the survey space that form a new basis for the data, ordered by the
proportion of the set's variance each accounts for. The first component accounted for 13.6% of the variance, and subsequent components decayed as
expected. We set the tolerance, or minimum proportion of variance/standard deviation, at .06, as at that point the explanatory strength had started to level off. This
left us with 8 components, which together represented nearly half (48.9%) of the total variance in the set. A visual inspection of the components revealed
several important metrics that scaled together and were weighted to produce the new basis. Interpreting the principal components was challenging,
but the results are shown below:
Component | Label | Descriptor |
---|---|---|
PC1 | Core PC eSports fans | Hours played, PC gaming, First Person Shooter games, eSports interest, young age, daytime classes |
PC2 | Youthfulness, tending female | Undergrads, little money, low employment, mobile gaming |
PC3 | Console Gaming | Xbox, FPS games, some more money, evening classes |
PC4 | Conventional sports focus | Playing sports, watching sports, going to sports bars |
PC5 | RTS PC playing | RTS, Dota, PC |
PC6 | Casual college girl | Mall, restaurants, gender, cocktails |
PC7 | Grad student commuting | Evening classes, off campus |
PC8 | Casual gaming | Anti-console, anti-sports, board games |
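The workflow above (scale the inputs, decompose, keep components above a variance tolerance) can be sketched in Python with scikit-learn as an analogue to R's prcomp. The data here is synthetic, with one latent factor standing in for the real 21-question responses, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 21))          # stand-in for 200 responses x 21 questions
X[:, :5] += rng.normal(size=(200, 1))   # one latent factor loading on 5 questions

X_scaled = StandardScaler().fit_transform(X)  # scale so no question dominates
pca = PCA().fit(X_scaled)

# Keep components until the explanatory strength levels off
# (the write-up used a tolerance of .06).
keep = pca.explained_variance_ratio_ >= 0.06
n_components = int(keep.sum())
cum_var = float(pca.explained_variance_ratio_[:n_components].sum())
print(n_components, round(cum_var, 3))
```

With real survey data, the retained loadings (pca.components_) are what get inspected and labeled, as in the table above.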
Coming Soon!
Team Members: Christy C., Rajah C., Lorenzo M, Abdullah M., Ryan W.
Program(s): Python, R, Hadoop, Tableau, Gephi
Model(s): Topic Modeling, Graph analysis, Clustering
We analyzed 3M’s US patent portfolio relative to seven competitors using topic modeling, k-means clustering and network analysis. The dataset includes about
33,000 patents for the eight companies. Relative to the selected competitors, 3M has a competitive advantage in areas like Stock Materials, Synthetic Resins,
Optical Systems, and Adhesives. Using cosine similarity on the class distribution by company, 3M's patent portfolio is most like those of Bostik, Dow, and DuPont, which are
predominantly Synthetic Materials and Chemistry companies. General Electric and Siemens focus more on Energy and Data & Processing patents; however, these companies
compete with 3M in areas like Surgery and Stock Materials patents.
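The cosine-similarity comparison works on each company's vector of patent counts per class; a minimal sketch, with invented class labels and counts, looks like this:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classes: [Synthetic Resins, Adhesives, Energy, Data & Processing] -- illustrative counts
portfolios = {
    "3M":  [400, 350, 60, 90],
    "Dow": [380, 300, 80, 40],
    "GE":  [50, 30, 500, 420],
}
sims = {c: cosine_similarity(portfolios["3M"], v)
        for c, v in portfolios.items() if c != "3M"}
closest = max(sims, key=sims.get)
print(closest)  # with these illustrative counts, Dow is most similar
```

Because cosine similarity compares the shape of the class distribution rather than total volume, a small company with the same patent mix as 3M still scores as highly similar.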
Next, we used Topic Modeling (Latent Dirichlet Allocation or LDA) to identify five topics in the patents’ abstracts. We label the five topics as: Synthetic Materials,
Chemistry, Energy, Electrical and Data & Processing. We then use k-means clustering on the topic probabilities to create five patent clusters, each corresponding to
an LDA topic. Most of 3M's patents (75%) fall into the Synthetic Materials and Chemistry clusters.
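The LDA-then-k-means pipeline can be sketched on a toy corpus; the real analysis ran on roughly 33,000 patent abstracts with five topics, while the miniature version below uses three topics and invented abstracts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

abstracts = [
    "adhesive composition resin polymer bonding",
    "abrasive article resin coated backing",
    "battery cell energy storage electrode",
    "optical film display light diffuser",
    "data signal processing circuit memory",
    "chemical composition solvent reaction catalyst",
]
counts = CountVectorizer().fit_transform(abstracts)

n_topics = 3  # five in the real study; fewer for this toy corpus
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic probabilities

# Cluster documents on their topic distributions, one cluster per topic
labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(doc_topics)
print(labels)
```

Clustering on the topic probabilities (rather than raw word counts) is what lets each k-means cluster line up with an LDA topic, as described above.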
Finally, we examined shared patent-title bigrams to analyze the relationships between patents using network analysis. Many of 3M's patents share ideas and cross-pollinate
to create new, disruptive innovations. Synthetic Materials and Chemistry patents exhibit the strongest cross-pollination, as they share many bigrams in classes like
Compositions, Adhesives, Abrasives, and Synthetic Resins. On the other hand, 3M's Electrical and Data & Processing patents are largely disconnected from the network.
3M has a competitive disadvantage in these patents, as GE, Honeywell, and Siemens dominate the market. We also extended our analysis to competitors' Synthetic Materials
and Chemistry patents and examined 3M's research collaborations with an assignee network.
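The shared-bigram linkage can be illustrated with a few invented titles: two patents are connected when their titles share at least one word bigram, and the resulting edge list is what feeds the network analysis.

```python
from itertools import combinations

def bigrams(title):
    """Set of adjacent word pairs in a title."""
    words = title.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}

# Invented patent titles, keyed by a made-up patent id
titles = {
    "P1": "pressure sensitive adhesive composition",
    "P2": "curable adhesive composition for optical film",
    "P3": "abrasive article with structured coating",
    "P4": "optical film with structured coating",
}

# Edge between two patents when their titles share a bigram
edges = [(p, q) for p, q in combinations(titles, 2)
         if bigrams(titles[p]) & bigrams(titles[q])]
print(edges)  # P1-P2 via "adhesive composition", P2-P4 via "optical film", P3-P4 via "structured coating"
```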
What Does All That Mean??
While 3M still enjoys a dominant position in Synthetic Materials, the external Dow/DuPont merger seriously threatens its market position, while internal initiatives
from CEO Inge Thulin to increase patent quantity have not yielded the desired results. Our research uncovered an objective way to measure the value of a
patent: cross-pollination. Similar to friends in a social network, we used Betweenness as a network measure of cross-pollination. GE seems to have mastered
cross-pollination and enjoys an enviable position in the patent world; 3M has only a toe-hold in the world of cross-pollinators.
We recommend cultivating 3M's cross-pollination to beat competitors at their own game. Now that we know how to measure cross-pollination, 3M can both reward the best
in-house inventors and snap up the best independent (freelance) inventors. Extending the cross-pollination measure to companies gives 3M a way to acquire small companies
that cross-pollinate, extending its innovative reach. For foreign or large companies, 3M should form strategic alliances, again using cross-pollination as a yardstick.
Cross-pollination will create more patents, which in turn will bring more unique products to the market, creating value for customers. Diverting one-tenth of the R&D
Budget to cross-pollination could help this strategy succeed without increasing expenditures. We would spend the same dollars, but in a more focused, targeted fashion.
A cross-pollination strategy should set Thulin’s innovation vision aright and ensure his long tenure at the helm. Never has 3M had such a powerful, data-driven tool
with which to drive innovation in their patent program.
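Betweenness as a cross-pollination measure can be illustrated with a small brute-force computation on a toy graph (a real analysis would use a tool like Gephi, as listed above): a node that sits on the shortest paths between many others scores highest, just as a cross-pollinating patent bridges otherwise separate areas.

```python
from itertools import combinations
from collections import deque

# Toy graph: "hub" bridges four otherwise-disconnected nodes
graph = {
    "A": ["hub"], "B": ["hub"], "C": ["hub"], "D": ["hub"],
    "hub": ["A", "B", "C", "D"],
}

def shortest_paths(g, s, t):
    """All shortest simple paths from s to t (BFS enumeration)."""
    paths, best = [], None
    queue = deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            continue
        node = path[-1]
        if node == t:
            best = len(path)
            paths.append(path)
            continue
        for nb in g[node]:
            if nb not in path:
                queue.append(path + [nb])
    return paths

def betweenness(g):
    """Unnormalized betweenness: fraction of shortest paths passing through each node."""
    score = {v: 0.0 for v in g}
    for s, t in combinations(g, 2):
        paths = shortest_paths(g, s, t)
        for p in paths:
            for v in p[1:-1]:
                score[v] += 1 / len(paths)
    return score

print(betweenness(graph))  # "hub" lies on every path between the leaves
```

This brute-force version is only feasible for tiny graphs; at the scale of a patent network, Brandes' algorithm (as implemented in standard network libraries) is the practical choice.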
Coming Soon!
Team Members: Null
Program(s): R
Model(s): Clustering
For more detail visit my github
The purpose of this project was to look at the demographic profile of the people who vote for Donald Trump. This led me to create clusters of demographic profiles and see which cluster votes for Donald Trump more compared to the other clusters. The dataset contains primary results, demographic data, and a codebook. (Note: The data contains results from primary races as of March 2016, which is why every state is not represented.) Several other datasets were created during the project to examine the numbers more accurately. In the first phase, I focused solely on the primary_results.csv file, creating separate datasets to look at counties won, states won, and party ratios, all of which relied heavily on the fraction_votes variable. In the second phase, I was concerned more with the demographic data and needed to merge the two datasets together.
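The merge step can be sketched in Python with pandas (the project itself used R); the FIPS codes, vote fractions, and income figures below are invented for illustration.

```python
import pandas as pd

# Invented stand-ins for primary_results.csv and the demographic file,
# joined on a shared county FIPS code
results = pd.DataFrame({
    "fips": [1001, 1001, 1003],
    "candidate": ["Donald Trump", "Ted Cruz", "Donald Trump"],
    "fraction_votes": [0.47, 0.32, 0.55],
})
demographics = pd.DataFrame({
    "fips": [1001, 1003],
    "median_income": [51000, 57000],
})

# Trump's vote share per county, alongside that county's demographics
trump = results[results["candidate"] == "Donald Trump"]
merged = trump.merge(demographics, on="fips", how="inner")
print(merged[["fips", "fraction_votes", "median_income"]])
```

Once results and demographics sit in one table, the demographic columns can be clustered and each cluster's average fraction_votes for Trump compared.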
Team Members: Weiwei G., Yang L., Sara M.
Program(s): SAS, R, Python
Model(s): Logistic Regression, Clustering, Neural Network, Decision Tree, Random Forests
Human Immunodeficiency Virus (HIV) is a virus spreading around the world. Per the National Institute of Allergy and Infectious
Diseases (NIAID), nearly 37 million people worldwide have HIV, including 1.2 million living with HIV in the United States. However, a
large portion are unaware of their HIV status. To decrease the transmission of the disease and improve women's health,
Microsoft announced the Women's Health Risk Assessment competition, funded by the Bill & Melinda Gates Foundation, which looks for
machine learning solutions. The focus of the competition is on underdeveloped areas of the world, including most of
Africa and India. In association with the World Health Organization (WHO), they developed a survey that reached 9,000 women in clinics,
with approximately 1,000 subjects in each region. The purpose of this study is to help clinics target those at risk of the disease and plan
health care budgets more efficiently so that more women can be treated in underdeveloped countries. The problem is a binary classification problem: low or
high risk of HIV. As for expectations, researchers want to see which factors are most strongly associated with HIV infection
among women ages 15-44.
To build a relatively accurate model, we needed to find out which features matter most. To do this properly, we fit a
random forest model on our dataset. The random forest implementation in Python's scikit-learn package offers a way to see the importance of each
independent variable relative to the dependent variable. The most important features revealed by this analysis are modcon, tribe, district,
babydoc, labordeliv, geo, segment, debut, usecondom, and hivknow. Using these variables, among others, we developed several different
models, including logistic regression, a decision tree, random forests, and a neural network. The takeaway from this, along with model creation
and validation, is that individuals with access to modern technologies (computers, radios, irrigation, etc.) who use modern contraception and/or
condoms have a lower risk of HIV. Other factors that can radically influence the risk are religion, education, literacy, and
income.
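The feature-importance step can be sketched with scikit-learn. The data below is synthetic: only the first feature actually drives the label, so it should rank first; the feature names are a subset of the survey variables named above, attached purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)   # label depends only on feature 0

# Illustrative names only; the real survey had many more variables
feature_names = ["modcon", "tribe", "district", "babydoc", "hivknow"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by their impurity-based importance
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # the informative feature ranks first
```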
We compared the models using mean squared error (MSE) on validation data to select the best model for the project. Because we used
variable clustering for predictor selection prior to the logistic regression, decision tree, and neural network, we used SAS Enterprise Miner
to compare these three models, and the program selected the neural network as the best model, with an MSE of 0.12. However, compared
with the MSEs generated using cross-validation for ridge regression and random forest, we believe that the random forest is the best model
for predicting the HIV subgroup in this project, with an MSE of 0.07.
Finally, our group wanted to experiment with some form of unsupervised learning. The dataset lends itself well to segmentation,
especially considering that it contains a myriad of health, demographic, and socio-economic features. The clustering analysis was done
in R using the k-means approach. After consulting the within-cluster sum of squares "elbow" plot, k was set to 3.
Looking at the means of each cluster, it becomes clear that there are differences among them.
Cluster 1: Educated Young Women - This cluster has the youngest average age of the three, 18. It contains the
lowest percentage of respondents who answered yes to ever_had_sex and ever_been_pregnant, which leads to the smallest share of
members who currently have a child. Women in this cluster are also educated: 31% are currently in school, they are literate, and they
are the most educated of the three clusters. Lowest risk of HIV infection; mostly Christian or Muslim.
Cluster 2: Sexually Active Mid-20s Women - As the name says, this cluster's average age is 26, and it contains the highest
percentage of women who have had sex, been pregnant, and had multiple partners. It has the lowest percentage (6%) in school, which could be
explained by members being older and out of school by now. They are still literate and moderately educated, have the highest hivknow status, and are
the most likely to be married and/or have children. Highest risk of HIV infection; mostly Christian or Muslim.
Cluster 3: Other - This cluster is all over the place; there is no tidy way to describe it. A couple of things to note:
members are mostly Muslim and Hindu, and they are the least educated. However, they are more likely to own a cooker or fridge compared
to the other clusters.
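The elbow selection can be sketched in Python (the project used R): fit k-means for a range of k and look where the within-cluster sum of squares (inertia) stops dropping sharply. Three well-separated synthetic blobs stand in for the survey data here, so the elbow lands at k=3 by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 10], [20, 0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])  # 3 synthetic blobs

# Within-cluster sum of squares for k = 1..6
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
       for k in range(1, 7)}

# The drop from k=2 to k=3 is large; from k=3 onward it levels off,
# so the elbow sits at k=3 for this data.
print(wss)
```

In R, the equivalent quantity is tot.withinss from kmeans(), plotted against k.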
Coming Soon!
Team Members: Marcia P.
Program(s): R, Python, PySpark, Hadoop
Model(s): Clustering
For more detail visit my github
This project was perhaps the most fun and the most difficult of all the projects that I completed during my time at university. It was my first venture into Spark and into coding a script to run autonomously from a single command. The domain was also interesting: my teammate was, for all intents and purposes, an expert in the field of patents, having performed numerous other projects on the subject. The reasoning behind this assignment was that in 2013, the USPTO started transitioning from a legacy patent classification system (the US Patent Classification System, or USPC) to the new Cooperative Patent Classification (CPC) system.