Projects

Bar/Arcade Project

Team Members: Ben A., Fred F., Shafie G., Daniel J.
Program(s): SAS, R
Model(s): Logistic Regression, PCA, Clustering


Over the past few years, growth in the eSports industry has been stunning. Per an article written by ESPN, 205 million people watched or played eSports in 2014. Our group set out to assess the viability of a bar themed around electronic sports (commonly referred to as eSports), with the goal of designing a physical location that caters to people who watch or participate in eSports. This industry is growing immensely and will not be adequately served by the traditional sports bar, which focuses on mainstream sports such as football or basketball.

Since this is a burgeoning industry, there are few competitors in the Charlotte, NC area. There are two potential rivals, and both offer a retro-themed gaming experience complete with playable arcade cabinets and vintage games. The first thing we noticed about our potential competitors is that neither focuses on modern game consoles (Xbox One, PlayStation 4, etc.). Additionally, neither location caters to customers who are interested in eSports, whether viewing or playing online with other people.

The product we are proposing is a bar that targets people who are interested in eSports, specifically viewing eSports tournaments or playing multiplayer online games. The eSports bar will provide space for people to socialize while enjoying beverages in an environment tailored to viewing eSports. Our marketing objectives were to assess the viability of an eSports bar near the university, hone the business model, and identify advertising and partnership channels. There are four phases to solving our marketing problem. The first three are exploratory, and the last seeks to define the marketing opportunity.

Data Collection

Our method of data collection was a survey. We designed a 21-question survey aimed at local students and residents of the surrounding university area. The survey was built with SurveyMonkey, an online survey development service. After several rounds of revising the questions, layout, and design, we posted the survey to several university social media pages, including Facebook and Twitter. Because the response rate was low, we took to the Student Union and conducted the survey in person, flagging down potential respondents and having them sit at one of the provided computers to complete the survey in return for some candy. While the electronic version was being offered, a team member would often go to another location within the Student Union and offer a paper-based copy of the survey. Analyzing the distribution of our survey respondents shows a moderate bias towards the university community population. Non-student respondents make up less than 15% of the total, so we have no precise information with which to compare the two groups.

Principal Component Analysis

We performed principal component analysis (PCA) on the survey questions to get a sense of the orthogonality of the questions we asked, and to provide another means of identifying the underlying characteristics of our data set that account for its variance. R was used for the analysis, with the prcomp function processing the input data frame to produce principal components. The inputs were scaled to prevent dimensions with higher magnitudes from dominating the results. The PCA produced a set of vectors in the survey space that form a new basis for the data, with each axis in turn accounting for the greatest remaining proportion of the set’s variance. The first component accounted for 13.6% of the variance, and subsequent components decayed as expected. We set the tolerance, or percent variance/standard deviation, at .06, as at that point the explanatory strength had started to level off. This value left us with 8 components, which together represented nearly half (.489) of the total variance in the set. A visual inspection of the components revealed several important metrics that scaled together and were weighted to produce the new basis. The interpretation of the principal components was challenging, but the results are shown below:

Principal Components
Component | Label | Descriptor
PC1 | Core PC eSports fans | Hours played, PC gaming, First Person Shooter games, eSports interest, young age, daytime classes
PC2 | Youthfulness, tending female | Undergrads, little money, low employment, mobile gaming
PC3 | Console Gaming | Xbox, FPS games, some more money, evening classes
PC4 | Conventional sports focus | Playing sports, watching sports, going to sports bars
PC5 | RTS PC | Playing RTS, Dota, PC
PC6 | Casual college girl | Mall, restaurants, gender, cocktails
PC7 | Grad student commuting | Evening classes, off campus
PC8 | Casual gaming | Anti consoles, anti playing sports, board games

Once the components were identified, we applied the linear weights to each respondent’s answers to every survey question, producing a score on each component for each completed survey. As expected, the largest component was reliable in predicting interest in the target establishment. Interestingly, two other components produced statistically significant results with larger parameter estimates. PC3 and PC8 both produced positive results, and these PCs correspond to general measures of console gaming and casual gaming. This result is particularly encouraging because it suggests that we need not depend only on the hardcore PC gaming community to drive profit. Unsurprisingly, the only significant PC that had a negative impact on predicting interest was the traditional female bar hopping and shopping measure.
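The scaling, component extraction, and scoring steps described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the analysis itself used R's prcomp, while this sketch swaps in plain power iteration with deflation on the correlation matrix so that it needs only the standard library. The function names and the six made-up respondents are assumptions for the example.

```python
import math

def standardize(rows):
    """Center every column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def covariance(z):
    """Sample covariance of centered data (the correlation matrix if z is standardized)."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def principal_components(cov, k, iters=500):
    """Top-k (eigenvalue, unit eigenvector) pairs via power iteration with deflation."""
    p = len(cov)
    work = [row[:] for row in cov]
    found = []
    for _component in range(k):
        v = [1.0 / (j + 1) for j in range(p)]  # arbitrary starting vector
        for _step in range(iters):
            w = [sum(work[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * work[a][b] * v[b]
                         for a in range(p) for b in range(p))
        found.append((eigenvalue, v))
        for a in range(p):  # deflate: remove the component just found
            for b in range(p):
                work[a][b] -= eigenvalue * v[a] * v[b]
    return found

def component_scores(z, components):
    """Apply each component's linear weights to every standardized response."""
    return [[sum(x * w for x, w in zip(row, vec)) for _, vec in components]
            for row in z]

# Made-up answers from six respondents to three numeric questions.
responses = [[1, 2, 5], [2, 3, 3], [3, 5, 4], [4, 4, 1], [5, 6, 2], [6, 7, 1]]
z = standardize(responses)
cov = covariance(z)
components = principal_components(cov, k=3)
total_variance = sum(cov[j][j] for j in range(len(cov)))
explained = [eigenvalue / total_variance for eigenvalue, _ in components]
scores = component_scores(z, components)
print([round(share, 3) for share in explained])
```

Each row of scores holds one respondent's value on every component, which is the kind of input the interest model described above would take.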

Segmentation, Targeting, Positioning

The clustering analysis revealed three distinct customer segments. The first segment is young, male, aware of and interested in eSports, and has a social and gaming budget large enough to be interested in our proposed bar/arcade. The second segment is demographically similar to the first but made up of more casual gamers: they are not focused on eSports, but they are still interested in gaming and have a slightly lower social and gaming budget. Both segments would be good customers for our bar/arcade and would be the target of our marketing focus. The third segment is older, more female than male, and not interested in gaming. These folks may be interested in attending our bar for the social atmosphere, as they do enjoy restaurants, shopping, and going to the cinema with friends, but they are not interested in the gaming aspect of our bar/arcade.

Target marketing to the customer segments can take three different paths: direct, indirect, and social. Direct marketing reaches out to the desired customers themselves. This could include advertising signs or posters in dormitories or on campus, direct mailings to all (or selected) on-campus students, and coupons or invitations to specific events. Indirect marketing could include sponsorship of on-campus sporting events (signs inside the stadium, or advertisements on the back of admission tickets), advertising in campus publications (student newspapers, yearbook), or signs placed in locations students frequent (dining halls, snack bars, or the student union). Social media advertising can include strategies to draw attention to a Facebook page (attracting “likes”), sending Twitter messages on student group sites, or other social media advertising.

Positioning our product is a challenge, as the best customer segments are not of legal drinking age and therefore are not going to be our most profitable customers. We would need to pursue revenue streams different from those of the typical bar, which could include industry partnerships (gaming hardware sponsorships, group events, or co-branded local or regional eSports tournaments with entry fees or the like). The students surveyed showed clear interest in having a bar/arcade near campus. However, the greatest source of profit for a bar, alcohol sales, does not fit the demographics of the customers showing the highest interest in our proposed bar/arcade.


Predictive Capabilities of UNCC Admissions Criterion

Coming Soon!

Analyzing 3M Patents

Team Members: Christy C., Rajah C., Lorenzo M, Abdullah M., Ryan W.
Program(s): Python, R, Hadoop, Tableau, Gephi
Model(s): Topic Modeling, Graph analysis, Clustering

We analyzed 3M’s US patent portfolio relative to seven competitors using topic modeling, k-means clustering and network analysis. The dataset includes about 33,000 patents for the eight companies. Relative to the selected competitors, 3M has a competitive advantage in areas like Stock Materials, Synthetic Resins, Optical Systems, and Adhesives. Using cosine similarity on the class distribution by company, 3M’s patent portfolio is most like Bostik, Dow, and Du Pont, who are predominately Synthetic Materials and Chemistry companies. General Electric and Siemens focus more on Energy and Data & Processing patents; however, these companies compete with 3M in areas like Surgery and Stock Materials patents.
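As a rough illustration of the cosine similarity comparison, the sketch below treats each company's portfolio as a vector of patent counts per class and ranks the other companies by similarity to a target. The company names and counts are invented for the example and are not figures from the analysis.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(target, portfolios):
    """Other companies sorted from most to least similar to the target."""
    return sorted(
        ((name, cosine_similarity(portfolios[target], counts))
         for name, counts in portfolios.items() if name != target),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Hypothetical patent counts per class (same class order for every company).
portfolios = {
    "Company A": [40, 30, 5, 0],
    "Company B": [35, 25, 10, 2],
    "Company C": [2, 3, 50, 45],
}
ranking = rank_by_similarity("Company A", portfolios)
print(ranking)
```

Because cosine similarity depends only on the direction of the vectors, a small company and a large one with the same mix of classes score as highly similar.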

Next, we used Topic Modeling (Latent Dirichlet Allocation or LDA) to identify five topics in the patents’ abstracts. We labeled the five topics as: Synthetic Materials, Chemistry, Energy, Electrical and Data & Processing. We then used k-means clustering on the topic probabilities to create five patent clusters, each corresponding to an LDA topic. Most of 3M’s patents (75%) fall in the Synthetic Materials and Chemistry clusters.

Finally, we examined shared Patent Title bigrams to analyze the relationship between patents using network analysis. Many of 3M’s patents share ideas and cross-pollinate to create new, disruptive innovations. Synthetic Materials and Chemistry patents exhibit the strongest cross-pollination, as they share many bigrams in classes like Compositions, Adhesives, Abrasives and Synthetic Resins. On the other hand, 3M’s Electrical and Data & Processing patents are largely disconnected from the network. 3M has a competitive disadvantage in these patents, as GE, Honeywell and Siemens dominate the market. We also expanded our analysis to competitors in Synthetic Materials and Chemistry patents and to 3M research collaboration with an assignee network.
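One way to picture the shared-bigram network is to link any two patents whose titles have at least one bigram in common. The sketch below does that with the standard library only. The patent ids and titles are made up, and the tokenization (lowercased alphabetic words, no stop-word removal) is an assumption rather than the preprocessing used in the project.

```python
import re
from collections import defaultdict
from itertools import combinations

def title_bigrams(title):
    """Set of adjacent word pairs in a lowercased title."""
    words = re.findall(r"[a-z]+", title.lower())
    return set(zip(words, words[1:]))

def shared_bigram_edges(titles):
    """Map each pair of patent ids to the set of bigrams their titles share."""
    patents_by_bigram = defaultdict(set)
    for patent_id, title in titles.items():
        for bigram in title_bigrams(title):
            patents_by_bigram[bigram].add(patent_id)
    edges = defaultdict(set)
    for bigram, patent_ids in patents_by_bigram.items():
        for u, v in combinations(sorted(patent_ids), 2):
            edges[(u, v)].add(bigram)
    return dict(edges)

# Hypothetical patent titles.
titles = {
    "P1": "Pressure sensitive adhesive composition",
    "P2": "Pressure sensitive adhesive tape",
    "P3": "Coated abrasive article",
    "P4": "Method of making a coated abrasive",
    "P5": "Wireless data transmission system",
}
edges = shared_bigram_edges(titles)
for pair, bigrams in sorted(edges.items()):
    print(pair, len(bigrams))
```

In this toy data P5 shares no bigram with any other title, so it ends up disconnected, which is the pattern described for the Electrical and Data & Processing patents.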

What Does All That Mean??

While 3M still enjoys a dominant position in Synthetic Materials, the external Dow/DuPont merger seriously threatens its market position, and internal initiatives from CEO Inge Thulin to increase patent quantity have not yielded the desired results. Our research uncovered an objective way to measure the value of a patent: cross-pollination. We used Betweenness, a network measure often applied to friends in a social network, as our measure of cross-pollination. GE seems to have mastered cross-pollination and enjoys an enviable position in the patent world; 3M has only a toe-hold in the world of cross-pollinators.
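Betweenness counts how often a node sits on the shortest paths between other nodes, so a patent that bridges two otherwise separate groups scores highly. The sketch below is a minimal version for an unweighted, undirected network using Brandes' algorithm. It is illustrative only and not the project's code, and the patent ids and links are invented.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness for an undirected graph {node: set(neighbors)}."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical network: a triangle (P1, P2, P3) with a chain P3 - P4 - P5.
network = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2", "P4"},
    "P4": {"P3", "P5"},
    "P5": {"P4"},
}
scores = betweenness(network)
print(scores)
```

Here P3 scores highest because every shortest path from P1 or P2 to P4 or P5 runs through it.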

We recommend that 3M cultivate cross-pollination and beat competitors at their own game. Now that we know how to measure cross-pollination, 3M can both reward the best in-house inventors and snap up the best independent (freelance) inventors. Extending the cross-pollination measure to companies gives 3M a way to acquire small companies that cross-pollinate, extending its innovative reach. And for foreign or large companies, 3M should form strategic alliances, again using cross-pollination as a yardstick. Cross-pollination will create more patents, which in turn will bring more unique products to the market, creating value for customers. Diverting one-tenth of the R&D budget to cross-pollination could help this strategy succeed without increasing expenditures. 3M would spend the same dollars, but in a more focused, targeted fashion. A cross-pollination strategy should set Thulin’s innovation vision aright and ensure his long tenure at the helm. Never has 3M had such a powerful, data-driven tool with which to drive innovation in its patent program.


SQL Bookstore Project

Coming Soon!

Donald Trump Voter Segmentation

Team Members: Null
Program(s): R
Model(s): Clustering

For more detail, visit my GitHub.

The purpose of this project was to look at the demographic profile of the people who vote for Donald Trump. This led me to create clusters of demographic profiles and see which cluster votes for Donald Trump more than the others. The dataset contains primary results, demographic data, and a codebook. (Note: The data contains results from primary races as of March 2016, which is why not every state is represented.) I created several other datasets during the project to look at the numbers more accurately. In the first phase I focused solely on the primary_results.csv file, creating separate datasets to look at counties won, states won, and party ratios, all of which relied heavily on the fraction_votes variable. During the second phase I was more concerned with demographic data and needed to merge the two datasets together.
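To make the "counties won" step concrete, here is a small sketch that picks each county's winner within one party by the largest fraction_votes and then computes a candidate's share of counties won. Only the fraction_votes name comes from the write-up. The other column names (state, county, party, candidate) and every row of the sample are assumptions made for the example.

```python
import csv
import io

# Made-up rows in the assumed layout of primary_results.csv.
SAMPLE_CSV = """state,county,party,candidate,fraction_votes
State X,County 1,Republican,Candidate A,0.45
State X,County 1,Republican,Candidate B,0.30
State X,County 1,Democrat,Candidate C,0.60
State X,County 2,Republican,Candidate A,0.25
State X,County 2,Republican,Candidate B,0.40
State Y,County 1,Republican,Candidate A,0.50
State Y,County 1,Republican,Candidate B,0.35
"""

def county_winners(rows, party):
    """Return {(state, county): candidate with the largest fraction_votes}."""
    best = {}
    for row in rows:
        if row["party"] != party:
            continue
        key = (row["state"], row["county"])
        share = float(row["fraction_votes"])
        if key not in best or share > best[key][1]:
            best[key] = (row["candidate"], share)
    return {key: candidate for key, (candidate, _share) in best.items()}

def share_of_counties_won(winners, candidate):
    """Fraction of counties in which the candidate finished first."""
    return sum(1 for winner in winners.values() if winner == candidate) / len(winners)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
winners = county_winners(rows, "Republican")
share_a = share_of_counties_won(winners, "Candidate A")
print(winners, round(share_a, 3))
```

Keying on the (state, county) pair matters because county names repeat across states.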

For the modeling phase of the project I concentrated on clustering. The main idea behind k-means clustering is to minimize within-group variance while maximizing between-group variance. To begin, the k-means algorithm randomly assigns each observation to one of k groups and calculates each group’s center. The next two steps are repeated until convergence: [1] assign each observation to the group that has the closest center, and [2] after all observations have been reassigned, recalculate the groups’ centers. Once the value of k has been determined, we can finally run the kmeans algorithm in R. The kmeans function allows the user to specify the centers, the maximum number of iterations, and which particular algorithm to run, among other options. We set centers, k, equal to 4 (identified via line chart), the maximum number of iterations to 1000, and the algorithm to MacQueen.
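The two-step loop described above can be written out directly. Note that this sketch follows the batch (Lloyd-style) description in the text, not the MacQueen variant that was selected in R, and that the sample points, the seed, and the empty-group handling are assumptions for the example. The within-group sum of squares helper is the quantity one would plot against k when choosing k from a line chart.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iter=1000, seed=0):
    rng = random.Random(seed)
    # Random initial assignment; the first k points are spread across the
    # groups so that no group starts out empty.
    labels = [i if i < k else rng.randrange(k) for i in range(len(points))]
    centers = []
    for _ in range(max_iter):
        # Step [2]: recalculate each group's center from its current members.
        centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # An emptied group is re-seeded with a random observation.
            centers.append(centroid(members) if members else rng.choice(points))
        # Step [1]: assign each observation to the group with the closest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # converged: no observation changed group
            break
        labels = new_labels
    return labels, centers

def within_ss(points, labels, centers):
    """Total within-group sum of squares."""
    return sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))

# Two made-up, well-separated groups of observations.
points = [(0.0, 0.0), (0.2, 1.0), (1.0, 0.1), (10.0, 10.0), (10.3, 11.0), (11.0, 9.8)]
labels, centers = kmeans(points, k=2)
wss_by_k = {k: within_ss(points, *kmeans(points, k)) for k in (1, 2, 3)}
print(labels, wss_by_k)
```

A sharp drop in the within-group sum of squares followed by a flattening is what the "elbow" in such a chart refers to.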

While clustering is extremely useful, the output can be hard to interpret. Initially we have no clue as to what these clusters mean or what they represent. To help with this problem I looked at the mean of every variable in each cluster and tried to define what each cluster is. Below is a breakdown of the 4 clusters identified:

Rural: 64% of the counties in the analysis fall into this cluster. This is the largest cluster and contains a high percentage of White and African American persons. The ethnicities with the lowest percentages are Hispanics and Asians, and the cluster also has the lowest foreign-born percentage. Other characteristics of the cluster include the least educated population, the lowest median income (~$37,000), and the highest poverty level (21.5%). This cluster has the highest percentage of Native Americans, 1.67%. Donald Trump has his highest winning percentage in this cluster, winning 74% of its counties.

Rising Prosperity: 30% of the counties in the analysis fall into this cluster. This cluster has the highest percentage of White persons and low or lowest percentages of African American, Asian, and Hispanic persons. The homeowner rate is higher in this cluster than in the others, and the cluster also has a high median household income and a low poverty rate. Donald Trump has his lowest winning percentage in this cluster, winning only 53% of its counties.

Urban: 5% of the counties fall into this cluster. They have a good mixed proportion of the White, African American, Asian, and Hispanic ethnicities, as well as foreign-born voters. This cluster is where you will find the highest educated and wealthiest citizens (median household income, ~$55,000) in the dataset. Donald Trump has won 72% of the counties in the cluster.

Metropolitan: 1% of the counties fall into this cluster. This cluster has the largest percentages of young voters and of African American, Asian, and Hispanic persons, as well as the highest percentage of foreign-born persons. This cluster also has a high poverty rate, 17%, and the lowest homeowner rate, 61.5%. Unsurprisingly, it boasts the highest population density too. Donald Trump has won nearly 70% of the counties in the cluster.

Immediately, I noticed some discrepancies between my findings and the media’s reporting. I was led to believe that Trump polls well with the older white population while not polling so well with other ethnicities. My findings do tend to corroborate reports that Trump polls well with the white population, as he won 74% of the “Rural” cluster. However, he also has a high winning percentage in the “Metropolitan” cluster, which contains the highest percentages of young voters and of African American, Asian, and Hispanic persons.

Women’s Health Risk Assessment

Team Members: Weiwei G., Yang L., Sara M.
Program(s): SAS, R, Python
Model(s): Logistic Regression, Clustering, Neural Network, Decision Tree, Random Forests

Human Immunodeficiency Virus (HIV) is a virus spreading around the world. Per the National Institute of Allergy and Infectious Diseases (NIAID), nearly 37 million people have HIV, and 1.2 million people are living with HIV in the United States. However, a large portion are unaware of their HIV status. To decrease transmission of the disease and improve women’s health, Microsoft announced the Women’s Health Risk Assessment competition, funded by Bill & Melinda Gates, which looks for machine learning solutions. The competition focuses on underdeveloped areas of the world, including most of Africa and India. In association with the World Health Organization (WHO), they developed a survey that reached 9,000 women in clinics, with approximately 1,000 subjects in each region. The purpose of this study is to help clinics target those at risk of the disease and plan health care budgets more efficiently so that more women can be treated in underdeveloped countries. The problem is a binary classification problem: low or high risk of HIV. Researchers want to see which factors are most strongly associated with HIV infection among women ages 15-44.

To build a relatively accurate model, we needed to find out which features matter the most. To do this properly, we fit a random forest model to our dataset. The random forest model in the Python scikit-learn package offers a way to see the importance of each independent variable relative to the dependent variable. The most important features revealed by this analysis are modcon, tribe, district, babydoc, labordeliv, geo, segment, debut, usecondom, and hivknow. Using these variables, among others, we developed several different models, including logistic regression, a decision tree, random forests, and a neural network. The takeaway from this, along with model creation and validation, is that individuals who have access to modern technologies (computers, radios, irrigation, etc.) and use modern contraception and/or condoms have a lower risk of HIV. Other factors that can radically influence the risk are religion, education, literacy, and income.

We compared the models using mean squared error (MSE) on validation data to select the best model for the project. Because we used variable clustering for predictor selection prior to logistic regression, decision tree, and neural network, we used SAS e-Miner to compare these three models, and the program selected the neural network as the best model, with an MSE of 0.12. However, after comparing that with the cross-validated MSE from ridge regression and random forest, we believe that random forest, with an MSE of 0.07, is the best model for predicting HIV subgroup in this project.
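For a binary target, the MSE comparison amounts to averaging the squared gap between each predicted probability and the 0/1 label, then choosing the model with the smallest value. The sketch below shows that selection rule only. The labels, model names, and predicted probabilities are invented and unrelated to the MSE figures reported above.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def best_model(y_true, predictions):
    """Name and MSE of the model with the lowest validation MSE."""
    errors = {name: mean_squared_error(y_true, preds)
              for name, preds in predictions.items()}
    winner = min(errors, key=errors.get)
    return winner, errors

# Hypothetical validation labels (1 = high risk) and predicted probabilities.
validation_labels = [1, 0, 1, 0]
predictions = {
    "model_a": [0.6, 0.4, 0.7, 0.5],
    "model_b": [0.9, 0.2, 0.8, 0.1],
}
winner, errors = best_model(validation_labels, predictions)
print(winner, errors)
```

The comparison is only fair when every model is scored on the same held-out observations, which is why the labels are passed in once and shared.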

Finally, our group wanted to experiment with some form of unsupervised learning. The dataset lends itself well to segmentation, especially considering that it contains a myriad of health, demographic, and socio-economic features. The clustering analysis was done in R using the k-means approach. After consulting the within-cluster sum of squares “elbow” plot, we set k to 3. Looking at the means of each cluster, it becomes clear that there are differences among them.

Cluster 1: Educated Young Women - This cluster has the youngest average age of the three clusters, 18. It contains the lowest percentage of respondents who answered yes to ever_had_sex and ever_been_pregnant, which leads to the smallest share of women who currently have a child. Women in this cluster are educated as well: 31% of them are currently in school, they are literate, and they are the most educated of the three clusters. They have the lowest risk of HIV infection and are mostly Christian or Muslim.

Cluster 2: Sexually Active Mid 20s Women - As the name says, this cluster’s average age is 26, and it contains the highest percentage of women who have had sex, been pregnant, and had multiple partners. It has the lowest percentage (6%) in school, which could be explained by the fact that they are older and may be out of school now. They are still literate and moderately educated. They have the highest HIVknow status and are the most likely to be married and/or have children. They have the highest risk of HIV infection and are mostly Christian or Muslim.

Cluster 3: Other - This cluster is all over the place; there is no proper way to describe it. A couple of things to note are that its members are mostly Muslim and Hindu and that they are the least educated. However, they are more likely to own a cooker or fridge than women in the other clusters.

Hire Heroes USA Data Challenge

Coming Soon!

Assessing the Innovative Characteristic of US Patents

Team Members: Marcia P.
Program(s): R, Python, PySpark, Hadoop
Model(s): Clustering

For more detail, visit my GitHub.

This project was perhaps the most fun and the most difficult of all the projects that I completed during my time at university. It was my first venture into Spark and into coding up a script to run autonomously from a single command. The domain was also interesting. My teammate was, for all intents and purposes, an expert in the field of patents; she had performed numerous other projects on the subject. The motivation for this assignment was that in 2013, the USPTO started transitioning from a legacy patent classification system (the US Patent Classification System or USPC system) to the new Cooperative Patent Classification (CPC) system.

From 2011 to 2014, Dr. Deborah Strumsky and others did significant research to assess the level of patent innovation based on the USPC codes assigned to a patent. This project repeated a small portion of her work to see if the new CPC system can also be used to assess patent innovation. The CPC’s hierarchical nature and the diversity of CPC sections, classes, and subclasses unique to each patent could explain what type of innovative diversity might be driving n-tuple trends. In addition, we applied k-means clustering to 3 million patents to see if patents can be clustered based on the quantity and diversity of patent codes assigned to each patent. This technique and these metrics could be used to assess the technological or innovative diversity of patents.

A Primer on CPC Code Hierarchy
Section: there are just seven 1-digit “sections” in the CPC system.
Class: these are 3-digit codes.
Sub-class: these are 4-digit codes; there are 674 sub-classes in the CPC system.
Main-group: defined as the 8-digit code, everything before the “/”.
Subgroup, or n-tuple: the complete 14-digit codes, of which there are 250k.

Example of an entire code: B64C 13/02.
B is the section: ex. performing operations; transporting.
64 defines the class: ex. Aircraft, Aviation, Cosmonautics.
C defines the sub-class: ex. Aeroplanes, helicopters.
13 defines the main group: ex. control systems or transmitting systems for actuating flying-control surfaces.
02 defines the subgroup: ex. aircraft control/automatic/electric course control.
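The hierarchy in the example code can be pulled apart with a short parser. This is a sketch under the assumption that codes are written the way the example is, with a letter, two digits, a letter, a space, the main group, a slash, and the subgroup. The CpcCode type and parse_cpc function are names made up for the illustration.

```python
import re
from typing import NamedTuple

class CpcCode(NamedTuple):
    section: str      # e.g. "B"
    cpc_class: str    # e.g. "B64"
    subclass: str     # e.g. "B64C"
    main_group: str   # e.g. "B64C 13" (everything before the "/")
    subgroup: str     # e.g. "02"

# Assumed layout: section letter, two class digits, sub-class letter,
# whitespace, main group digits, "/", subgroup digits.
CPC_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z])\s+(\d+)/(\d+)$")

def parse_cpc(code):
    """Split a full CPC code string into its hierarchy levels."""
    match = CPC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognizable CPC code: {code!r}")
    section, class_digits, subclass_letter, group, subgroup = match.groups()
    cpc_class = section + class_digits
    subclass = cpc_class + subclass_letter
    return CpcCode(section, cpc_class, subclass, f"{subclass} {group}", subgroup)

parsed = parse_cpc("B64C 13/02")
print(parsed)
```

Keeping each level as a cumulative prefix (B, B64, B64C) makes it easy to count distinct classes or subclasses later without re-parsing.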

Each n-tuple is labeled as inventive (based on claims) or additive (providing additional information to the patent) and first or later. There is only one first n-tuple on each patent, which represents the one n-tuple that best represents the patent.

Results
Our analysis showed that, since 2008, there has been a steady increase in the average number of unique 14-digit CPC codes assigned to patents. Overall, this increase is driven by the use of more maingroups, not more classes or subclasses, on each patent. This could be interpreted as meaning that patents are not combining very diverse technologies but are increasingly combining claims related to technology subsegments that are fairly closely related to each other. Our analysis also showed that patents defined by 1-tuples or 2-tuples are declining significantly, and patents defined by 10+ tuples (i.e. patents with 10 or more unique 14-digit CPC codes assigned to them) are increasing significantly.
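The quantities discussed above (the number of unique codes on a patent and how many distinct maingroups, subclasses, and classes those codes span) can be counted per patent as in the sketch below. It assumes code strings shaped like the B64C 13/02 example. The code list is illustrative, follows that format, and is not taken from the project's data.

```python
def diversity_profile(codes):
    """Count unique CPC codes on one patent and the hierarchy levels they span."""
    unique = set(codes)
    return {
        "n_tuple": len(unique),                                   # unique full codes
        "main_groups": len({code.split("/")[0] for code in unique}),
        "subclasses": len({code[:4] for code in unique}),
        "classes": len({code[:3] for code in unique}),
        "sections": len({code[:1] for code in unique}),
    }

# Illustrative codes for a single hypothetical patent (one code is repeated).
patent_codes = ["B64C 13/02", "B64C 13/04", "B64C 27/00", "B64D 31/00", "B64C 13/02"]
profile = diversity_profile(patent_codes)
print(profile)
```

A patent like this one, with several maingroups but a single class, matches the pattern described above: many closely related technologies rather than a mix of distant ones.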

Our clustering results are presented below. A distributed clustering program, based on Lloyd’s algorithm, was written to run in PySpark. We evaluated our distributed k-means clustering program by confirming its cluster means in R.

Maingroup Driven Cluster (n-tuple of 96): These patents have an unusually high total number of codes describing the patent, and this high number of codes is driven by maingroups (where maingroups represent fairly closely related technology sub-sectors, like “aircraft power plants or engines” and “aircraft flight control systems”). So the innovation in this cluster is driven by the combination of many fairly closely related technologies within a specific technology domain.

High Diversity Cluster (n-tuple of 32): This cluster has the broadest use of classes, 2.7, and subclasses, 3.6, and fairly robust use of n-tuples. Thus, this cluster of patents represents innovation that combines technologies from the broadest/most diverse set of technology domains.

Average Cluster (n-tuple of 11): This cluster represents patents in the middle of our “narrow to diverse” technology scale.

Narrow Cluster (n-tuple of 3.35): These patents incorporate the narrowest (least diverse) set of technologies. Only about three codes in total are used on the patent, and all three are from just one or two classes or subclasses. So patents in this cluster are focused mostly within one technology domain.
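For readers curious how Lloyd's algorithm splits across workers, the sketch below mimics the map and reduce structure of one iteration in plain Python: each partition produces per-cluster partial sums and counts, and the partials are merged to give the new centers. It does not use PySpark and is not the program described above. The partitions, starting centers, and function names are assumptions for the illustration.

```python
from functools import reduce

def closest(point, centers):
    """Index of the center nearest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))

def map_partition(partition, centers):
    """Map side: per-cluster (coordinate sums, count) for one chunk of data."""
    partial = {}
    for point in partition:
        cluster = closest(point, centers)
        sums, count = partial.get(cluster, ((0.0,) * len(point), 0))
        partial[cluster] = (tuple(s + x for s, x in zip(sums, point)), count + 1)
    return partial

def merge(left, right):
    """Reduce side: combine two dictionaries of partial sums and counts."""
    merged = dict(left)
    for cluster, (sums, count) in right.items():
        if cluster in merged:
            old_sums, old_count = merged[cluster]
            merged[cluster] = (tuple(a + b for a, b in zip(old_sums, sums)),
                               old_count + count)
        else:
            merged[cluster] = (sums, count)
    return merged

def lloyd_step(partitions, centers):
    """One iteration: new centers are the merged sums divided by the merged counts."""
    totals = reduce(merge, (map_partition(p, centers) for p in partitions), {})
    return [tuple(s / totals[c][1] for s in totals[c][0]) if c in totals else centers[c]
            for c in range(len(centers))]

# Made-up points split across two partitions, plus two starting centers.
partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = lloyd_step(partitions, centers)
print(new_centers)
```

Only the small per-cluster sums and counts cross partition boundaries, which is what makes this formulation practical for millions of patents.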