Yelp Data Challenge

Student Group: Northwestern University Master of Science in Analytics (MSiA)
Team Members: Vincent Wang
Client: Yelp

Image of BP Project

1. Purpose and Objectives

Use NLP to predict user ratings from customer reviews.
Build a classification model that predicts whether a review gives a perfect rating (5 stars) or not (1-4 stars).
Cluster the keywords in user reviews to uncover hidden user preferences.
Understand customer sentiment through sentiment analysis (token coefficients from logistic regression).
Propose a recommendation system.

2. Dataset

The Challenge Dataset:
- 4.1M reviews and 947K tips by 1M users for 144K businesses
- 1.1M business attributes, e.g., hours, parking availability, ambience.
- Aggregated check-ins over time for each of the 125K businesses
- 200,000 pictures from the included businesses
Cities:
- U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland
Files:
- yelp_academic_dataset_business.json
- yelp_academic_dataset_review.json

3. Data Cleansing

Load the business and review data into pandas data frames.
Sampling
- Create a filter that selects the businesses of interest: those located in “Las Vegas” whose categories contain “Restaurants” (you may need to filter out null categories first)
Keep relevant columns
- business_id
- name
- categories
- stars
Then, load the review data:
Join the two tables and set the id as the index
Filter the data further
- by date, e.g. keep only reviews from the last 2 years
- this deliberately ignores reviews written too long ago
- otherwise your laptop may run out of memory when running machine learning algorithms
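The cleansing steps above can be sketched with pandas roughly as follows. This is a minimal illustration on toy in-memory frames, not the project's actual code: the sample rows, the `avg_stars` rename, the choice of `business_id` as the index, and measuring "last 2 years" from the newest review in the data are all assumptions. It also assumes `categories` is stored as a comma-separated string.

```python
import pandas as pd

# Toy stand-ins for the two JSON files (hypothetical rows). With the real data,
# something like pd.read_json(path, lines=True) would likely be used instead,
# assuming the files are line-delimited JSON.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b4"],
    "name": ["Taco Spot", "Nail Bar", "Noodle House", "Steak Place"],
    "city": ["Las Vegas", "Las Vegas", "Phoenix", "Las Vegas"],
    "categories": ["Mexican, Restaurants", "Beauty & Spas", "Restaurants", None],
    "stars": [4.0, 4.5, 3.5, 5.0],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "business_id": ["b1", "b1", "b2", "b3"],
    "stars": [5, 2, 4, 3],
    "date": ["2017-01-10", "2012-05-02", "2016-11-30", "2016-08-15"],
    "text": ["Amazing tacos", "Slow service", "Great manicure", "Decent noodles"],
})

# Filter out null categories first, then select Las Vegas restaurants.
business = business[business["categories"].notna()]
is_vegas = business["city"] == "Las Vegas"
is_restaurant = business["categories"].str.contains("Restaurants")
selected = business.loc[is_vegas & is_restaurant,
                        ["business_id", "name", "categories", "stars"]]

# Rename the business-level rating so it does not clash with the review rating.
selected = selected.rename(columns={"stars": "avg_stars"})

# Join reviews to the selected businesses and index by business_id.
joined = review.merge(selected, on="business_id", how="inner")
joined = joined.set_index("business_id")

# Keep only reviews from the last 2 years, counted back from the newest review.
joined["date"] = pd.to_datetime(joined["date"])
cutoff = joined["date"].max() - pd.DateOffset(years=2)
recent = joined[joined["date"] > cutoff]
print(recent[["review_id", "stars", "avg_stars", "date"]])
```

Filtering the businesses before the join keeps the merged frame small, which is the same memory concern that motivates the date filter.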

4. NLP

Define your feature variables
- Here the feature is the text of the review, for example:
- “The food was decent, but the service was definitely sub par (especially for the price!).It took 15-20 minutes for the server to acknowledge us just so that we order drinks. Our dinner was good but I have to say that I was disappointed that they did nothing to acknowledge my birthday celebration. We chose this restaurant over STK & CUT to celebrate the occasion and I wish we hadn’t. They did nothing to acknowledge our occasion which was sad. I will not be going back to this restaurant again.”
Text Processing
- The first thing we need to do is process our text. Common steps include:
Lowercase all of your text
Strip out miscellaneous spacing and punctuation
Remove stop words
- Stop words are words that carry little meaning on their own but make the sentence grammatically correct.
- Examples are ‘am’, ‘the’, ‘my’, ‘to’, etc. NLTK contains 153 words in its English stop word set. Stop words can also be domain specific.
Stem/Lemmatize our text
- The goal of this process is to transform a word into its base form.
- e.g. “ran”, “runs” -> “run”. You can think of the base form as what you would look up in a dictionary. Popular techniques include stemming and lemmatization.
- Stemming removes the suffix, whereas lemmatization attempts to map all forms of a word to the same form.
- Stemmers tend to operate on a single word without knowledge of the overall context. They are not perfect, however (e.g. stemming “Paris” can yield “pari”).
Text Vectorization: Term Frequency
- Machine learning algorithms need numeric input, so we must convert our corpus of text data into some form of numeric matrix representation.
- The simplest numeric representation is the term-frequency matrix, also called a document-term matrix (https://en.wikipedia.org/wiki/Document-term_matrix).
- Each column of the matrix is a word, each row is a document, and each cell holds the count of that word in that document.
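As a small illustration of that layout, the sketch below builds a term-frequency matrix by hand with the standard library. The three sample documents and the whitespace tokenization are assumptions made for the example; a real pipeline would more likely use a library vectorizer.

```python
from collections import Counter

# Three toy documents (hypothetical examples, already lowercased).
docs = [
    "the food was good",
    "the service was slow",
    "good food good service",
]

# Whitespace tokenization is enough for this toy corpus.
tokenized = [doc.split() for doc in docs]

# Columns: the sorted vocabulary across all documents.
vocab = sorted({token for doc in tokenized for token in doc})

# Rows: one per document; each cell is the count of that word in that document.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```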
Let’s start with the flow:
- Tokenization -> Sentence segmentation -> Stemming/Lemmatization -> Stop word removal -> Bag of words/TF-IDF
- Tokenization
  - ‘I’, ‘banked’, ‘on’, ‘going’, ‘to’, ‘the’, ‘river’, ‘bank’, ‘today’, ‘.’
- Document Creation
  - A document is stored as a list of lists: each inner list is one sentence, and each string in it is one token.
  - ‘I’, ‘banked’, ‘on’, ‘going’, ‘to’, ‘the’, ‘river’, ‘bank’, ‘today’, ‘.’
- Lower case
  - ‘i’, ‘banked’, ‘on’, ‘going’, ‘to’, ‘the’, ‘river’, ‘bank’, ‘today’, ‘.’
- Stemming/Lemmatization
  - ‘i’, ‘bank’, ‘on’, ‘go’, ‘to’, ‘the’, ‘river’, ‘bank’, ‘today’, ‘.’
- Remove stop words
  - ‘bank’, ‘go’, ‘river’, ‘bank’, ‘today’, ‘.’
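The walkthrough above can be reproduced with a short standard-library sketch. The stop word set and the `toy_stem` suffix stripper are deliberately tiny, hypothetical stand-ins, just enough for this one sentence; a real project would more likely rely on NLTK's stop word list and one of its stemmers or lemmatizers.

```python
import re

# Tiny illustrative stop word set (an assumption, not NLTK's full list).
STOP_WORDS = {"i", "on", "to", "the"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_stem(token):
    """Strip a couple of common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = tokenize(text)                        # tokenization
    lowered = [t.lower() for t in tokens]          # lower case
    stemmed = [toy_stem(t) for t in lowered]       # stemming
    return [t for t in stemmed if t not in STOP_WORDS]  # stop word removal


print(preprocess("I banked on going to the river bank today."))
```

The steps run in the same order as the walkthrough, so each intermediate list matches the corresponding bullet.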

Define your target variable
- any categorical variable that may be meaningful
- For example, I am interested in perfect (5 stars) versus imperfect (1-4 stars) ratings
- You may want to look at the distribution of the target variable
Create the training dataset and the test dataset
Get the NLP representation of the documents
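A minimal sketch of the target definition and the train/test split, using only the standard library. The sample reviews, the helper names `make_target` and `split_train_test`, the 25% test fraction, and the seed are assumptions for illustration; a library splitter would serve the same purpose.

```python
import random
from collections import Counter

# Hypothetical sample of reviews with their star ratings.
reviews = [
    {"text": "Amazing tacos, best in town", "stars": 5},
    {"text": "Service was slow", "stars": 2},
    {"text": "Perfect date night", "stars": 5},
    {"text": "Food was ok", "stars": 3},
    {"text": "Delicious and friendly", "stars": 5},
    {"text": "Good but pricey", "stars": 4},
    {"text": "Rude staff", "stars": 1},
    {"text": "Highly recommend the ramen", "stars": 5},
]


def make_target(stars):
    """Binary target: 1 for a perfect (5 star) rating, 0 for 1-4 stars."""
    return 1 if stars == 5 else 0


def split_train_test(items, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


labeled = [(r["text"], make_target(r["stars"])) for r in reviews]

# Look at the class balance of the target before modelling.
balance = Counter(label for _, label in labeled)
print(balance)

train, test = split_train_test(labeled)
print(len(train), "training reviews,", len(test), "test reviews")
```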
Similar review search engine
- Select the query review.
- Vectorize it so that it has the same format as the reviews in the review pool.
- Calculate the cosine similarity between the query and every review in the pool.
- Cosine and Euclidean distance can both measure the distance between documents.
- Sort the results and choose the top 5 reviews with the least distance.
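The search steps can be sketched with plain word-count vectors and cosine similarity. The review pool, the query, and the helper names are hypothetical, and raw counts stand in for whatever vectorizer the project used. Ranking by highest cosine similarity is equivalent to ranking by least cosine distance.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, pool, k=5):
    """Return the k (similarity, review) pairs closest to the query."""
    q = vectorize(query)
    scored = [(cosine_similarity(q, vectorize(doc)), doc) for doc in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


pool = [
    "great food and great service",
    "terrible slow service",
    "the food was great",
    "parking was easy",
]
top = most_similar("great food", pool, k=2)
for score, doc in top:
    print(round(score, 3), doc)
```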

Classifying positive/negative reviews
- Target: two groups (favorite: 5 stars; not favorite: fewer than 5 stars)
- Features: review text (vector transformed)
- Metric: accuracy
Naive Bayes Classifier
Logistic Regression Classifier
- The coefficient of each token shows whether it pushes the prediction in a positive or negative direction
- Positive: ‘amazing’, ‘best’, ‘awesome’, ‘perfect’, ‘thank’, ‘delicious’, ‘highly’
- Negative: ‘worst’, ‘ok’, ‘rude’, ‘horrible’, ‘bland’, ‘slow’, ‘terrible’
Random Forest Classifier
Clustering
- Cluster the review text data for all the restaurants
  1. Define your feature variable, here the text of the review
  2. Define your target variable (any categorical variable that may be meaningful, e.g. ratings, types)
  3. For example, I am interested in perfect (5 stars) versus imperfect (1-4 stars) ratings
  4. Cluster the text based on other variables
    1. Get the NLP representation of the documents
    2. Fit TfidfVectorizer on the training data only, then transform all the data to tf-idf
    3. Run k-means clustering with 8 groups and inspect the similar words in each group
- Cluster all the reviews of the most reviewed restaurant
  1. Find the most reviewed restaurant and analyze its reviews
  2. Vectorize the text feature
  3. Define your target variable (for later classification use)

Recommendation Systems:
- Get business_id, user_id, stars for the recommender
- Create a utility matrix from the records
- Convert it to a document-word-like matrix
- Recommendation system with GraphLab
  - Build the utility matrix
    - extract and quantify some features of the restaurants
  - Content-based recommender
    - extract some features for the recommender
  - items and categories
  - Popularity-based recommender
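To make the utility matrix and the popularity-based recommender concrete, here is a plain-Python sketch. It does not use GraphLab; the rating records, the helper names, and the ranking rule (number of ratings first, then mean stars) are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical (user_id, business_id, stars) records.
records = [
    ("u1", "b1", 5), ("u1", "b2", 3),
    ("u2", "b1", 4), ("u2", "b3", 5),
    ("u3", "b1", 5), ("u3", "b2", 2), ("u3", "b3", 4),
]


def build_utility_matrix(records):
    """Sparse utility matrix: user -> {business -> stars}."""
    matrix = defaultdict(dict)
    for user, business, stars in records:
        matrix[user][business] = stars
    return dict(matrix)


def popularity_recommend(matrix, user, k=2):
    """Recommend the k most popular businesses the user has not rated yet."""
    ratings_by_business = defaultdict(list)
    for ratings in matrix.values():
        for business, stars in ratings.items():
            ratings_by_business[business].append(stars)

    seen = matrix.get(user, {})
    candidates = [
        (len(stars), sum(stars) / len(stars), business)
        for business, stars in ratings_by_business.items()
        if business not in seen
    ]
    # Rank by number of ratings, then by mean stars.
    candidates.sort(reverse=True)
    return [business for _, _, business in candidates[:k]]


utility = build_utility_matrix(records)
print(popularity_recommend(utility, "u1"))   # existing user
print(popularity_recommend(utility, "u4"))   # new user with no ratings
```

A popularity-based recommender gives every user the same ranking apart from the items they have already rated, which makes it a natural baseline and a fallback for new users.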