Hands-on: a recommendation algorithm using genres and movie tags with TF-IDF

근본없는 개발자 2024. 5. 1. 21:57

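0. Previously covered

1) Content-based recommender systems
2) Whether content can be used
3) How to evaluate how related pieces of content are to each other
4) Nearest-neighbor content-based recommender system -> k-nearest neighbor
5) Naive Bayes classifier -> Bayesian recommender system
6) TF-IDF: concept and hands-on practice

📌 Evaluating recommender system performance → RMSE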

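1. Loading the dataset

. Load the required data.

import os

import numpy as np
import pandas as pd
from tqdm import tqdm

# Set the path for your own environment. Paths differ between Google Colab and Jupyter.
path = '/content/drive/MyDrive/Colab Notebooks'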

ratings_df = pd.read_csv(os.path.join(path, 'ratings.csv'), encoding='utf-8')
movies_df = pd.read_csv(os.path.join(path, 'movies.csv'), index_col='movieId', encoding='utf-8')
tags_df = pd.read_csv(os.path.join(path, 'tags.csv'), encoding='utf-8')

. Inspect the loaded data

# Which rating each user gave to each movie
ratings_df.head()

# Genre information comes from this table
movies_df.head()

# Which tag each user attached to each movie
tags_df.head()

2. Movie representation using genres

. How many movies are there, and how many genres?

→ There are 20 genres and 9,742 movies in total.

total_count = len(movies_df.index)
total_genres = list(set([genre for sublist in list(map(lambda x: x.split('|'), movies_df['genres'])) for genre in sublist]))


print(f"Total number of movies: {total_count}")
print(f"Genres: {total_genres}")

print(len(total_genres))

. genre_count

→ Count how many times each genre appears across all movies (a movie with several genres is counted once for each of them)

# Start every genre at 0 so the loop can simply increment
genre_count = dict.fromkeys(total_genres, 0)

for each_genre_list in movies_df['genres']:
    for genre in each_genre_list.split('|'):
        genre_count[genre] += 1

genre_count
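
The same tally can also be written with `collections.Counter`. The sketch below is only an alternative to the loop above, not the code used in this post; `toy_genres` is made-up data standing in for movies_df['genres'].

```python
from collections import Counter

# Hypothetical stand-in for movies_df['genres'] (pipe-separated genre strings)
toy_genres = ['Comedy|Romance', 'Comedy', 'Action|Comedy']

# Count every genre occurrence; a movie with several genres
# contributes once for each genre it lists
toy_genre_count = Counter(
    genre for genres in toy_genres for genre in genres.split('|')
)

print(toy_genre_count)
```

A Counter behaves like a dict, so it could be passed to the IDF step in the same way as genre_count.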

. np.log10

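→ This is the IDF part: it turns each genre's count into a per-genre weight.

ex) Comedy, which appeared often above (3,756 times), gets a low weight of about 0.41,

while IMAX, which appeared rarely (158 times), gets a high weight of about 1.79.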

for each_genre in genre_count:
    genre_count[each_genre] = np.log10(total_count/genre_count[each_genre])

genre_count
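
The two example weights can be checked by hand from the counts quoted above. This is a small sketch using only the numbers stated in the text (9,742 movies, 3,756 Comedy, 158 IMAX); the helper name `idf` is introduced here for illustration.

```python
import math

total_movies = 9742  # number of movies reported above


def idf(doc_count, total=total_movies):
    """log10(total number of movies / number of movies with the genre)."""
    return math.log10(total / doc_count)


comedy_idf = idf(3756)  # Comedy appears 3756 times
imax_idf = idf(158)     # IMAX appears 158 times

print(round(comedy_idf, 2), round(imax_idf, 2))  # 0.41 1.79
```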

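. Build genre_representation

→ Previously each genre was encoded only as 0 or 1; this time it is encoded with the IDF computed above.

→ As a result, among the genres listed for a movie, the more important (rarer) genres are distinguished from the less important (more common) ones.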

# create genre representations

# Create an empty DataFrame first (one row per movie, one column per genre)
genre_representation = pd.DataFrame(columns=sorted(total_genres), index=movies_df.index)

# Loop over the movies one row at a time
# tqdm wraps the iterator to show progress
for index, each_row in tqdm(movies_df.iterrows(), total=len(movies_df)):

    # Look up the IDF weight of each genre listed for this movie
    dict_temp = {i: genre_count[i] for i in each_row['genres'].split('|')}

    # Build a one-row DataFrame from those weights
    row_to_add = pd.DataFrame(dict_temp, index=[index])

    # Write the row into genre_representation, which has every genre as a column
    genre_representation.update(row_to_add)

# In the result, each movie has a weight in the columns of its own genres.
genre_representation
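
For comparison, the same kind of table can be built without a Python loop. The sketch below is an alternative under stated assumptions, not the method used in this post: `toy_movies` is made-up data, and absent genres come out as 0 instead of NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies_df
toy_movies = pd.DataFrame(
    {'genres': ['Comedy|Romance', 'Comedy', 'Action']},
    index=pd.Index([1, 2, 3], name='movieId'),
)

# 0/1 indicator matrix with one column per genre
indicator = toy_movies['genres'].str.get_dummies(sep='|')

# IDF per genre: log10(number of movies / number of movies with the genre)
toy_idf = np.log10(len(toy_movies) / indicator.sum())

# Scale each genre column by its IDF weight
toy_genre_representation = indicator.mul(toy_idf, axis=1)

print(toy_genre_representation)
```

Since the final representation replaces NaN with 0 anyway, the 0-filled result should be interchangeable with the looped version at that stage.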

3. Movie representation using tags

. Check which tags users attached to movies

tags_df.head(5)

. Look up which movie has movieId=89774

movies_df.loc[89774]

. To compute TF-IDF, first inspect the tag column

→ The movies are described with 1,589 unique tags.

# get unique tags: split each entry on commas, strip whitespace, de-duplicate
tag_column = [tags.split(',') for tags in tags_df['tag']]
unique_tags = list({tag.strip() for sublist in tag_column for tag in sublist})

print(unique_tags)

print(len(tag_column))
print(len(unique_tags))

. Compute the IDF of each tag.

→ tag_idf[each_tag] = np.log10(total_movie_count / tag_count_dict[each_tag])

→ As with genres, the inverse is taken so that frequently occurring tags receive less weight.

. As seen above, a weight is assigned to each unique tag, so tag_idf also has 1,589 entries.

# Compute IDF for tag
total_movie_count = len(set(tags_df['movieId']))
# key: tag, value: number of tag entries that contain the tag
# (a tag applied to the same movie by several users is counted each time)
tag_count_dict = dict.fromkeys(unique_tags, 0)

for each_movie_tag_list in tags_df['tag']:
    for tag in each_movie_tag_list.split(","):
        tag_count_dict[tag.strip()] += 1

tag_idf = dict()
for each_tag in tag_count_dict:
    tag_idf[each_tag] = np.log10(total_movie_count / tag_count_dict[each_tag])

tag_idf

len(tag_idf.keys())	#1589
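
Note that the loop above counts tag entries, so a tag that several users attach to the same movie is counted more than once. The sketch below compares that with counting distinct movies per tag. It uses made-up data (`toy_tags`) and is a variant shown for illustration, not the method used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tags_df: two users tag movie 10 as 'funny'
toy_tags = pd.DataFrame({
    'userId': [1, 2, 3, 3],
    'movieId': [10, 10, 20, 30],
    'tag': ['funny', 'funny', 'funny', 'dark'],
})

n_movies = toy_tags['movieId'].nunique()  # 3 tagged movies

# Counting tag entries (what the loop above does): funny=3, dark=1
row_count = toy_tags['tag'].value_counts()
# Counting distinct movies per tag: funny=2, dark=1
movie_count = toy_tags.groupby('tag')['movieId'].nunique()

idf_by_rows = np.log10(n_movies / row_count)
idf_by_movies = np.log10(n_movies / movie_count)

print(idf_by_rows['funny'], idf_by_movies['funny'])
```

In this toy case 'funny' gets weight 0 when entries are counted, but a positive weight when distinct movies are counted.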

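. For each movie id, assign the weights of the tags attached to it.

→ Each movie carries only a few of the many tags, so the output shows a lot of NaN values. (This is expected, not an error.)

** tag_representation.sort_index() → the original example passes 0 as an argument, but it has to be dropped.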

# Create movie representations
tag_representation = pd.DataFrame(columns=sorted(unique_tags), index=list(set(tags_df['movieId'])))
for name, group in tqdm(tags_df.groupby(by='movieId')):
    temp_list = list(map(lambda x: x.split(','), list(group['tag'])))
    temp_tag_list = list(set(list(map(lambda x: x.strip(), list([tag for sublist in temp_list for tag in sublist])))))

    dict_temp = {i: tag_idf[i.strip()] for i in temp_tag_list}
    row_to_add = pd.DataFrame(dict_temp, index=[group['movieId'].values[0]])
    tag_representation.update(row_to_add)

tag_representation = tag_representation.sort_index()
tag_representation

. Check the two representations built above

print(genre_representation.shape)  # (9742, 20) (number of movies, number of genres)
print(tag_representation.shape)  # (1572, 1589) (number of tagged movies, number of tags)

4. Final Movie Representation

. Combine the genre and tag representations into a single vector per movie

# Concatenate the genre and tag representations; NaN carries no information here, so replace it with 0.
movie_representation = pd.concat([genre_representation, tag_representation], axis=1).fillna(0)

# 20 genres + 1589 tags
# so each movie is represented by 1609 features
print(movie_representation.shape)
print(movie_representation.describe())
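
To make the effect of concat plus fillna concrete, here is a miniature sketch. The two tiny DataFrames and their values are made up for illustration; only the `pd.concat(..., axis=1).fillna(0)` call mirrors the code above.

```python
import pandas as pd

# Hypothetical miniature versions of the two representations
toy_genre_rep = pd.DataFrame(
    {'Comedy': [0.4, 0.4], 'IMAX': [None, 1.8]}, index=[1, 2]
)
toy_tag_rep = pd.DataFrame({'pixar': [3.2]}, index=[1])  # only movie 1 is tagged

# Columns are placed side by side; rows are aligned on the movie id.
# Movie 2 has no tag row, so its tag column is NaN until fillna(0).
toy_movie_rep = pd.concat([toy_genre_rep, toy_tag_rep], axis=1).fillna(0)

print(toy_movie_rep)
```

Movies without any tags keep their genre weights and simply get 0 in every tag column.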

5. Evaluating content similarity

. Use cosine similarity

→ scikit-learn (sklearn) provides it as a ready-made function

from sklearn.metrics.pairwise import cosine_similarity

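def cos_sim_matrix(a, b):
    cos_sim = cosine_similarity(a, b)
    # Label rows and columns with the movie ids. Without column labels,
    # cs_df[1] would select the column at position 1 rather than movieId 1.
    result_df = pd.DataFrame(data=cos_sim, index=a.index, columns=b.index)

    return result_df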

print(movie_representation.head())

. Compute the cosine similarity between every pair of movies. (The matrix also contains each movie's similarity to itself on the diagonal.)

cs_df = cos_sim_matrix(movie_representation, movie_representation)
cs_df.head()
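
Cosine similarity between two vectors u and v is u·v / (|u| |v|). The sketch below computes that quantity with plain NumPy on made-up vectors; it illustrates the formula and is not scikit-learn's implementation. The function name `cosine_sim_matrix` is introduced here for illustration.

```python
import numpy as np


def cosine_sim_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows at 0 instead of dividing by 0
    unit = x / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors


toy_vectors = np.array([
    [1.0, 0.0, 2.0],  # movie A
    [2.0, 0.0, 4.0],  # movie B: same direction as A
    [0.0, 3.0, 0.0],  # movie C: shares no feature with A or B
])
toy_sim = cosine_sim_matrix(toy_vectors)

print(np.round(toy_sim, 2))
```

A and B point in the same direction, so their similarity is 1 even though their magnitudes differ; C shares no feature with them, so its similarity to both is 0.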

. Example) movies ranked by cosine similarity to movie 1

→ In this run, the most similar movie is movieId 46972, with a cosine similarity of about 0.32.

→ So movie 46972 can be recommended to a user who watched movie 1.

print(cs_df.shape)
print(cs_df[1].sort_values(ascending=False))

→ Look up the most related movies from the result above

print(movies_df.loc[1])
print(movies_df.loc[46972])
print(movies_df.loc[126142])
print(movies_df.loc[2043])
print(movies_df.loc[2399])
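
The lookup can be wrapped in a small helper. This is a sketch that assumes a similarity DataFrame whose rows and columns are both labeled with movie ids; `top_similar` and `toy_cs` are hypothetical names and data introduced here.

```python
import pandas as pd


def top_similar(sim_df, movie_id, n=3):
    """Return the n movies most similar to movie_id, excluding movie_id itself."""
    scores = sim_df.loc[movie_id].drop(labels=movie_id)
    return scores.sort_values(ascending=False).head(n)


# Hypothetical 3x3 similarity matrix labeled with movie ids on both axes
ids = [1, 2, 3]
toy_cs = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.1],
     [0.7, 0.1, 1.0]],
    index=ids, columns=ids,
)

toy_top = top_similar(toy_cs, movie_id=1, n=2)
print(toy_top)
```

Dropping the movie's own entry matters because its similarity to itself would otherwise always rank first.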

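6. Evaluating the recommender system

. Split the data into a training set and a test set.
. Compute the RMSE between the predicted and actual ratings on the test set.

. Use ratings_df to create the train/test split.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=1234)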

print(train_df.shape)
print(test_df.shape)

. User ids to use for testing

test_userids = list(set(test_df.userId.values))
test_userids

. Make the final predictions from the rating data and the cosine similarities

result_df = pd.DataFrame()

for user_id in tqdm(test_userids):
    # Rows of the training data that belong to this user
    user_record_df = train_df.loc[train_df.userId == int(user_id), :]

    # Similarity rows for the movies this user rated
    # shape (n, 9742); n is the number of movies the user rated in the training set
    user_sim_df = cs_df.loc[user_record_df['movieId']]

    # Ratings the user gave to those n movies
    user_rating_df = user_record_df[['rating']]  # (n, 1)
    # For every movie, the sum of its similarities to the n rated movies
    sim_sum = np.sum(user_sim_df.T.to_numpy(), -1)  # (9742,)

    # Similarity-weighted sum of the user's ratings (matrix multiplication),
    # divided by (sim_sum + 1); the +1 avoids division by zero when all similarities are 0
    prediction = np.matmul(user_sim_df.T.to_numpy(), user_rating_df.to_numpy()).flatten() / (sim_sum+1)  # (9742,)

    # Keep only the predictions for movies this user rated in the test set
    prediction_df = pd.DataFrame(prediction, index=cs_df.index).reset_index()
    prediction_df.columns = ['movieId', 'pred_rating']
    prediction_df = prediction_df[['movieId', 'pred_rating']][prediction_df.movieId.isin(test_df[test_df.userId == user_id]['movieId'].values)]

    # Attach the actual test ratings and accumulate the result
    temp_df = prediction_df.merge(test_df[test_df.userId == user_id], on='movieId')
    result_df = pd.concat([result_df, temp_df], axis=0)
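
The prediction for one movie is the similarity-weighted sum of the user's ratings divided by (sum of similarities + 1). The sketch below works that formula through on made-up numbers (`toy_user_sim`, `toy_ratings`), with the same array shapes as in the loop above.

```python
import numpy as np

# Hypothetical data: the user rated 2 movies; 3 candidate movies are scored.
# toy_user_sim[i, j] = similarity between rated movie i and candidate movie j
toy_user_sim = np.array([
    [0.5, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
toy_ratings = np.array([[4.0], [2.0]])  # shape (n, 1)

toy_sim_sum = toy_user_sim.T.sum(axis=-1)  # shape (3,)
toy_pred = (toy_user_sim.T @ toy_ratings).flatten() / (toy_sim_sum + 1)

# candidate 0: (0.5*4 + 1.0*2) / (1.5 + 1) = 1.6
# candidate 1: 0 / (0 + 1)                 = 0.0
# candidate 2: (1.0*4) / (1.0 + 1)         = 2.0
print(toy_pred)
```

Candidate 2 is identical (similarity 1.0) to a movie the user rated 4.0, yet its prediction is 2.0: the +1 in the denominator pulls every prediction toward 0, most strongly when the similarity sum is small.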

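. Compare predictions with actual ratings

result_df.head(10)

. RMSE result

→ Performance did not improve dramatically.

→ Why? What matters is which features you build.

→ The overall flow stays the same; what needs to change is how the features are constructed.

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true=result_df['rating'].values, y_pred=result_df['pred_rating'].values)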
rmse = np.sqrt(mse)

print(mse, rmse)
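
As a reminder of what the metric measures, here is RMSE computed by hand on three made-up rating pairs. It is a sketch of the definition (square root of the mean squared error), not a result from this dataset.

```python
import numpy as np

y_true = np.array([4.0, 3.0, 5.0])  # hypothetical actual ratings
y_pred = np.array([3.0, 3.0, 4.0])  # hypothetical predicted ratings

toy_mse = np.mean((y_true - y_pred) ** 2)  # (1 + 0 + 1) / 3
toy_rmse = np.sqrt(toy_mse)

print(toy_mse, toy_rmse)
```

RMSE is in the same unit as the ratings, so it can be read as a typical size of the prediction error in rating points.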

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)