python - LDA with gensim: How to update a Postgres database with the correct topic number for every document?
I am taking different documents from a database and checking with LDA (gensim) what kind of latent topics there are in these documents. This works pretty well. What I would like to do now is save, in the database, the most probable topic for every document, and I am not sure what the best solution for that is. I could, for example, at the beginning extract the unique id of every document from the database together with the text_column, and somehow process it so that at the end I know which id belongs to which topic number. Or maybe I should do it in the last part, where I print the documents and topics, but I don't know how to connect that back to the database. Would it be a comparison of the text_column with the document and then assigning the corresponding topic number? I would be grateful for any comment.
import string
from itertools import izip
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import ldamodel

stop = stopwords.words('english')

sql = """SELECT text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;"""
cur.execute(sql)
dbrows = cur.fetchall()
conn.commit()

documents = []
for i in dbrows:
    documents = documents + list(i)

# remove words from the stoplist and tokenize
stoplist = stopwords.words('english')
additional_list = set("``;''".split(";"))
texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist
          and word not in string.punctuation
          and word.lower() not in additional_list]
         for document in documents]

# remove words that appear 2 times or less
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

my_num_topics = 10

# LDA
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
corpus_lda = lda[corpus]

# print the most contributing words for the selected topics
for top in lda.show_topics(my_num_topics):
    print top

# print the most probable topic and the document
for l, t in izip(corpus_lda, documents):
    selected_topic = max(l, key=lambda item: item[1])
    if selected_topic[1] != 1/my_num_topics:
        selected_topic_number = selected_topic[0]
        print selected_topic
        print t
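For context, here is a minimal single-document sketch of what the final loop above relies on, assuming the lda model and dictionary built in the code; the sample sentence is made up. lda[bow] returns a list of (topic_id, probability) pairs, and max with a key function picks the most probable topic:

# hypothetical new text, tokenized the same way as above
bow = dictionary.doc2bow("some new text to classify".lower().split())
topic_probs = lda[bow]                      # e.g. [(0, 0.08), (3, 0.71), ...]
best_topic, best_prob = max(topic_probs, key=lambda item: item[1])
print best_topic, best_prob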
As wildplasser has commented, I had to select the id together with the text_column. I was trying that before, but the way I was appending the data to a list was not suitable for further processing. The code below works and as a result creates a table with the id and the number of the most probable topic.
import string
from itertools import izip
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import ldamodel

stop = stopwords.words('english')

sql = """SELECT id, text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;"""
cur.execute(sql)
dbrows = cur.fetchall()
conn.commit()

documents = []
for i in dbrows:
    documents.append(i)

# remove words from the stoplist and tokenize
stoplist = stopwords.words('english')
additional_list = set("``;''".split(";"))
texts = [[word.lower() for word in document[1].split()
          if word.lower() not in stoplist
          and word not in string.punctuation
          and word.lower() not in additional_list]
         for document in documents]

# remove words that appear 2 times or less
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

my_num_topics = 10

# LDA
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
corpus_lda = lda[corpus]

# print the most contributing words for the selected topics
for top in lda.show_topics(my_num_topics):
    print top

# collect the most probable topic for every document together with its id
lda_topics = []
for l, t in izip(corpus_lda, documents):
    selected_topic = max(l, key=lambda item: item[1])
    if selected_topic[1] != 1/my_num_topics:
        selected_topic_number = selected_topic[0]
        lda_topics.append((selected_topic[0], int(t[0])))

cur.execute("""CREATE TABLE table_topic (id bigint PRIMARY KEY, topic int);""")
for j in lda_topics:
    my_id = j[1]
    topic = j[0]
    cur.execute("INSERT INTO table_topic (id, topic) VALUES (%s, %s)", (my_id, topic))
conn.commit()
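As a side note on the final write-back step: the same (id, topic) pairs can be inserted in a single call with psycopg2's executemany instead of a Python-level loop. This is only a sketch assuming the conn, cur and lda_topics objects from the code above and the same table_topic table:

cur.execute("""CREATE TABLE IF NOT EXISTS table_topic
               (id bigint PRIMARY KEY, topic int);""")
# lda_topics holds (topic_number, document_id) tuples, so swap the order
# to match the (id, topic) column list
cur.executemany("INSERT INTO table_topic (id, topic) VALUES (%s, %s)",
                [(doc_id, topic) for topic, doc_id in lda_topics])
conn.commit()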