python - LDA gensim. How to update a Postgres database with the correct topic number for every document?


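I am taking different documents from a database and checking with LDA (gensim) what kind of latent topics there are in these documents. This works pretty well. What I would like to do is save in the database, for every document, its most probable topic, and I am not sure what the best solution is.

I could, for example, extract the unique id of every document from the database at the beginning, together with text_column, and somehow process it so that I know at the end which id belongs to which topic number. Or maybe I should do it in the last part, where I print the documents and their topics, but I don't know how to connect that back to the database. By comparing text_column with the document and assigning the corresponding topic number? I would be grateful for any comment.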

    import string

    from gensim import corpora
    from gensim.models import ldamodel
    from nltk.corpus import stopwords

    # conn and cur are assumed to be an open database connection and cursor
    sql = """SELECT text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;"""
    cur.execute(sql)
    dbrows = cur.fetchall()
    conn.commit()

    # flatten the one-column rows into a plain list of texts
    documents = []
    for i in dbrows:
        documents = documents + list(i)

    # remove the words from the stoplist and tokenize
    stoplist = stopwords.words('english')
    additional_list = set("``;''".split(";"))

    texts = [[word.lower() for word in document.split()
              if word.lower() not in stoplist
              and word not in string.punctuation
              and word.lower() not in additional_list]
             for document in documents]

    # remove words that appear two times or fewer
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
    texts = [[word for word in text if word not in tokens_once]
             for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    my_num_topics = 10

    # LDA itself
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
    corpus_lda = lda[corpus]

    # print the most contributing words for the selected topics
    for top in lda.show_topics(my_num_topics):
        print(top)

    # print the most probable topic and the document
    for l, t in zip(corpus_lda, documents):
        selected_topic = max(l, key=lambda item: item[1])
        if selected_topic[1] != 1 / my_num_topics:
            selected_topic_number = selected_topic[0]
            print(selected_topic)
            print(t)

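As wildplasser has commented, I just had to select the id together with text_column. I had tried that before, but the way I was appending the data to the list was not suitable for further processing. The code below works and, as a result, creates a table with the id and the number of the most probable topic.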

    import string

    from gensim import corpora
    from gensim.models import ldamodel
    from nltk.corpus import stopwords

    # conn and cur are assumed to be an open database connection and cursor
    sql = """SELECT id, text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;"""
    cur.execute(sql)
    dbrows = cur.fetchall()
    conn.commit()

    # keep every row whole as an (id, text) tuple
    documents = []
    for i in dbrows:
        documents.append(i)

    # remove the words from the stoplist and tokenize
    stoplist = stopwords.words('english')
    additional_list = set("``;''".split(";"))

    texts = [[word.lower() for word in document[1].split()
              if word.lower() not in stoplist
              and word not in string.punctuation
              and word.lower() not in additional_list]
             for document in documents]

    # remove words that appear two times or fewer
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
    texts = [[word for word in text if word not in tokens_once]
             for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    my_num_topics = 10

    # LDA itself
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
    corpus_lda = lda[corpus]

    # print the most contributing words for the selected topics
    for top in lda.show_topics(my_num_topics):
        print(top)

    # collect the most probable topic together with the document id
    lda_topics = []
    for l, t in zip(corpus_lda, documents):
        selected_topic = max(l, key=lambda item: item[1])
        if selected_topic[1] != 1 / my_num_topics:
            selected_topic_number = selected_topic[0]
            lda_topics.append((selected_topic[0], int(t[0])))

    cur.execute("""CREATE TABLE table_topic (id bigint PRIMARY KEY, topic int);""")
    for j in lda_topics:
        my_id = j[1]
        topic = j[0]
        cur.execute("INSERT INTO table_topic (id, topic) VALUES (%s, %s)", (my_id, topic))
    # commit once, after all rows have been inserted
    conn.commit()
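The pairing step at the end can also be pulled out into a small function, which makes it easy to check without a database or a trained model. The following is only a sketch under assumptions: the name `most_probable_topics` and the sample data are hypothetical, and the input is assumed to have the shape of `lda[corpus]`, that is, one list of (topic_id, probability) tuples per document. It also compares against the uniform probability with float division and a tolerance, because under Python 2 the expression `1/my_num_topics` is integer division and evaluates to 0.

```python
import math


def most_probable_topics(doc_ids, topic_distributions, num_topics):
    """Pair each document id with its most probable topic.

    topic_distributions is assumed to hold, per document, a list of
    (topic_id, probability) tuples.
    """
    uniform = 1.0 / num_topics  # float division, also under Python 2
    rows = []
    for doc_id, distribution in zip(doc_ids, topic_distributions):
        if not distribution:
            continue  # no topic was assigned to this document
        topic_id, probability = max(distribution, key=lambda item: item[1])
        # Skip documents whose best topic is no better than a uniform guess.
        if math.isclose(probability, uniform):
            continue
        rows.append((int(doc_id), int(topic_id)))
    return rows


# Hypothetical sample data standing in for the ids and for lda[corpus].
sample_ids = [101, 102, 103]
sample_distributions = [
    [(0, 0.1), (3, 0.7), (1, 0.2)],
    [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)],
    [(2, 0.9), (1, 0.1)],
]
rows = most_probable_topics(sample_ids, sample_distributions, num_topics=4)
print(rows)
```

Because the rows come out in (id, topic) order, they could be handed to the cursor in one call, for example `cur.executemany("INSERT INTO table_topic (id, topic) VALUES (%s, %s)", rows)`, assuming a DB-API driver such as psycopg2 that uses `%s` placeholders.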
