Thesis

Master's • Master's in Data Science

What are people talking about in your region? Applying a topic modeling approach to Portuguese geolocated tweets

Author

Rosa, Érica Sofia Palmeirim Santos

Access

Open access

Keywords

Topic modeling

Short text clustering

Dirichlet multinomial mixture

Latent Dirichlet allocation

Movie group process

Naive Bayes classifier

Twitter in Portugal

Tweets geolocation

Abstract

The main objective of this thesis is to identify the topics that Portuguese users discussed on the social network Twitter during the first six months of 2021, in each region of the country. Through the Twitter API, we obtained a dataset of around 1 million tweets written across the country during this period. We then built a dictionary of words that maps each Portuguese locality mentioned in the dataset to a NUTS level 2 region, so that each tweet is assigned exactly one of five regions: Alentejo, Algarve, Centre, Lisbon, or North. Next, we reviewed the topic modeling approaches most widely used at present and, in particular, those that have shown the best performance on short texts such as tweets. Following this literature review, we applied two models to our dataset and evaluated their performance: Latent Dirichlet Allocation (LDA) and the Multinomial Mixture (MM) model. Measuring topic coherence for both models, we obtained more satisfactory results with MM and therefore selected it for our dataset. With topics defined and assigned to each tweet, we analyzed the topics most mentioned by the Portuguese, by region and by day. We conclude that, in the sample collected from Twitter, the most discussed topics in Portugal are politics, religion and faith, football players, and food and cooking. Among the findings of the analysis by region and by day, the topic of food and cooking stands out in the Algarve and the North, and the topic of elections gains prominence across the country between the end of January and mid-February.
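As a rough illustration of the dictionary-based region assignment and the per-region, per-day topic tally described in the abstract, the following Python sketch may help. It is not the thesis's implementation: the dictionary entries, the function names `assign_region` and `topic_counts`, and the tweet fields `location`, `day`, and `topic` are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional

# Hypothetical locality -> NUTS 2 region dictionary. This tiny subset is for
# illustration only; it is not the dictionary built in the thesis.
LOCALITY_TO_REGION = {
    "porto": "North",
    "braga": "North",
    "coimbra": "Centre",
    "lisboa": "Lisbon",
    "évora": "Alentejo",
    "faro": "Algarve",
}


def assign_region(location_text: str) -> Optional[str]:
    """Return the single region matched in a free-text location, else None."""
    tokens = location_text.lower().replace(",", " ").split()
    regions = {LOCALITY_TO_REGION[t] for t in tokens if t in LOCALITY_TO_REGION}
    # Keep only unambiguous matches so each tweet gets exactly one region.
    return regions.pop() if len(regions) == 1 else None


def topic_counts(tweets) -> Counter:
    """Count tweets per (region, day, topic), skipping tweets with no region."""
    counts = Counter()
    for tweet in tweets:
        region = assign_region(tweet["location"])
        if region is not None:
            counts[(region, tweet["day"], tweet["topic"])] += 1
    return counts


# Assumed input shape: each tweet already carries a topic label from the model.
sample = [
    {"location": "Faro, Portugal", "day": "2021-01-24", "topic": "elections"},
    {"location": "Faro", "day": "2021-01-24", "topic": "elections"},
    {"location": "Madrid", "day": "2021-01-24", "topic": "food"},
]
counts = topic_counts(sample)
```

Tweets whose location matches no dictionary entry, or matches more than one region, are dropped here. That is one simple way to guarantee a single region per tweet; the thesis may resolve such cases differently.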
