Measuring similarity between Karel programs using character and word n-grams



Abstract

We present a method for measuring similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also explore the contribution of latent semantic analysis to this task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machine classifier, with latent semantic analysis applied and word trigrams used as features.
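The feature extraction named in the abstract can be illustrated with a minimal sketch. It only shows how character and word n-grams might be counted for two programs. The whitespace tokenization, the Karel-like snippets, and the use of cosine similarity as the comparison are assumptions made for illustration; the abstract itself describes a classification setup with latent semantic analysis and Support Vector Machines, which this sketch does not reproduce.

```python
from collections import Counter
from math import sqrt


def word_ngrams(source, n):
    """Count word n-grams, assuming tokens are separated by whitespace."""
    tokens = source.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(source, n):
    """Count character n-grams over the raw program text."""
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors.

    Assumption: cosine similarity stands in for the classifier-based
    comparison described in the abstract.
    """
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like snippets, invented for illustration only.
prog_a = "move ( ) ; move ( ) ; turnleft ( ) ; pickbeeper ( ) ;"
prog_b = "move ( ) ; move ( ) ; turnleft ( ) ; putbeeper ( ) ;"
prog_c = "turnoff ( ) ;"

# Word trigrams, the feature type the abstract reports as most effective.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

In this sketch the two programs that differ in a single instruction score higher than the unrelated pair. Swapping `word_ngrams` for `char_ngrams` gives the character-level variant of the same comparison.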

About the authors

G. Sidorov

Instituto Politécnico Nacional (IPN)

Corresponding author
Email: sidorov@cic.ipn.mx
Mexico, Mexico City

M. Ibarra Romero

Instituto Politécnico Nacional (IPN)

Email: francisco.castillo@upq.mx
Mexico, Mexico City

I. Markov

Instituto Politécnico Nacional (IPN)

Corresponding author
Email: markovilya@yahoo.com
Mexico, Mexico City

R. Guzman-Cabrera

Engineering Division

Corresponding author
Email: guzmanc81@gmail.com
Mexico, Guanajuato

L. Chanona-Hernández

Instituto Politécnico Nacional

Corresponding author
Email: lchanona@gmail.com
Mexico, Mexico City

F. Velásquez

Polytechnic University of Queretaro

Corresponding author
Email: francisco.castillo@upq.mx
Mexico, Queretaro


Copyright © Pleiades Publishing, Ltd., 2017