On the use of statistical machine translation for suggesting variable names for decompiled code: The Pharo case

Sandoval Alcocer, Juan Pablo; Camacho-Jaimes, Harold; Galindo-Gutierrez, Geraldine; Neyem, Hugo Andrés; Bergel, Alexandre; Ducasse, Stéphane

dc.catalogador	jlo
dc.contributor.author	Sandoval Alcocer, Juan Pablo
dc.contributor.author	Camacho-Jaimes, Harold
dc.contributor.author	Galindo-Gutierrez, Geraldine
dc.contributor.author	Neyem, Hugo Andrés
dc.contributor.author	Bergel, Alexandre
dc.contributor.author	Ducasse, Stéphane
dc.date.accessioned	2024-05-31T13:16:00Z
dc.date.available	2024-05-31T13:16:00Z
dc.date.issued	2024
dc.description.abstract	Adequately selecting variable names is a difficult activity for practitioners. In 2018, Jaffe et al. proposed the use of statistical machine translation (SMT) to suggest descriptive variable names for decompiled code. A large corpus of decompiled C code was used to train the SMT model. Our paper presents the results of a partial replication of the experiment by Jaffe et al. We apply the same technique and methodology to a dataset of code written in the Pharo programming language. We selected Pharo since its syntax is simple – it fits on half of a postcard – and because the optimizations performed by the compiler are limited to method scope. Our results indicate that SMT may recover between 8.9% and 69.88% of the variable names depending on the training set. Our replication concludes that: (i) the accuracy depends on the code similarity between the training and testing sets; (ii) the simplicity of the Pharo syntax and the satisfactory decompiled code alignment have a positive impact on predicting variable names; and (iii) a relatively small code corpus is sufficient to train the SMT model, which shows the applicability of the approach to less popular programming languages. Additionally, to assess SMT’s potential in improving original variable names, ten Pharo developers reviewed 400 SMT name suggestions, with four reviews per variable. Only 15 suggestions (3.75%) were unanimously viewed as improvements, while 45 (11.25%) were perceived as improvements by at least two reviewers, highlighting SMT’s limitations in providing suitable alternatives.
dc.fechaingreso.objetodigital	2024-08-30
dc.fuente.origen	ORCID
dc.identifier.doi	10.1016/j.cola.2024.101271
dc.identifier.issn	2590-1184
dc.identifier.uri	https://doi.org/10.1016/j.cola.2024.101271
dc.identifier.uri	https://repositorio.uc.cl/handle/11534/86122
dc.identifier.wosid	WOS:001217002000001
dc.information.autoruc	Escuela de Ingeniería; Sandoval Alcocer, Juan Pablo; S/I; 1210748
dc.information.autoruc	Escuela de Ingeniería; Neyem, Hugo Andrés; 0000-0002-5734-722X; 1007638
dc.language.iso	en
dc.nota.acceso	Partial content
dc.rights	restricted access
dc.subject	Statistical machine translation
dc.subject	Decompiled code
dc.subject	Identifiers
dc.subject	Variable names
dc.subject.ddc	000
dc.subject.dewey	Computer science
dc.title	On the use of statistical machine translation for suggesting variable names for decompiled code: The Pharo case
dc.type	article
sipa.codpersvinculados	1210748
sipa.codpersvinculados	1007638
sipa.trazabilidad	ORCID;2024-05-27

Files

Original bundle

Name: On the use of statistical machine translation for suggesting variable names for decompiled code - The Pharo case.pdf
Size: 3.12 KB
Format: Adobe Portable Document Format

Collections

Journal articles