Research Plan Redundant Programming Code



Date: 05.05.2018


1 Introduction

During the period 2002-2012, we programmed a number of complex, fast-changing computer games. During those projects we realized that some ways of writing code increased our ability to maintain it. Different types of code and different best practices lead to code that is harder or easier to maintain.



In our view, a good best practice increases the programmer’s access to reliable information. The code, in combination with the programming tools (editor, compiler, etc.), should “tell” the programmer how the code is intended to work and how it can be adapted.
Code duplication is generally shunned by programmers. A key principle for programmers is that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system"[And99]. This is generally referred to as the “don’t repeat yourself” or DRY principle[And99]. It is easy to understand how DRY could improve maintenance, since every duplication needs to be separately understood, regarded, or disregarded by the programmer. For example, if a duplicated part of the code contains an error, the error must be fixed in multiple places.
A large part of programming code consists of identifiers. For example, the code of the Eclipse1 editor contains 2 million lines of code, and 72% of its character content consists of identifier names[Dei05]. The concept of “identifier names” includes variable names as well as class and method names. In object-oriented programming languages, variables have types. Types are classes or other constructs that encapsulate domain information.
It is important to know the type of a variable to understand what a program does. Before we had editors that could derive type information for us, it was quite common to include the type information as part of the variable name in the form of a standardized prefix[Sim99]. But in recent years, in languages with explicit typing, type information can be derived from each variable by the editor (IDE, integrated development environment). Thus the reasons for including type name information in variable names are no longer apparent.

1.1 Statement of the Problem


We have observed that code contains a lot of redundant information. This duplicated information mostly consists of type name information. By also putting the type information in the variable name, we violate the DRY principle. We have also observed that type information redundancy seems to be quite common in variable name identifiers, and not only in our own code.
It is unclear what effect this has on maintainability or productivity. The programming literature says that a variable should be named “[…so] the name fully and accurately describe the entity the variable represents”[Ste04], or so that it describes the “role” of the instance in the current context. By using the type name as the variable name, programmers miss the opportunity to provide more information. The code may also be harder to refactor, since renaming a type also implies renaming all instance variables that duplicate that type information. On the other hand, redundancy may be beneficial as a link between pieces of information or as a reminder of the type.
In this study we will investigate the information content of variable identifiers with the aim of finding redundancy. We also aim to find the effects of having, or not having, type information in variable names.
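To make the problem concrete, the following Java sketch contrasts the two naming styles discussed above. It is an illustration only: the Player class and all method and variable names are hypothetical and are not taken from the code studied here.

```java
import java.util.List;

class NamingExample {
    // Hypothetical domain type, used only for this illustration.
    static final class Player {
        final String name;

        Player(String name) {
            this.name = name;
        }
    }

    // Redundant naming: every variable name repeats its declared type,
    // so the name adds nothing that the declaration does not already say.
    static String redundantNames(List<Player> playerList) {
        Player player = playerList.get(0);
        String string = player.name;
        return string;
    }

    // Role-based naming: the names describe what the values mean in this
    // context (a ranking and its leader), while the types stay in the declarations.
    static String roleNames(List<Player> ranking) {
        Player leader = ranking.get(0);
        String leaderName = leader.name;
        return leaderName;
    }
}
```

Both methods behave identically; only the information carried by the names differs. If Player were renamed to, say, Participant, the names in redundantNames would become misleading until they were renamed as well, whereas the names in roleNames would remain valid.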

1.2 Previous Research


Identifier names have been investigated in a number of studies. Some focus on the syntactical buildup of identifiers, for example whether identifiers are built from acronyms, abbreviations, or whole words, and the effect of these different types on readability[McC04]. Not surprisingly, a study reports that identifiers consisting of whole words give the best understandability[Law06]. Consequently, there are reports on tools that help programmers translate from abbreviations to whole-word identifiers[Cap00]. De Lucia et al. investigate the introduction of a lexicon to improve the vocabulary and avoid multiple terms for the same concept[DeL11]. There are also studies that try to link areas of code with flawed identifier naming to other code defects[But09]. In contrast, we focus on the information content (meaning) of variable names and not on their syntax.
Studies on code duplication, or “code clones”, detect clones in different ways, for instance by matching text with tools like Duplo2. They link clones to errors[Bet09] or to areas of code that are changed more or less frequently[Kei12]. Our study differs in that we are not looking for cloned code but for duplicated information. As such, our study is much finer grained.
Despite our best efforts we have not found any studies on redundancy between identifier names and their types.

1.3 Significance


This work is significant for programmers who develop and maintain code. It is also important for educating students in how to effectively store information in code.
The potential to uncover ambiguous or bad naming automatically can also be beneficial for writing new programming tools or languages.

1.4 Purpose


The purpose of this study is to find ways of writing code with less redundancy and high maintainability without sacrificing productivity.

1.5 Delimitations


This study is limited to type and variable names and to a limited set of programming languages. We will limit the study to open-source projects and to studies conducted on students. We also limit ourselves to implicit information (naming and comments), with the intention of expanding the study to explicit information storage (rules and constraints) after the licentiate.

1.6 Research Questions and Hypotheses


Here I state the current research questions and working hypotheses. These are subject to constant change. Please note that every hypothesis has a null hypothesis stating no effect or no difference.
RQ1. What effect does a coding practice for storing information have on code maintainability? This question is important for judging if time should be invested in implementing a coding practice or not.

  • HA. Writing comments on arguments has a significant effect on productivity in a maintenance task.

  • HB. Writing type-safe arguments has a significant effect on productivity in a maintenance task.

RQ2. How is information stored in source code?

This question is important to understand storage and retrieval of information. We have multiple ways of storing information, implicit and explicit. This study intends to focus on the implicit ways of storing information.
RQ2.1. What kind of information is stored in variable names?

This question is important for identifying and ranking alternatives to type information, and for understanding what information could be missing.


I have four sub-questions regarding type information:

RQ2.1.1. How large a share of variable names is dedicated to type information? We have seen that our code is full of redundancy; do other programmers do the same?



  • HA. Type information is more common for instances of classes than for primitive data types.

  • HB. Type information is more common for fields than parameters and local variables.

  • HC. The amount of type information varies widely between programmers.



RQ2.1.2. Why is type information included in variable names?

  • HA. A language with implicit typing has a significantly larger part of the variable name dedicated to type information.

  • HB. A language with implicit typing has a significantly larger part of the comments dedicated to painted type information.

  • HC. Old code has more type information. It could be that it is there just for historical reasons.



RQ2.1.3. How much of the variable names for primitive types can be considered “painted” type information? A painted type is an implicit type with its own behavior, for which the programmer uses a more generic type instead. How can painted types be automatically classified? This is important since we suspect painted types to be a major cause of unnecessary and hard-to-detect bugs. If they are replaced with explicit types, we may improve maintainability and get explicit type checks.

  • HA. A painted type has more comments on its value-range.

  • HB. A painted type has constraints, asserts or if-statements.
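The following Java sketch shows what a painted type and its explicit replacement might look like. It is a hypothetical illustration of the definition above, assuming a percentage that is stored in a plain int; the names Percent, scaleVolumePainted and scaleVolumeExplicit are invented for this example.

```java
class PaintedTypeExample {
    // Painted type: the parameter is declared as a generic int, but its name,
    // the comment on its value range and the if-statement all reveal that it
    // is really a percentage with its own rules.
    /** @param volumePercent a value in the range 0..100 */
    static int scaleVolumePainted(int volumePercent, int maxVolume) {
        if (volumePercent < 0 || volumePercent > 100) {
            throw new IllegalArgumentException("volumePercent out of range: " + volumePercent);
        }
        return maxVolume * volumePercent / 100;
    }

    // Explicit type: the value range is checked once, in the constructor,
    // and the behavior lives with the type.
    static final class Percent {
        private final int value;

        Percent(int value) {
            if (value < 0 || value > 100) {
                throw new IllegalArgumentException("percent out of range: " + value);
            }
            this.value = value;
        }

        int of(int amount) {
            return amount * value / 100;
        }
    }

    // With the explicit type, the compiler rejects a call that passes an
    // arbitrary int where a Percent is expected.
    static int scaleVolumeExplicit(Percent volume, int maxVolume) {
        return volume.of(maxVolume);
    }
}
```

The painted version carries the indicators named in HA and HB (a value-range comment and a guarding if-statement), which is what an automatic classifier could look for. The explicit version moves the same knowledge into a type, so the parameter name no longer has to carry it.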



RQ2.1.4. When removing type information from a variable, what is left?
Here I expect a classification system for variable information content.

  • What can be considered type information? Anything that can be derived? Scope, Plural for arrays etc.? Synonyms?

  • What names are strongly associated with each class of information?

RQ2.2. What is the effect of redundant type information on maintainability and understandability? Are programmers missing an opportunity to introduce more domain or problem specific information? Is this information stored elsewhere?



  • HA. Variables with type names have more information in, for instance, comments.

  • HB. Implicit redundancy is used to relate (link) new information.

  • HC. Implicit redundancy is used to repeat information to improve readability/understandability.

  • HD. Implicit redundancy is used to conform to a standard.

RQ2.3. Is the information stored and retrieved the same?



  • HA. Programmers interpret variable names in different ways.

  • HB. Domain knowledge improves the ability to retrieve information.


3. Methods




3.1 Mixed Method Research Design


This study will use a mixed research design. The study started with a quantitative experiment leading to a set of mixed method studies. With large repositories of open source code, we can use quantitative analysis to answer questions by measuring different metrics. But to understand what these metrics do and do not mean, and to validate them, we need to capture the experience of programmers.

3.1.1 Experiment: Code Quality


This experiment ran during autumn 2013. It investigated the effect of two programming best practices. The hypothesis tested was that good best practices should lead to high-quality code, and that high-quality code should be easier to maintain (RQ1, HA, HB).

The experiment found no difference in maintainability between using and not using the practices. When searching for explanations, we found that the effects of the best practices were probably negated by a high degree of redundant information in variable names, comments and type names. This redundancy may seriously reduce the effects of the best practices.


Artifacts:

  • 2*2 factorial experimental dataset

  • Result: No observed effect; unable to reject the null hypotheses for HA and HB.

  • Hopefully poster at


3.1.2 Observational Study: Type names in variable identifiers


This quantitative study is done to find out how common type name redundancy is in programming code (RQ2.1.1). The study focuses on open source Java and PHP code. The goal is to compare the two languages and see whether our prediction is correct that implicitly typed PHP code contains more type information than explicitly typed Java code (RQ2.1.2).

Data will be collected from open source projects. The idea is also to collect data for future studies of RQ2.1.3 and RQ2.1.4.

The study also produces a tool for measuring type name duplication in variable names. As an example result, the code in experiment 3.1.1 had over 70% type name duplication in its non-primitive variable names.

Timeframe: VT14
Artifacts:


  1. Paper on type name redundancy

  2. Random mining tool for GitHub projects. It extracts random Java files from the 400K Java projects on GitHub, enabling quantitative research on the population of “open source Java projects”.

  3. Static type name and variable name extractor tool for Java

    1. Extracts type name, scope and variable name information from single Java files

  4. Dynamic type name extractor tool for PHP

    1. Extracts type name, scope and variable name information from running PHP projects.

  5. Type name redundancy finder tool.

    1. Given type name and variable name and scope it answers the question: What information is unique or redundant?

  6. Java 800K variable type, scope and name dataset.

  7. PHP variable type and name dataset.
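The question asked by the redundancy finder (artifact 5) can be illustrated with a small sketch. This is not the tool produced by the study. It is a minimal approximation, assuming that identifiers can be split into words at case changes and non-alphanumeric characters, and that a word in a variable name is redundant when the same word occurs in the type name. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TypeNameRedundancy {
    // Splits an identifier such as "playerList" or "List<Player>" into
    // lower-case words: ["player", "list"] and ["list", "player"].
    static List<String> words(String identifier) {
        List<String> result = new ArrayList<>();
        // Split on any run of non-alphanumeric characters, or at the
        // boundary between a lower-case letter or digit and an upper-case letter.
        for (String part : identifier.split("[^A-Za-z0-9]+|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                result.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    // Returns the share (0.0 to 1.0) of words in the variable name
    // that also occur in the type name.
    static double redundantShare(String typeName, String variableName) {
        Set<String> typeWords = new HashSet<>(words(typeName));
        List<String> nameWords = words(variableName);
        if (nameWords.isEmpty()) {
            return 0.0;
        }
        int redundant = 0;
        for (String word : nameWords) {
            if (typeWords.contains(word)) {
                redundant++;
            }
        }
        return (double) redundant / nameWords.size();
    }
}
```

Under these assumptions, redundantShare("Player", "player") is 1.0, redundantShare("List", "playerList") is 0.5, and redundantShare("Player", "leader") is 0.0. The sketch deliberately ignores plurals, synonyms, abbreviations and scope prefixes, which are exactly the open classification questions raised under RQ2.1.4.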



3.1.3 Observational Study: Redundant Comments


This study focuses on RQ2: How is information stored in, and retrieved from, code?

The study should focus on what kind of information is stored in comments and on how much of the comments is redundant.


The study started VT2014 but is on hold since the master's student is on vacation.

3.1.4 Experiment: Effect of redundancy


This experiment, planned for autumn 2014, is intended to show the effects, or lack of effects, of redundancy in code. Using the experiences from studies 3.1.1, 3.1.2 and 3.1.3, we intend to test hypotheses on type name redundancies. It is intended to complement the 3.1.1 study.

3.1.5 Qualitative Study: Code is ambiguous


This qualitative study is intended to understand why programmers store certain information in variable names. The main focus is to categorize the different types of information stored and to see if different programmers interpret the same information in different ways. Focus lies on RQ2.1.4, RQ2.2, and RQ2.3.

The categorization part has begun, but the study is intended for VT2015.



3.2 Setting


The research will be conducted in an academic environment, and its results are intended to be applied in programming education. Experiments will be done with programming students, primarily our own, but hopefully also students from other universities.
To get more external validity, we will conduct measurements on open-source code, and hopefully also on closed-source code.

3.3 Researcher's role and prerequisites


My background is in programming computer games and teaching programming to students. With over 20 years of programming experience, I have a lot of experience of both succeeding and failing projects. My experience is that productivity and quality are often opposed to each other. As a researcher I would like to find ways of programming that are lightweight and productive, but at the same time produce understandable code of high quality. My role is to ask questions and construct methods to answer them.


3.4 Methods for collecting and analyzing quantitative data


Experiments will be used to test hypotheses[Woh00]. Data from these will be collected using online questionnaires, file uploads and custom experiment tools.

The observational studies will use customized tools for code analysis. These tools will be published as open source to improve reproducibility. Data is collected from open source repositories such as GitHub3.

Data will be collected into Microsoft Excel documents. For statistical analysis we will use Excel, Google spreadsheets, and R. For hypothesis testing we will use statistical methods chosen according to the data distributions.

3.5 Methods for collecting and analyzing qualitative data


Qualitative data will be collected using online questionnaires and recorded interviews. Qualitative data will be manually categorized and analyzed to improve theory and to form new hypotheses.

3.6 Methods for collecting and analyzing mixed methods data


The quantitative methods will be used to collect and analyze large amounts of code, but also to test hypotheses. The qualitative studies are intended to answer the questions of why and how programmers do things. This information will be used to form and improve models and hypotheses. The mixed method design is thus both explanatory and convergent[Cre14].


3.7 Methods for testing validity and reliability


The experiments will be constructed to minimize the validity threats. However, our experience is that with student subjects and small-scale experiments we will have to count on severe problems with both internal and external validity. A good overview of common threats to validity in experiments is provided by Wohlin et al.[Woh00].

To counter the validity threats of experiments, multiple, varied sources of information will be combined for triangulation.



3.8 Ethical Considerations


Using students as subjects in experiments during courses calls for an ethical discussion. There are several ethical dilemmas that limit the possibilities of running experiments as part of a course. The experiments must of course be conducted in line with the law as well as good scientific practice[Gus11].

  • Collected data must be handled with care.

  • It must be voluntary to participate.

  • Subjects must be informed about the experiment.

If the experiment is a part of a course, we might get conflicting goals between running the course and running an experiment.



  • The student as a subject must get something out of the experiment in terms of experience, knowledge or a reward.

  • There must be a clear separation between experiment and course examination.

  • A student must be able to pass the course even without taking part in the experiment.

Using students as subjects also affects the external validity of the study. However, a study by Höst et al. found only “minor differences” between students and professional developers[MHö00]. But can we really expect our second-year students to represent mature programmers? I’m not so sure. Many software engineering studies combine results from students and professional programmers to improve the external validity.



3.9 Expected Results


The expected result of this study is a series of papers giving answers to the research questions.

The study should also result in a series of research tools.



3.10 Reflection


The purpose of this study is to give an in-depth understanding of how programmers store and retrieve information in source code. This was inspired by personal programming experiences and by discussions with other programmers and students. The DRY programming principle is highly regarded by programmers, but we constantly find ourselves doing repetitive tasks and introducing redundancies.
Planning this study has been very useful for me in many ways. It has made me introduce more structure, think about the goal of the study, and take time to reflect on bias and validity.

As the research questions of this study show, I’m constantly asking new questions and formulating new hypotheses. It is not always possible to answer these questions within a single study.



4. References


[KoA06] A. J. Ko, B. A. Myers, M. J. Coblenz and H. H. Aung, "An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks," IEEE Transactions on Software Engineering, pp. 971-987, 2006.

[And99] A. Hunt and D. Thomas, The Pragmatic Programmer: From Journeyman to Master, Massachusetts: Addison Wesley Longman, Inc., 1999.

[Dei05] F. Deissenböck and M. Pizka, "Concise and consistent naming," in Proceedings of the 13th International Workshop on Program Comprehension (IWPC 2005), 2005.

[Sim99] C. Simonyi, Hungarian Notation, Microsoft Corporation, 1999.

[Ste04] S. C. McConnell, Code Complete: A Practical Handbook of Software Construction, Second Edition, Washington: Microsoft Press, 2004.

[McC04] S. C. McConnell, Code Complete: A Practical Handbook of Software Construction, Second Edition, Washington: Microsoft Press, 2004.

[Law06] D. Lawrie, C. Morrell, H. Feild and D. Binkley, "What's in a Name? A Study of Identifiers," in ICPC 2006. 14th IEEE International Conference on Program Comprehension, Athens, 2006.

[Cap00] B. Caprile and P. Tonella, "Restructuring program identifier names," in Proceedings. International Conference on Software Maintenance, 2000.

[DeL11] A. De Lucia, M. Di Penta and R. Oliveto, "Improving Source Code Lexicon via Traceability and Information Retrieval," IEEE Transactions on Software Engineering, vol. 37, pp. 205-227, 2011.

[But09] S. Butler, M. Wermelinger, Y. Yu and H. Sharp, "Relating Identifier Naming Flaws and Code Quality: An Empirical Study," in WCRE '09. 16th Working Conference on Reverse Engineering, Lille, 2009.

[Bet09] N. Bettenburg, W. Shang, W. Ibrahim, B. Adams, Y. Zou and A. E. Hassan, "An Empirical Study on Inconsistent Changes to Code Clones at Release Level," in WCRE '09 Proceedings of the 2009 16th Working Conference on Reverse Engineering, Washington, DC, 2009.

[Kei12] K. Hotta, Y. Sasaki, Y. Sano, Y. Higo and S. Kusumoto, "An Empirical Study on the Impact of Duplicate Code," Advances in Software Engineering, vol. 2012, p. 22, 2012.

[Woh00] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell and A. Wesslén, Experimentation in Software Engineering: An Introduction, Massachusetts: Kluwer Academic Publishers, 2000.

[Cre14] J. W. Creswell, Research Design: Qualitative, Quantitative, & Mixed Methods Approaches, 4th ed., London: Sage Publications, Inc., 2014.

[Gus11] B. Gustavsson, G. Hermerén and B. Pettersson, "God Forskningssed," Vetenskapsrådet, Stockholm, 2011.

[MHö00] M. Höst, B. Regnell and C. Wohlin, "Using Students as Subjects - A Comparative Study of Students and Professionals in Lead-Time Impact Assessment," Empirical Software Engineering: An International Journal, vol. 5, pp. 201-214, 2000.




1 https://www.eclipse.org/

2 http://duplo.sourceforge.net/

3 https://github.com/

