UNMC_Acronym_Vert_sm_4c
University of Nebraska Medical Center

Preparing Data In Excel

The members of CCORDA have expertise in design and planning of studies, including preparation of data collection forms and database creation. We encourage researchers to include CCORDA in all phases of the study from design to analysis and dissemination of results. Some researchers collect and enter their own data for analysis, which makes accurate data entry critical for the success of the study. We have prepared some helpful hints for entering data into an Excel Workbook for ease in statistical analysis.  

Microsoft Excel can be a useful platform to enter and maintain research study data. Researchers can use Excel's simple statistical and plotting functions to help gain insight into their data. However, most research projects require more extensive statistical techniques that can be most easily performed using additional statistical software packages such as SAS or SPSS software.

To have your data easily imported into a statistical software package, we have developed these guidelines for data entry into Excel.

Here is a good example of data entry into an Excel file followed by guidelines for data entry.

Guidelines for a Good Excel Data Sheet

Put variables in columns and observation in rows.
  • Include a unique identifying number for each case.
  • Be sure that each variable name is unique (no duplicate variable names).
Put variable names in the first row.
  • Variable names must start with a letter.
  • Do not include special characters (#, !, ?, %, etc.) or spaces in your variable names.
  • Choose readily recognizable names for variables - but not too long (<= 16 characters best).
Use a separate column for each piece of information.

Don’t enter data such as "120/80" for blood pressure. Enter systolic blood pressure as one variable and diastolic blood pressure as another variable. Don't enter data as "A,C,D" or "BDF" if there are three possible answers to a question. Include a separate column for each answer.

When entering dates (especially for years prior to 1930) include a 4 digit year.

Two digit years can cause problems for statistical software when reading data from Excel files. The best format for dates is mm/dd/yyyy, where mm is a 2 digit month, dd is a 2 digit day and yyyy is a 4 digit year.

Decide on "missingness" conventions.

Missing data can cause many problems. To enter a missing data value either enter a blank or an "impossible" numeric code (for numbers) or an easily recognizable single digit character code for character (trying to avoid mixing numeric and character data). Be sure, if you use a missing value code, that it cannot be confused with a "real" data value.

Be consistent in your data entry.

When entering data keep the same format throughout.

Good Example

The date variable has the same format (mm/dd/yyyy) and the sex variable is consistent throughout in both case and type (character variable).

ID 

DOB 

Sex 

12/31/1976 

01/01/1977 

01/02/1977 

01/03/1977 

01/04/1977 

Bad Example

The date variable is in different formats without a 4-digit year for all the observations. The sex variable is still a character variable, but statistical software will read this variable as having six different levels instead of two.

ID 

DOB 

Sex 

12/31/1976 

1-Jan-77 

01/02/1977 

Female 

01/03/77 

01/04/1977 

Male 

Other Guidelines

Use only one worksheet for your data.

If you decide to use multiple sheets for your data, follow the variable naming conventions for the tabs that name the sheets (keep the names simple and unique).

Do not "stack" data on the same sheets.

For example, treated versus non-treated patients can be handled by column variable that has a code for Treated (yes/no).

Avoid using "special" Excel features.

Special features include hidden columns, graphs on the data sheet that is your primary database, colors, italics, or bold. These features can be used on other separate "subset" or "analysis" spreadsheets that are for the investigator, but not the statistician or programmer.

Document your database with a data dictionary and/or codebook.

Documenting your database will help the statistician, programmer, data manager, and yourself, the researcher, understand your data and database. It is a good idea to document what your variables are and what they mean. The data dictionary should include all of the variable names, data type that corresponds to the variable, a label or longer name that describes the variable including the units it is measured in, the codes for any categorical variables, and any notes for the variable. This can be a separate worksheet or document file. Here is an example of a data dictionary.

Ask the statistician, data analyst, or programmer.

Be sure the effort you are putting forth is necessary. The CCORDA member should be able to tell you precisely what form the data needs to be in to suit its conversion and analysis.

Data Dictionary

Variable Name 

Data Type 

Description 

Codes 

Notes 

ID 

Numeric 

Patient ID 

 

 

Treated 

Numeric 

Treatment group 

1=treated, 2=control 

 

Age 

Numeric 

Age (years) 

 

 

Sex 

Character 

Gender 

M=Male, F=Female 

 

Height 

Numeric 

Height (cm) 

 

Blanks=missing data 

Weight 

Numeric 

Weight (kg) 

 

Blanks=missing data 

Proc_date 

Date 

Procedure date 

 

 

systolic_BP 

Numeric 

Systolic blood pressure 

 

Blanks=missing data 

diastolic_BP 

Numeric 

Diastolic blood pressure 

 

Blanks=missing data