Preparing Data In Excel
The members of CCORDA have expertise in design and planning of studies, including preparation of data collection forms and database creation. We encourage researchers to include CCORDA in all phases of the study from design to analysis and dissemination of results. Some researchers collect and enter their own data for analysis, which makes accurate data entry critical for the success of the study. We have prepared some helpful hints for entering data into an Excel Workbook for ease in statistical analysis.
Microsoft Excel can be a useful platform to enter and maintain research study data. Researchers can use Excel's simple statistical and plotting functions to help gain insight into their data. However, most research projects require more extensive statistical techniques that can be most easily performed using additional statistical software packages such as SAS or SPSS software.
To have your data easily imported into a statistical software package, we have developed these guidelines for data entry into Excel.
Here is a good example of data entry into an Excel file followed by guidelines for data entry.
Guidelines for a Good Excel Data Sheet
- Include a unique identifying number for each case.
- Be sure that each variable name is unique (no duplicate variable names).
- Variable names must start with a letter.
- Do not include special characters (#, !, ?, %, etc.) or spaces in your variable names.
- Choose readily recognizable names for variables - but not too long (<= 16 characters best).
Don’t enter data such as "120/80" for blood pressure. Enter systolic blood pressure as one variable and diastolic blood pressure as another variable. Don't enter data as "A,C,D" or "BDF" if there are three possible answers to a question. Include a separate column for each answer.
Two digit years can cause problems for statistical software when reading data from Excel files. The best format for dates is mm/dd/yyyy, where mm is a 2 digit month, dd is a 2 digit day and yyyy is a 4 digit year.
Missing data can cause many problems. To enter a missing data value either enter a blank or an "impossible" numeric code (for numbers) or an easily recognizable single digit character code for character (trying to avoid mixing numeric and character data). Be sure, if you use a missing value code, that it cannot be confused with a "real" data value.
When entering data keep the same format throughout.
Good Example
The date variable has the same format (mm/dd/yyyy) and the sex variable is consistent throughout in both case and type (character variable).
ID |
DOB |
Sex |
1 |
12/31/1976 |
F |
2 |
01/01/1977 |
M |
3 |
01/02/1977 |
F |
4 |
01/03/1977 |
F |
5 |
01/04/1977 |
M |
Bad Example
The date variable is in different formats without a 4-digit year for all the observations. The sex variable is still a character variable, but statistical software will read this variable as having six different levels instead of two.
ID |
DOB |
Sex |
1 |
12/31/1976 |
f |
2 |
1-Jan-77 |
m |
3 |
01/02/1977 |
Female |
4 |
01/03/77 |
F |
5 |
01/04/1977 |
Male |
Other Guidelines
If you decide to use multiple sheets for your data, follow the variable naming conventions for the tabs that name the sheets (keep the names simple and unique).
For example, treated versus non-treated patients can be handled by column variable that has a code for Treated (yes/no).
Special features include hidden columns, graphs on the data sheet that is your primary database, colors, italics, or bold. These features can be used on other separate "subset" or "analysis" spreadsheets that are for the investigator, but not the statistician or programmer.
Documenting your database will help the statistician, programmer, data manager, and yourself, the researcher, understand your data and database. It is a good idea to document what your variables are and what they mean. The data dictionary should include all of the variable names, data type that corresponds to the variable, a label or longer name that describes the variable including the units it is measured in, the codes for any categorical variables, and any notes for the variable. This can be a separate worksheet or document file. Here is an example of a data dictionary.
Be sure the effort you are putting forth is necessary. The CCORDA member should be able to tell you precisely what form the data needs to be in to suit its conversion and analysis.
Data Dictionary
Variable Name |
Data Type |
Description |
Codes |
Notes |
ID |
Numeric |
Patient ID |
|
|
Treated |
Numeric |
Treatment group |
1=treated, 2=control |
|
Age |
Numeric |
Age (years) |
|
|
Sex |
Character |
Gender |
M=Male, F=Female |
|
Height |
Numeric |
Height (cm) |
|
Blanks=missing data |
Weight |
Numeric |
Weight (kg) |
|
Blanks=missing data |
Proc_date |
Date |
Procedure date |
|
|
systolic_BP |
Numeric |
Systolic blood pressure |
|
Blanks=missing data |
diastolic_BP |
Numeric |
Diastolic blood pressure |
|
Blanks=missing data |