In this article, I am going to share a easy and quick way to delete some duplicate rows (e.g. regarding column B) from a spreadsheet but the requirement is to keep the row with the maximum value (e.g. regarding column G).
If you convert the spreadsheet into *.csv (comma separated values), which are just plain text basically, then you can use many other programming languages easily that read text files and parse them into columns of values.
If the given files are in the format of *.XLS or *.XLSX, then it may not be so easy to read/understand these formats. Now my recommendation would be to use the VBScript and make use of the Excel.Application COM object.
To set up the object and read the spread sheet.
Dim objXL
Set objXL = WScript.CreateObject("Excel.Application")
objXL.Visible = TRUE
objXL.Workbooks.Open "data.xlsx"
To check for duplicate rows (regarding column B) while considering the maximum value for column G. Mark rows to delete with ‘X’ (column A).
row = 2 ' skip first line which is column names
Do While True
curtime = Trim(objXL.Range("B" & CStr(row)))
If (Len(curtime) = 0) Then
Exit Do
End If
curg = CDbl(objXL.Range("G" & CStr(row)))
j = row + 1
Do While True
nexttime = Trim(objXL.Range("B" & CStr(j)))
If (Len(nexttime) = 0) Then
Exit Do
End If
If (nexttime = curtime) Then
nextg = CDbl(objXL.Range("G" & CStr(j)))
If (nextg > curg) Then ' mark for deletion later
curg = nextg
objXL.Cells(row, 1) = "x"
Else
objXL.Cells(j, 1) = "x"
End If
Else
Exit Do
End If
j = j + 1
Loop
row = j
Loop
To delete these rows (the ones marked with ‘X’), it should be done in backward order.
For i = row To 2 Step -1
If objXL.Cells(i, 1) = "x" Then
objXL.Rows(i).Delete
End If
Next
Please note that if you run the Excel.Application in background (objXL.Visible = False), then you have to put objXL.Quit otherwise when the script quits, the Excel Process will still live in memory (background process) and you can’t control it, which is a memory leak unless you terminate it in the process monitor.
The full script source code with better user experiences (support command line parameter) is:
' helloacm.com
' check command line parameter
If WScript.Arguments.Count = 0 Then
WScript.Echo "File not given."
WScript.Quit
End If
CSV = Trim(WScript.Arguments.Item(0))
WScript.Echo "Processing " & CSV
' check if file is valid
Set fso = CreateObject("Scripting.FileSystemObject")
If (Not fso.FileExists(CSV)) Then
WScript.Echo "File not found: " & CSV
WScript.Quit
End If
Dim objXL
Set objXL = WScript.CreateObject("Excel.Application")
' in background
objXL.Visible = false
objXL.Workbooks.Open CSV
row = 2
Do While True
curtime = Trim(objXL.Range("B" & CStr(row)))
If (Len(curtime) = 0) Then
Exit Do
End If
curg = CDbl(objXL.Range("G" & CStr(row)))
j = row + 1
Do While True
nexttime = Trim(objXL.Range("B" & CStr(j)))
If (Len(nexttime) = 0) Then
Exit Do
End If
If (nexttime = curtime) Then ' considered duplicate
nextg = CDbl(objXL.Range("G" & CStr(j)))
If (nextg > curg) Then ' keep the maximum
curg = nextg
objXL.Cells(row, 1) = "x"
Else
objXL.Cells(j, 1) = "x"
End If
Else
Exit Do
End If
j = j + 1
Loop
row = j ' jump to next row
Loop
For i = row To 2 Step -1
If objXL.Cells(i, 1) = "x" Then
objXL.Rows(i).Delete
End If
Next
objXL.DisplayAlerts = False ' Default Option
objXL.ActiveWorkbook.SaveAs CSV & "_new.csv"
WScript.Echo "New file saved to: " & CSV & "_new.csv"
' avoid memory leaks
objXL.Quit
Set fso = Nothing
Set objXL = Nothing
The example output:
D:\>cscript csv.vbs D:\data.csv Microsoft (R) Windows Script Host Version 5.8 Copyright (C) Microsoft Corporation. All rights reserved. Processing D:\data.csv New file saved to: D:\data.csv_new.csv D:\>
–EOF (The Ultimate Computing & Technology Blog) —
Last Post: C/C++ Function to Compute the Combination Number
Next Post: Interview Question: The Number of Bits Required to Convert One Integer to Another (C/C++ function)